Track spill read-back memory in SMJ by SubhamSinghal · Pull Request #22103 · apache/datafusion

SubhamSinghal · 2026-05-11T06:57:50Z

Which issue does this PR close?

Follow-up to #21962.

Rationale for this change

After #21962, the memory pool accurately tracks residual join_arrays memory that remains after a BufferedBatch is
spilled to disk. However, when spilled batches are read back from disk during output materialization in
materialize_right_columns, the deserialized data temporarily exists in memory without any pool reservation.

Single-source path: one full batch is loaded without a reservation.
Multi-source interleave path: all referenced spilled batches are loaded simultaneously, leaving N × batch_size bytes untracked.

The pool thinks these batches cost 0 bytes during read-back. Under memory pressure (the reason they were spilled), other
operators see stale headroom and may over-allocate, risking OOM.
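
To make the stale-headroom problem concrete, here is a minimal sketch with a toy pool. `ToyPool`, `untracked_read_back`, the limit, and the byte sizes are all hypothetical. They only mimic the idea of a bounded memory pool and are not DataFusion's `MemoryPool` API.

```rust
/// Hypothetical bounded pool: tracks reserved bytes against a fixed limit.
struct ToyPool {
    limit: usize,
    reserved: usize,
}

impl ToyPool {
    /// Fallible reservation, as another operator would request memory.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.reserved + bytes > self.limit {
            return Err(format!("cannot reserve {} bytes", bytes));
        }
        self.reserved += bytes;
        Ok(())
    }

    /// Bytes the pool believes are still free.
    fn headroom(&self) -> usize {
        self.limit - self.reserved
    }
}

/// Returns (headroom the pool reports, bytes actually free) while
/// `n_batches` spilled batches of `batch_size` bytes each are held in
/// memory without a reservation.
fn untracked_read_back(pool: &ToyPool, n_batches: usize, batch_size: usize) -> (usize, isize) {
    let untracked = n_batches * batch_size;
    let reported = pool.headroom();
    let actual = reported as isize - untracked as isize;
    (reported, actual)
}
```

With a limit of 100 bytes, 60 bytes reserved, and four 8-byte batches being read back, the toy pool reports 40 bytes of headroom while only 8 are really free, so a `try_grow(30)` from another operator succeeds when it should not.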

What changes are included in this PR?

Changed materialize_right_columns from &self to &mut self and added grow/shrink at the exact points where spilled data is read from disk:

Path A (single source spilled):

grow(size_estimation) immediately before fetch_right_columns_by_idxs
shrink(size_estimation) immediately after

Path B (multi-source interleave):

Sum size_estimation for all spilled sources
grow(total) before source_data loading
shrink(total) after interleave completes
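
The two paths above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `ToyReservation`, `ToyBatch`, `read_back_single`, and `read_back_interleave` are hypothetical stand-ins, and the actual disk read and interleave are left as comments.

```rust
/// Hypothetical stand-in for a memory reservation; not DataFusion's type.
#[derive(Default)]
struct ToyReservation {
    reserved: usize,
    peak: usize,
}

impl ToyReservation {
    /// Unconditional grow: always succeeds and records the new peak.
    fn grow(&mut self, bytes: usize) {
        self.reserved += bytes;
        self.peak = self.peak.max(self.reserved);
    }

    /// Releases bytes that were previously reserved with `grow`.
    fn shrink(&mut self, bytes: usize) {
        self.reserved -= bytes;
    }
}

/// Hypothetical buffered batch: only the fields this sketch needs.
struct ToyBatch {
    spilled: bool,
    size_estimation: usize,
}

/// Path A: a single spilled source is read back.
fn read_back_single(res: &mut ToyReservation, batch: &ToyBatch) {
    if batch.spilled {
        res.grow(batch.size_estimation);
        // ... read the spilled batch and take the needed rows here ...
        res.shrink(batch.size_estimation);
    }
}

/// Path B: every spilled source is resident at once during interleave,
/// so the reservation covers the sum of their size estimates.
fn read_back_interleave(res: &mut ToyReservation, sources: &[ToyBatch]) {
    let total: usize = sources
        .iter()
        .filter(|b| b.spilled)
        .map(|b| b.size_estimation)
        .sum();
    res.grow(total);
    // ... load all sources and interleave them here ...
    res.shrink(total);
}
```

In this sketch, batches that were never spilled contribute nothing to the read-back reservation, on the assumption that their memory is already accounted for elsewhere.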

This uses an unconditional grow() because the data must be read to produce output, so there is no fallback. The
rationale is the same as in #21962: if the memory physically exists, the pool must reflect it.

Are these changes tested?

Yes, two new tests:

spill_read_back_memory_accounting: multiple buffered batches for the same key (multi-source, Path B). Verifies that
peak_mem_used >= size_estimation and that pool.reserved() == 0 at the end.
spill_read_back_single_source: distinct keys with one batch per group (single-source, Path A). Same assertions.
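
One way to see why both conditions are needed: a final reservation of zero cannot by itself distinguish "tracked, then released" from "never tracked", so the peak has to be checked as well. The sketch below encodes the two conditions in a hypothetical helper. `RunStats` and `check_accounting` are illustrative names and are not taken from the PR's tests.

```rust
/// Hypothetical summary of one join run; field names mirror the PR text.
struct RunStats {
    peak_mem_used: usize,
    reserved_at_end: usize,
}

/// Checks the two conditions the tests are described as asserting.
fn check_accounting(stats: &RunStats, size_estimation: usize) -> Result<(), String> {
    // The read-back must have been visible to the pool at some point.
    if stats.peak_mem_used < size_estimation {
        return Err(format!(
            "peak {} is below the estimate {}: read-back was not tracked",
            stats.peak_mem_used, size_estimation
        ));
    }
    // Everything reserved during read-back must have been released.
    if stats.reserved_at_end != 0 {
        return Err(format!("{} bytes still reserved", stats.reserved_at_end));
    }
    Ok(())
}
```

A run that never reserves anything ends with zero bytes reserved but fails the peak check, while a run that grows and forgets to shrink passes the peak check and fails the final one.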

Are there any user-facing changes?

No.

Track spill read-back memory in SMJ

432f5c7

github-actions bot added the physical-plan label (Changes to the physical-plan crate) on May 11, 2026

SubhamSinghal wants to merge 1 commit into apache:main from SubhamSinghal:smj-spill-read-back-memory-accounting