Fix massive spill files for StringView/BinaryView columns II #21633

adriangb wants to merge 9 commits into apache:main
Conversation
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use a simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414, where GROUP BY on StringView columns created 820MB spill files instead of 33MB because sliced arrays kept references to their original buffers. Testing shows an 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant.

Changes:
- Remove the organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for the early return (cheaper than creating a new batch)
- Use the arrow_data::MAX_INLINE_VIEW_LEN constant instead of a custom constant
- Update the comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining the GC rationale and mechanism
- Use a direct array parameter for better type safety
- Keep the early-return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than row counts, providing more accurate and efficient garbage collection for sliced StringView/BinaryView arrays during spilling. Tests confirm an 80%+ reduction in spill file sizes for pathological cases like ClickBench (820MB -> 33MB).
- Return the post-GC sliced size from append_batch so callers use the correct post-GC size for memory accounting (fixes cetra3's CHANGES_REQUESTED review: max_record_batch_size was measured pre-GC in sort.rs and spill_manager.rs)
- Fix an incorrect comment claiming Arrow's gc() is a no-op; it always allocates new compact buffers
- Add a comment in should_gc_view_array explaining why we sum data_buffers directly instead of using get_buffer_memory_size()
- Expand the append_batch doc comment with the GC rationale, per reviewer request
- Reduce row counts in heavy GC tests
Address PR review: avoid duplicating the data-buffer size calculation by deriving it as get_buffer_memory_size minus the views buffer.
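The size derivation in the commit above can be sketched as plain arithmetic. This is a minimal illustrative stand-in, not code from the PR: the names `should_gc`, `GC_THRESHOLD_BYTES`, and `VIEW_WIDTH_BYTES` are hypothetical, and the real implementation reads these sizes from the Arrow array (e.g. via `get_buffer_memory_size()`) rather than taking them as parameters.

```rust
// Hypothetical sketch of the 10KB threshold check; constants and the
// function signature are illustrative, not the PR's actual API.
const GC_THRESHOLD_BYTES: usize = 10 * 1024; // 10KB memory threshold
const VIEW_WIDTH_BYTES: usize = 16; // each u128 view entry occupies 16 bytes

/// Decide whether a view array is worth compacting before spilling.
/// `total_buffer_bytes` plays the role of `get_buffer_memory_size()`;
/// subtracting the views buffer leaves only the variable-length data buffers,
/// which is the part that can retain garbage after slicing.
fn should_gc(num_rows: usize, total_buffer_bytes: usize) -> bool {
    let views_bytes = num_rows * VIEW_WIDTH_BYTES;
    let data_bytes = total_buffer_bytes.saturating_sub(views_bytes);
    data_bytes >= GC_THRESHOLD_BYTES
}
```

The point of deriving the data-buffer size this way is that it avoids a second pass over `data_buffers` while matching the semantics "bytes needed if we materialized exactly this slice into fresh buffers".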
```toml
[dependencies]
arrow = { workspace = true }
arrow-data = { workspace = true }
```
This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.
```rust
// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// This is a workaround until https://github.com/apache/arrow-rs/issues/8230
if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
```
The same is needed for BinaryViewArray, no?
```rust
}

// Always return a new batch for consistency
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
```
Would there be a noticeable gain if the batch is just cloned when there were no garbage collections? I.e.

```rust
let mut mutated = false;
...
if should_gc_view_array(a_view) {
    mutated = true;
    ...
}
...
if mutated {
    Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
} else {
    Ok(batch.clone())
}
```

```rust
let gc_batch = gc_view_arrays(&batch)?;

// The batch should be unchanged (cloned, not GC'd)
assert_eq!(gc_batch.num_rows(), batch.num_rows());
```
This would be true even if there was a GC, no?
Wouldn't it be better to assert `let should_gc = should_gc_view_array(string_array); assert!(!should_gc);`?
```rust
let has_view_arrays = batch.columns().iter().any(|array| {
    matches!(
        array.data_type(),
        arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView
```
What about nested types (List, Map, Union, Dictionary) which contain these views ?
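The reviewer's concern about nested types can be illustrated with a recursive walk over the type tree. This is a hedged sketch: the `DataType` enum below is a simplified stand-in for Arrow's (which has many more variants, including Map and Union), and `contains_view` is a hypothetical helper, not something from the PR.

```rust
// Simplified stand-in for arrow::datatypes::DataType; the real enum has
// many more variants (Map, Union, LargeList, ...), which would need the
// same recursive treatment.
enum DataType {
    Utf8View,
    BinaryView,
    Int64,
    List(Box<DataType>),
    Dictionary(Box<DataType>, Box<DataType>),
    Struct(Vec<DataType>),
}

/// Recursively check whether a type contains a view type anywhere, instead
/// of only matching the top-level data type as the diff above does.
fn contains_view(dt: &DataType) -> bool {
    match dt {
        DataType::Utf8View | DataType::BinaryView => true,
        DataType::List(child) => contains_view(child),
        // Only the value type of a dictionary can hold view data
        DataType::Dictionary(_, value) => contains_view(value),
        DataType::Struct(fields) => fields.iter().any(contains_view),
        DataType::Int64 => false,
    }
}
```

With such a check, a `List(Utf8View)` column would also be considered for GC, whereas the top-level `matches!` in the diff would skip it.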
Replaces #19444, which seems stuck.