Fix massive spill files for StringView/BinaryView columns II #21633

adriangb wants to merge 9 commits into apache:main
Conversation
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use a simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414, where GROUP BY on StringView columns created 820MB spill files instead of 33MB because sliced arrays kept references to their original buffers. Testing shows an 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant.

Changes:
- Remove the organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for the early return (cheaper than creating a new batch)
- Use the arrow_data::MAX_INLINE_VIEW_LEN constant instead of a custom constant
- Update the comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining the GC rationale and mechanism
- Use a direct array parameter for better type safety
- Keep the early-return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than row counts, providing more accurate and efficient garbage collection for sliced StringView/BinaryView arrays during spilling. Tests confirm an 80%+ reduction in spill file sizes for pathological cases like ClickBench (820MB -> 33MB).
- Return the post-GC sliced size from append_batch so callers use the correct post-GC size for memory accounting (fixes cetra3's CHANGES_REQUESTED review: max_record_batch_size was measured pre-GC in sort.rs and spill_manager.rs)
- Fix an incorrect comment claiming Arrow's gc() is a no-op; it always allocates new compact buffers
- Add a comment in should_gc_view_array explaining why we sum data_buffers directly instead of using get_buffer_memory_size()
- Expand the append_batch doc comment with the GC rationale, per reviewer request
- Reduce row counts in heavy GC tests
Address PR review: avoid duplicating the data-buffer size calculation by deriving it as get_buffer_memory_size minus the views buffer.
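The size derivation in the commit above can be sketched as plain arithmetic. This is a minimal illustrative stand-in, not code from the PR: the names `should_gc`, `GC_THRESHOLD_BYTES`, and `VIEW_WIDTH_BYTES` are hypothetical, and the real implementation reads these sizes from the Arrow array (e.g. via `get_buffer_memory_size()`) rather than taking them as parameters.

```rust
// Hypothetical sketch of the 10KB threshold check; constants and the
// function signature are illustrative, not the PR's actual API.
const GC_THRESHOLD_BYTES: usize = 10 * 1024; // 10KB memory threshold
const VIEW_WIDTH_BYTES: usize = 16; // each u128 view entry occupies 16 bytes

/// Decide whether a view array is worth compacting before spilling.
/// `total_buffer_bytes` plays the role of `get_buffer_memory_size()`;
/// subtracting the views buffer leaves only the variable-length data buffers,
/// which is the part that can retain garbage after slicing.
fn should_gc(num_rows: usize, total_buffer_bytes: usize) -> bool {
    let views_bytes = num_rows * VIEW_WIDTH_BYTES;
    let data_bytes = total_buffer_bytes.saturating_sub(views_bytes);
    data_bytes >= GC_THRESHOLD_BYTES
}
```

The point of deriving the data-buffer size this way is that it avoids a second pass over `data_buffers` while matching the semantics "bytes needed if we materialized exactly this slice into fresh buffers".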
```toml
[dependencies]
arrow = { workspace = true }
arrow-data = { workspace = true }
```
This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.
```rust
// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// This is a workaround until https://github.com/apache/arrow-rs/issues/8230
if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
```
The same is needed for BinaryViewArray, no?
```rust
}

// Always return a new batch for consistency
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
```
Would there be a noticeable gain if the batch is just cloned when there were no garbage collections? I.e.

```rust
let mut mutated = false;
...
if should_gc_view_array(a_view) {
    mutated = true;
    ...
}
...
if mutated {
    Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
} else {
    Ok(batch.clone())
}
```

```rust
let gc_batch = gc_view_arrays(&batch)?;

// The batch should be unchanged (cloned, not GC'd)
assert_eq!(gc_batch.num_rows(), batch.num_rows());
```
This would be true even if there was a GC, no?
Wouldn't it be better to assert `let should_gc = should_gc_view_array(string_array); assert!(!should_gc);`?
```rust
let has_view_arrays = batch.columns().iter().any(|array| {
    matches!(
        array.data_type(),
        arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView
```
What about nested types (List, Map, Union, Dictionary) which contain these views ?
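The reviewer's concern about nested types can be illustrated with a recursive walk over the type tree. This is a hedged sketch: the `DataType` enum below is a simplified stand-in for Arrow's (which has many more variants, including Map and Union), and `contains_view` is a hypothetical helper, not something from the PR.

```rust
// Simplified stand-in for arrow::datatypes::DataType; the real enum has
// many more variants (Map, Union, LargeList, ...), which would need the
// same recursive treatment.
enum DataType {
    Utf8View,
    BinaryView,
    Int64,
    List(Box<DataType>),
    Dictionary(Box<DataType>, Box<DataType>),
    Struct(Vec<DataType>),
}

/// Recursively check whether a type contains a view type anywhere, instead
/// of only matching the top-level data type as the diff above does.
fn contains_view(dt: &DataType) -> bool {
    match dt {
        DataType::Utf8View | DataType::BinaryView => true,
        DataType::List(child) => contains_view(child),
        // Only the value type of a dictionary can hold view data
        DataType::Dictionary(_, value) => contains_view(value),
        DataType::Struct(fields) => fields.iter().any(contains_view),
        DataType::Int64 => false,
    }
}
```

With such a check, a `List(Utf8View)` column would also be considered for GC, whereas the top-level `matches!` in the diff would skip it.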
Replaces #19444, which seems stuck.