Skip to content

Fix massive spill files for StringView/BinaryView columns II#21633

Open
adriangb wants to merge 9 commits intoapache:mainfrom
pydantic:fix-stringview-spill-gc-2
Open

Fix massive spill files for StringView/BinaryView columns II#21633
adriangb wants to merge 9 commits intoapache:mainfrom
pydantic:fix-stringview-spill-gc-2

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented Apr 14, 2026

Replaces #19444 which seems stuck.

EeshanBembi and others added 8 commits April 14, 2026 17:31
Add garbage collection for StringView and BinaryView arrays before spilling
to disk. This prevents sliced arrays from carrying their entire original
buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files
instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally
via gc_view_arrays(), making the organize_stringview_arrays() function in
external sort redundant.

Changes:
- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining GC rationale and mechanism
- Use direct array parameter for better type safety
- Maintain early return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than
row counts, providing more accurate and efficient garbage collection
for sliced StringView/BinaryView arrays during spilling.

Tests confirm 80%+ reduction in spill file sizes for pathological cases
like ClickBench (820MB -> 33MB).
- Return post-GC sliced size from append_batch so callers use the
  correct post-GC size for memory accounting (fixes cetra3's
  CHANGES_REQUESTED: max_record_batch_size was measured pre-GC in
  sort.rs and spill_manager.rs)
- Fix incorrect comment claiming Arrow gc() is a no-op; it always
  allocates new compact buffers
- Add comment in should_gc_view_array explaining why we sum
  data_buffers directly instead of using get_buffer_memory_size()
- Enhance append_batch doc comment with GC rationale per reviewer request
- Reduce row counts in heavy GC tests
Address PR review: avoid duplicating data-buffer size calculation by
deriving it from get_buffer_memory_size minus the views buffer.
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 14, 2026
@adriangb adriangb changed the title fix: gc StringView/BinaryView arrays before spilling to prevent write amplification Fix massive spill files for StringView/BinaryView columns rev2 Apr 14, 2026
@adriangb adriangb changed the title Fix massive spill files for StringView/BinaryView columns rev2 Fix massive spill files for StringView/BinaryView columns II Apr 14, 2026
@adriangb adriangb requested a review from alamb April 14, 2026 22:40

[dependencies]
arrow = { workspace = true }
arrow-data = { workspace = true }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.

// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// This is a workaround until https://github.com/apache/arrow-rs/issues/8230
if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is needed for BinaryViewArray, no ?

}

// Always return a new batch for consistency
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be a noticeable gain if the batch is just cloned when there were no garbage collections ?
I.e.

let mut mutated = false;
...
if should_gc_view_array(a_view) {
   mutated = true;
   ...
}
...

if mutated {
    Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
} else {
    Ok(batch.clone())
}

let gc_batch = gc_view_arrays(&batch)?;

// The batch should be unchanged (cloned, not GC'd)
assert_eq!(gc_batch.num_rows(), batch.num_rows());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be true even if there was a GC, no ?
Wouldn't it better to assert let should_gc = should_gc_view_array(string_array); assert!(!should_gc); ?

let has_view_arrays = batch.columns().iter().any(|array| {
matches!(
array.data_type(),
arrow::datatypes::DataType::Utf8View | arrow::datatypes::DataType::BinaryView
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about nested types (List, Map, Union, Dictionary) which contain these views ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants