fix: gc StringView/BinaryView arrays before spilling to prevent write amplification#21325
fix: gc StringView/BinaryView arrays before spilling to prevent write amplification#21325damahua wants to merge 2 commits intoapache:mainfrom
Conversation
… amplification After operations like `take` or `interleave`, view-type arrays retain shared references to large original data buffers. When these batches are written to spill files individually, the IPC writer duplicates all referenced buffers for every batch — measured at 8.5× write amplification with 10 chunks of 1000 rows from a 10,000-row StringView dataset. Changes: - Add `gc_view_arrays()` utility in spill/mod.rs that compacts both StringViewArray and BinaryViewArray in a RecordBatch - Apply gc to hash aggregation spill path (IncrementalSortIterator output chunks share parent batch buffers via take_record_batch) - Apply gc to sort-merge join bitwise_stream spill path (inner_key_buffer contains sliced batches sharing original batch buffers) - Extend sort operator's organize_stringview_arrays to also handle BinaryViewArray (was previously StringView-only) The sort operator already had StringView gc (organize_stringview_arrays), but the hash aggregation and sort-merge join spill paths were missing it. This is the same class of bug fixed by PR apache#19444 for sort spilling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…B test Adds a targeted benchmark that exercises the hash aggregation spill path with StringViewArray columns (Utf8View, non-inline 50+ byte strings). Uses EXPLAIN ANALYZE to capture spill_count and spilled_bytes metrics. A/B results (20 MB pool, 100K rows, 50K groups, N=3): Baseline: 39.50 MB spilled (5× write amplification from shared buffers) Optimized: 7.90 MB spilled (80% reduction) Query time: unchanged (~320 ms) With 8 MB pool: baseline OOMs during sort reservation, optimized succeeds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
614ad8e to
4cff4e7
Compare
| mean_spill / 1024.0 / 1024.0, | ||
| stddev_spill / 1024.0 / 1024.0, | ||
| ); | ||
| eprintln!("Individual spill bytes: {:?}", spill_bytes_vec); |
There was a problem hiding this comment.
Unless we make assertions on the sizes (i.e. if all we do is print them out) we won't catch regressions.
Do the numbers change? If not we could make it an SLT test.
| /// | ||
| /// Returns the batch unchanged (no allocation) when it contains no view-type | ||
| /// columns. | ||
| pub fn gc_view_arrays(batch: &RecordBatch) -> Result<RecordBatch> { |
There was a problem hiding this comment.
Could this return Option<RecordBatch> to indicate if no work was done? Then it could be used from organize_stringview_arrays
| @@ -0,0 +1,295 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
We should commit this as it's own PR so that it shows the improvement
|
run benchmarks env:
DATAFUSION_RUNTIME_MEMORY_LIMIT: 500M |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing gc-view-arrays-spill (4cff4e7) to 2818abb (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing gc-view-arrays-spill (4cff4e7) to 2818abb (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing gc-view-arrays-spill (4cff4e7) to 2818abb (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
Which issue does this PR close?
Related to #19444 (sort spill StringView gc) and #20500 (repartition StringView gc). This PR extends the same fix to hash aggregation and sort-merge join spill paths, and adds BinaryViewArray support to the sort operator.
Rationale
After operations like
takeorslice,StringViewArrayandBinaryViewArrayretain shared references to all original data buffers. When these batches are written to spill files individually, the IPC writer must include every referenced buffer for every batch, causing massive write amplification.The sort operator already had a fix for this (
organize_stringview_arraysinsort.rs), but the hash aggregation and sort-merge join spill paths were missing it. Additionally, the sort operator's fix only handledStringViewArray, notBinaryViewArray.Hash aggregation spill path
In
row_hash.rs,IncrementalSortIteratorproduces output chunks viatake_record_batch. Each chunk shares the same StringView data buffers as the parent emitted batch. Withoutgc(), spilling N chunks writes N copies of all shared buffers.Sort-merge join spill path
In
bitwise_stream.rs,inner_key_buffercontains sliced batches that share StringView data buffers with the original unsliced batches.Changes
gc_view_arrays()utility inspill/mod.rs— compacts bothStringViewArrayandBinaryViewArrayin aRecordBatch, returning the batch unchanged (no allocation) when no view-type columns existrow_hash.rs) — gc eachIncrementalSortIteratoroutput batch before writing to the spill filebitwise_stream.rs) — gc slicedinner_key_bufferbatches before spillingsort.rs) — extended existingorganize_stringview_arraysto also handleBinaryViewArrayA/B Benchmark Results
Workload:
SELECT group_key, COUNT(*), SUM(value) FROM t GROUP BY group_keygroup_key:Utf8View(StringViewArray) with 50+ byte non-inline stringsWith a tighter 8 MB pool: baseline OOMs (
ResourcesExhausted: Failed to allocate additional 10.4 MB for GroupedHashAggregateStream) because inflated StringView buffers cause the sort memory estimate to exceed the pool. Optimized completes successfully (5 spills, 20.9 MB spilled).Tests
datafusion-physical-planlib tests pass (1 pre-existing zstd feature failure)memory_limitintegration tests passsort_merge_jointests passtest_gc_view_arrays_reduces_spill_size— verifies gc compacts taken StringView/BinaryView batchestest_gc_view_arrays_write_amplification— demonstrates 8.5× write amplification without gctest_gc_view_arrays_noop_for_non_view_types— verifies no overhead for non-view typesbench_stringview_aggregate_spill— end-to-end benchmark with EXPLAIN ANALYZE metrics