Improve coalesce and concat performance for views
#7614
Conversation
Force-pushed from 9ff57e5 to b3d4337, then from b3d4337 to c3648f4.
Title changed: "gc_string_view_batch performance" → "coalesce performance" → "coalesce and concat performance for views"
@alamb FYI
Pull Request Overview
This PR targets performance improvements for view concatenation and coalescing in Arrow by specializing functions and reducing unnecessary memory copies. Key changes include:
- Adding specialized concatenation functions (concat_byte_view and concat_string_view) for view types.
- Refactoring gc_string_view_batch to take ownership of RecordBatches to avoid redundant cloning.
- Updating the GenericByteViewBuilder implementation to use a vector-based buffer instead of a BufferBuilder for better performance and clarity.
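To make the trick behind these specialized functions concrete, here is a hedged, self-contained sketch (plain Rust, not the PR's actual code): when view arrays are concatenated, the data buffers can be appended wholesale, and each packed u128 view only needs its buffer_index shifted by the number of buffers already present.

```rust
// Packed view layout for non-inline values (per the Arrow spec):
//   bits 0..32: length, 32..64: prefix, 64..96: buffer_index, 96..128: offset
// `starting_buffer` is the number of data buffers already in the builder.
fn shift_buffer_index(view: u128, starting_buffer: u32) -> u128 {
    let length = view as u32;
    if length > 12 {
        // extract bits 64..96, bump the buffer index, and splice it back in
        let buffer_index = (view >> 64) as u32 as u128;
        let bumped = (buffer_index + starting_buffer as u128) << 64;
        (view & !(0xFFFF_FFFFu128 << 64)) | bumped
    } else {
        // values of 12 bytes or fewer are stored inline in the view itself,
        // so there is no buffer_index to adjust
        view
    }
}

fn main() {
    let inline = 3u128; // length = 3: data is inline, view unchanged
    assert_eq!(shift_buffer_index(inline, 2), inline);

    let long = 20u128; // length = 20, buffer_index = 0, offset = 0
    assert_eq!((shift_buffer_index(long, 2) >> 64) as u32, 2);
}
```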
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| arrow-select/src/concat.rs | Introduces optimized concat functions for Utf8View and BinaryView. |
| arrow-select/src/coalesce.rs | Refactors gc_string_view_batch for improved efficiency and updates associated tests. |
| arrow-array/src/builder/generic_bytes_view_builder.rs | Replaces BufferBuilder with Vec for views and updates related debug output. |
Comments suppressed due to low confidence (3)
arrow-select/src/concat.rs:89
- Consider adding inline documentation for the concat_byte_view function to explain its purpose, expected input, and output behavior.
fn concat_byte_view(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {
arrow-select/src/coalesce.rs:305
- Add a comment to explain the safety guarantees around using the unsafe StringViewArray::new_unchecked call and why it is acceptable in this context (see the sketch after this list).
let gc_string = unsafe {
arrow-array/src/builder/generic_bytes_view_builder.rs:449
- Update the debug field label from "views_builder" to "views_buffer" so that the debug output accurately reflects the name of the underlying variable.
.field("views_builder", &self.views_buffer)
LOVE IT!
alamb left a comment
This is a thing of Beauty @Dandandan
I kicked off some other benchmarks just to be sure and left some small potential improvements, but I also think this is better than what is currently on main, so we could merge it as is and keep improving in a follow-on.
```diff
 /// using [`GenericByteViewBuilder::try_append_view`]
 pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
-    views_builder: BufferBuilder<u128>,
+    views_buffer: Vec<u128>,
```
this is a great idea 💯
```rust
self.flush_in_progress();
self.completed.extend(array.data_buffers().iter().cloned());

if self.completed.is_empty() {
```
This checks `completed.is_empty` after pushing new data buffers, which I think means the fast path will never be taken. I think the check could be done prior to calling `self.completed.extend`, which would improve performance.
I couldn't resist and tried the following change locally, and saw some additional improvements in concat but also some slowdowns with entirely inlined views, which I don't understand 🤔
```diff
diff --git a/arrow-array/src/builder/generic_bytes_view_builder.rs b/arrow-array/src/builder/generic_bytes_view_builder.rs
index 1e06cd7698..3936ca06d4 100644
--- a/arrow-array/src/builder/generic_bytes_view_builder.rs
+++ b/arrow-array/src/builder/generic_bytes_view_builder.rs
@@ -121,7 +121,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// growing buffer size exponentially from 8KB up to 2MB. The
     /// first buffer allocated is 8KB, then 16KB, then 32KB, etc up to 2MB.
     ///
-    /// If this method is used, any new buffers allocated are 
+    /// If this method is used, any new buffers allocated are
     /// exactly this size. This can be useful for advanced users
     /// that want to control the memory usage and buffer count.
     ///
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let update_views = !(self.completed.is_empty() || array.data_buffers().is_empty());
+
         self.completed.extend(array.data_buffers().iter().cloned());
-        if self.completed.is_empty() {
+        if update_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
```

```
group                                                          coalesce_improvements         coalesce_improvements_fix
-----                                                          ---------------------         -------------------------
concat utf8_view max_str_len=128 null_density=0                1.53   80.4±0.42µs ? ?/sec    1.00   52.5±2.85µs ? ?/sec
concat utf8_view max_str_len=128 null_density=0.2              1.51   86.4±0.34µs ? ?/sec    1.00   57.3±2.97µs ? ?/sec
concat utf8_view max_str_len=20 null_density=0                 1.58   80.2±0.36µs ? ?/sec    1.00   50.6±2.33µs ? ?/sec
concat utf8_view max_str_len=20 null_density=0.2               1.65   98.9±0.36µs ? ?/sec    1.00   59.9±2.19µs ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00   40.6±2.82µs ? ?/sec    1.94   78.7±1.18µs ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00   53.9±2.78µs ? ?/sec    1.95  105.1±0.26µs ? ?/sec
```
Shouldn't this be:
```diff
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let keep_views = self.completed.is_empty() || array.data_buffers().is_empty();
+
         self.completed.extend(array.data_buffers().iter().cloned());
-        if self.completed.is_empty() {
+        if keep_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
```
For me this is an all-positive change 🎉 I really like the >2x improvement on inlined views.
```
concat utf8_view all_inline max_str_len=12 null_density=0
                        time:   [25.775 µs 27.046 µs 28.546 µs]
                        change: [−55.148% −53.600% −51.725%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  11 (11.00%) high severe

concat utf8_view max_str_len=20 null_density=0
                        time:   [53.475 µs 53.934 µs 54.689 µs]
                        change: [−7.1979% −6.4937% −5.5943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

concat utf8_view max_str_len=128 null_density=0
                        time:   [52.999 µs 53.106 µs 53.245 µs]
                        change: [−7.0948% −6.6147% −6.0319%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) high mild
  8 (8.00%) high severe

concat utf8_view all_inline max_str_len=12 null_density=0.2
                        time:   [25.532 µs 25.741 µs 25.943 µs]
                        change: [−56.519% −56.108% −55.710%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild

concat utf8_view max_str_len=20 null_density=0.2
                        time:   [56.062 µs 56.342 µs 56.688 µs]
                        change: [−6.1050% −5.3513% −4.6591%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high severe

concat utf8_view max_str_len=128 null_density=0.2
                        time:   [55.406 µs 55.623 µs 55.891 µs]
                        change: [−6.0536% −5.4045% −4.8059%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe
```
> Shouldn't this be:

🤦
🤔 seems there is no concat benchmark
I also ran this PR with a new concat benchmark, and it also shows pretty sweet gains.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
- Related to #7614
# Rationale for this change
We expect #7614 to improve concat performance, but we don't have any concat benchmarks for StringViewArray. Let's fix that.
# What changes are included in this PR?
1. Add a `concat` benchmark for StringViewArray
# Are there any user-facing changes?
No, this is an internal benchmarking tool only.
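A hedged sketch of what such a criterion benchmark could look like (the data shape, sizes, and bench name are assumptions, not the benchmark actually added):

```rust
use arrow_array::StringViewArray;
use arrow_select::concat::concat;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_concat_utf8_view(c: &mut Criterion) {
    // Strings longer than 12 bytes exercise the out-of-line view path.
    let a: StringViewArray = (0..8192)
        .map(|i| Some(format!("value-{i}-with-enough-padding")))
        .collect();
    let b = a.clone();

    c.bench_function("concat utf8_view", |bench| {
        bench.iter(|| concat(&[&a, &b]).unwrap())
    });
}

criterion_group!(benches, bench_concat_utf8_view);
criterion_main!(benches);
```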
alamb left a comment
Looking great -- I kicked off some more tests
```rust
fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
    let (schema, columns, num_rows) = batch.into_parts();
    let new_columns: Vec<ArrayRef> = columns
        .iter()
```
I don't know if it matters, but in #7620 I simply modified the existing Vec rather than collecting into a new one, to save maybe an allocation:

```rust
for mut c in columns.iter_mut() {
    ...
}
```
AFAIK there is a Rust compiler optimization that will reuse a Vec allocation if using `into_iter` / `collect` (on an owned Vec) 🤔
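A minimal, self-contained illustration of that point (plain std Rust, not arrow-rs code; allocation reuse is an implementation detail of the standard library, not a documented guarantee):

```rust
fn main() {
    let columns: Vec<u64> = (0..1024).collect();
    let ptr_before = columns.as_ptr();

    // `into_iter` consumes the Vec, so `collect` is free to write the mapped
    // values back into the same allocation (same element size and alignment).
    let new_columns: Vec<u64> = columns.into_iter().map(|c| c * 2).collect();

    // In practice these pointers compare equal: no new allocation was made.
    println!("allocation reused: {}", ptr_before == new_columns.as_ptr());
}
```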
This is one of the reasons I think using Vec in arrow-rs makes sense: there are many optimizations like this that are not feasible to implement in arrow-rs itself.
Yeah, I think using Vec is a great idea - it is both simpler and, as can be seen in this PR, often quite a bit faster.
```rust
    })
    .collect();

let buffers = if buffer.is_empty() {
```
I think the only way this could happen is if `ideal_buffer_size` is zero, right? In that case we could make a more efficient inner loop that simply copied the views without checking length, etc:

```rust
if ideal_buffer_size == 0 {
    // just reuse views as is...
}
```
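A hedged, self-contained sketch of that fast path (the names are stand-ins, not the PR's actual code): when `ideal_buffer_size` is zero, every value fits inline in its view (12 bytes or fewer), so no data buffers exist and the packed views can be copied verbatim instead of being rewritten one by one.

```rust
fn gc_views(src_views: &[u128], ideal_buffer_size: usize) -> Vec<u128> {
    if ideal_buffer_size == 0 {
        // all values are inline: nothing references a data buffer, so the
        // views can be reused as-is -- effectively a single memcpy
        src_views.to_vec()
    } else {
        // general path: rewrite offset/buffer_index per view, as in the
        // snippet discussed further down this review
        unimplemented!()
    }
}
```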
Added 🚀
```rust
let views: Vec<u128> = s
    .views()
    .as_ref()
    .iter()
    .cloned()
    .map(|v| {
        let mut b: ByteView = ByteView::from(v);

        if b.length > 12 {
            let offset = buffer.len() as u32;
            buffer.extend_from_slice(
                buffers[b.buffer_index as usize]
                    .get(b.offset as usize..b.offset as usize + b.length as usize)
                    .expect("Invalid buffer slice"),
            );
            b.offset = offset;
            b.buffer_index = 0; // Set buffer index to 0, as we only have one buffer
        }

        b.into()
    })
    .collect();
```
I'm curious whether it would be better to implement it like this:

```rust
let mut views = Vec::with_capacity(s.views().len());
for &v in s.views().as_ref() {
    let mut b: ByteView = ByteView::from(v);
    if b.length > 12 {
        ...
    }
    views.push(b.into());
}
```

The vector pre-allocates and there is no clone 🤔
No, that is generally worse - `.iter().collect()` on Rust core types optimizes to better code:

- it knows the exact final length based on the slice and does some extra optimizations based on that ([`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html))
- `push` generally needs to check whether the item fits the capacity on each call (so it adds some branching)

The `cloned()` you see in `iter().cloned()` will generally not make a difference for the generated code on primitive types.
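To make the comparison concrete, a hedged, self-contained illustration in plain std Rust (not arrow-rs code):

```rust
// Slices report an exact length (TrustedLen), so `collect` can allocate once
// and write elements without per-element capacity checks.
fn via_collect(src: &[u128]) -> Vec<u128> {
    src.iter().cloned().map(|v| v.wrapping_add(1)).collect()
}

// `push` must check capacity on every call, adding a branch per element,
// even though the Vec is pre-allocated with `with_capacity`.
fn via_push(src: &[u128]) -> Vec<u128> {
    let mut out = Vec::with_capacity(src.len());
    for &v in src {
        out.push(v.wrapping_add(1));
    }
    out
}

fn main() {
    let src = vec![1u128, 13, 42];
    assert_eq!(via_collect(&src), via_push(&src));
}
```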
Thanks for the reviews @alamb - let's continue making it faster!
#7623) This reverts commit 7739a83.
# Which issue does this PR close?
# Rationale for this change
I found these errors in DataFusion (see apache/datafusion#16249 (comment)), so let's revert it and find the error.
# What changes are included in this PR?
# Are there any user-facing changes?
…apache#7614)" (apache#7623)" This reverts commit da461c8.
#7625) … (#7614)" (#7623)" This reverts commit da461c8.

This adds a test and fix for the wrong index issue. I also verified the change for DataFusion (and benchmarks show notable improvements).
# Which issue does this PR close?
Closes #NNN.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
Which issue does this PR close?
- `coalesce` and `concat` for views: #7615
- `coalesce` kernel and `BatchCoalescer` for statefully combining selected batches: #7597

Rationale for this change
Improve performance of `gc_string_view_batch`

What changes are included in this PR?

Are there any user-facing changes?
no