Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jun 5, 2025

Which issue does this PR close?

Rationale for this change

Improve performance of gc_string_view_batch

filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.001       1.00     30.4±1.05ms        ? ?/sec    1.29     39.3±0.88ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.01        1.00      4.3±0.17ms        ? ?/sec    1.20      5.2±0.15ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.1         1.00  1805.1±25.77µs        ? ?/sec    1.32      2.4±0.20ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.8         1.00      2.6±0.12ms        ? ?/sec    1.48      3.8±0.11ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.001     1.00     42.5±0.48ms        ? ?/sec    1.23     52.2±1.33ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.01      1.00      5.8±0.12ms        ? ?/sec    1.28      7.4±0.20ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.1       1.00      2.2±0.02ms        ? ?/sec    1.37      3.1±0.18ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.8       1.00      3.6±0.15ms        ? ?/sec    1.43      5.1±0.12ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001      1.00     51.0±0.59ms        ? ?/sec    1.38     70.3±1.11ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01       1.00      6.7±0.03ms        ? ?/sec    1.32      8.8±0.16ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1        1.00      3.0±0.01ms        ? ?/sec    1.41      4.3±0.09ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8        1.00      4.5±0.34ms        ? ?/sec    1.71      7.7±0.28ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001    1.00     64.2±0.74ms        ? ?/sec    1.33     85.1±1.52ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01     1.00      9.4±0.09ms        ? ?/sec    1.35     12.6±0.26ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1      1.00      3.8±0.03ms        ? ?/sec    1.46      5.6±0.11ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8      1.00      5.7±0.28ms        ? ?/sec    1.73      9.9±0.27ms        ? ?/sec

What changes are included in this PR?

  • Avoiding recreating the views from scratch.
  • Specialize concat for view types
  • Takes owned RecordBatch (effect on performance is small, might be measurable with smaller batch size / more columns).

Are there any user-facing changes?

no

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 5, 2025
@Dandandan Dandandan force-pushed the coalesce_improvements branch from 9ff57e5 to b3d4337 Compare June 5, 2025 16:48
@Dandandan Dandandan force-pushed the coalesce_improvements branch from b3d4337 to c3648f4 Compare June 5, 2025 17:48
@Dandandan Dandandan marked this pull request as ready for review June 5, 2025 17:48
@Dandandan Dandandan changed the title Improve coalesce performance Improve gc_string_view_batch performance Jun 5, 2025
@Dandandan Dandandan changed the title Improve gc_string_view_batch performance Improve coalesce performance Jun 6, 2025
@Dandandan Dandandan changed the title Improve coalesce performance Improve coalesce and concat performance for views Jun 6, 2025
@Dandandan
Copy link
Contributor Author

@alamb FYI

@Dandandan Dandandan requested a review from Copilot June 6, 2025 12:24
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR targets performance improvements for view concatenation and coalescing in Arrow by specializing functions and reducing unnecessary memory copies. Key changes include:

  • Adding specialized concatenation functions (concat_byte_view and concat_string_view) for view types.
  • Refactoring gc_string_view_batch to take ownership of RecordBatches to avoid redundant cloning.
  • Updating the GenericByteViewBuilder implementation to use a vector-based buffer instead of a BufferBuilder for better performance and clarity.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
arrow-select/src/concat.rs Introduces optimized concat functions for Utf8View and BinaryView.
arrow-select/src/coalesce.rs Refactors gc_string_view_batch for improved efficiency and updates associated tests.
arrow-array/src/builder/generic_bytes_view_builder.rs Replaces BufferBuilder with Vec for views and updates related debug output.
Comments suppressed due to low confidence (3)

arrow-select/src/concat.rs:89

  • Consider adding inline documentation for the concat_byte_view function to explain its purpose, expected input, and output behavior.
fn concat_byte_view(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {

arrow-select/src/coalesce.rs:305

  • Add a comment to explain the safety guarantees around using the unsafe StringViewArray::new_unchecked call and why it is acceptable in this context.
let gc_string = unsafe {

arrow-array/src/builder/generic_bytes_view_builder.rs:449

  • Update the debug field label from "views_builder" to "views_buffer" so that the debug output accurately reflects the name of the underlying variable.
.field("views_builder", &self.views_buffer)

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

LOVE IT!

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (7b408ad) to f92ff18 diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a thing of Beauty @Dandandan

I kicked off some other benchmarks just to be sure and left some small potential improvements, but I also think this is better than what is currently on main so we could merge it as is and keep improving as a follow on

/// using [`GenericByteViewBuilder::try_append_view`]
pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
views_builder: BufferBuilder<u128>,
views_buffer: Vec<u128>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great idea 💯

self.flush_in_progress();
self.completed.extend(array.data_buffers().iter().cloned());

if self.completed.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks completed.is_empty after pushing new data buffers, which I think means the fast path will never be taken. I think the check could be done prior to calling self.completed.extend and improve performance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't resist and tried the following change locally and saw some additional improvements in concat but also some slowdowns with entirely inlined views, which I don't understand 🤔

diff --git a/arrow-array/src/builder/generic_bytes_view_builder.rs b/arrow-array/src/builder/generic_bytes_view_builder.rs
index 1e06cd7698..3936ca06d4 100644
--- a/arrow-array/src/builder/generic_bytes_view_builder.rs
+++ b/arrow-array/src/builder/generic_bytes_view_builder.rs
@@ -121,7 +121,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// growing buffer size exponentially from 8KB up to 2MB. The
     /// first buffer allocated is 8KB, then 16KB, then 32KB, etc up to 2MB.
     ///
-    /// If this method is used, any new buffers allocated are
+    /// If this method is used, any new buffers allocated are
     /// exactly this size. This can be useful for advanced users
     /// that want to control the memory usage and buffer count.
     ///
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let update_views = !(self.completed.is_empty() || array.data_buffers().is_empty());
+
         self.completed.extend(array.data_buffers().iter().cloned());

-        if self.completed.is_empty() {
+        if update_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
group                                                          coalesce_improvements                  coalesce_improvements_fix
-----                                                          ---------------------                  -------------------------
concat utf8_view  max_str_len=128 null_density=0               1.53     80.4±0.42µs        ? ?/sec    1.00     52.5±2.85µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.51     86.4±0.34µs        ? ?/sec    1.00     57.3±2.97µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.58     80.2±0.36µs        ? ?/sec    1.00     50.6±2.33µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.65     98.9±0.36µs        ? ?/sec    1.00     59.9±2.19µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00     40.6±2.82µs        ? ?/sec    1.94     78.7±1.18µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00     53.9±2.78µs        ? ?/sec    1.95    105.1±0.26µs        ? ?/sec

Copy link
Contributor Author

@Dandandan Dandandan Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be:

@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let keep_views = self.completed.is_empty() || array.data_buffers().is_empty();
+
         self.completed.extend(array.data_buffers().iter().cloned());

-        if self.completed.is_empty() {
+        if keep_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;

Copy link
Contributor Author

@Dandandan Dandandan Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me this is a all-positive change 🎉 I really like the >2x improvement on inlined views.

concat utf8_view all_inline max_str_len=12 null_density=0
                        time:   [25.775 µs 27.046 µs 28.546 µs]
                        change: [−55.148% −53.600% −51.725%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  11 (11.00%) high severe

concat utf8_view  max_str_len=20 null_density=0
                        time:   [53.475 µs 53.934 µs 54.689 µs]
                        change: [−7.1979% −6.4937% −5.5943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

concat utf8_view  max_str_len=128 null_density=0
                        time:   [52.999 µs 53.106 µs 53.245 µs]
                        change: [−7.0948% −6.6147% −6.0319%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) high mild
  8 (8.00%) high severe

concat utf8_view all_inline max_str_len=12 null_density=0.2
                        time:   [25.532 µs 25.741 µs 25.943 µs]
                        change: [−56.519% −56.108% −55.710%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild

concat utf8_view  max_str_len=20 null_density=0.2
                        time:   [56.062 µs 56.342 µs 56.688 µs]
                        change: [−6.1050% −5.3513% −4.6591%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high severe

concat utf8_view  max_str_len=128 null_density=0.2
                        time:   [55.406 µs 55.623 µs 55.891 µs]
                        change: [−6.0536% −5.4045% −4.8059%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be:

🤦

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖: Benchmark completed

Details

group                                                   coalesce_improvements                  main
-----                                                   ---------------------                  ----
concat 1024 arrays boolean 4                            1.00     28.7±0.04µs        ? ?/sec    1.01     28.9±0.05µs        ? ?/sec
concat 1024 arrays i32 4                                1.00     15.9±0.05µs        ? ?/sec    1.02     16.2±0.03µs        ? ?/sec
concat 1024 arrays str 4                                1.03     58.4±0.44µs        ? ?/sec    1.00     56.9±0.40µs        ? ?/sec
concat boolean 1024                                     1.00    437.8±1.03ns        ? ?/sec    1.04    457.4±2.22ns        ? ?/sec
concat boolean 8192 over 100 arrays                     1.00     50.9±0.08µs        ? ?/sec    1.00     51.0±0.06µs        ? ?/sec
concat boolean nulls 1024                               1.00    775.3±4.38ns        ? ?/sec    1.02    789.7±0.85ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays               1.00    109.7±0.17µs        ? ?/sec    1.00    109.7±0.20µs        ? ?/sec
concat fixed size lists                                 1.00   714.2±17.89µs        ? ?/sec    1.01   721.3±17.25µs        ? ?/sec
concat i32 1024                                         1.01    446.1±0.77ns        ? ?/sec    1.00    439.6±0.93ns        ? ?/sec
concat i32 8192 over 100 arrays                         1.00   206.2±11.75µs        ? ?/sec    1.03    211.9±5.68µs        ? ?/sec
concat i32 nulls 1024                                   1.00    752.6±1.87ns        ? ?/sec    1.02    764.4±1.02ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                   1.00   286.0±11.26µs        ? ?/sec    1.03    294.2±8.00µs        ? ?/sec
concat str 1024                                         1.00     13.2±0.88µs        ? ?/sec    1.05     13.8±1.19µs        ? ?/sec
concat str 8192 over 100 arrays                         1.01    102.5±0.99ms        ? ?/sec    1.00    101.9±0.92ms        ? ?/sec
concat str nulls 1024                                   1.01      6.9±0.71µs        ? ?/sec    1.00      6.8±0.81µs        ? ?/sec
concat str nulls 8192 over 100 arrays                   1.00     52.3±0.42ms        ? ?/sec    1.00     52.1±0.47ms        ? ?/sec
concat str_dict 1024                                    1.00      2.8±0.01µs        ? ?/sec    1.00      2.8±0.02µs        ? ?/sec
concat str_dict_sparse 1024                             1.00      6.9±0.02µs        ? ?/sec    1.01      6.9±0.03µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2    1.00      6.4±0.15µs        ? ?/sec    1.03      6.6±0.18µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (7b408ad) to f92ff18 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖: Benchmark completed

Details

group                                                            coalesce_improvements                  main
-----                                                            ---------------------                  ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001           1.00    256.9±2.38ms        ? ?/sec    1.15    296.4±1.90ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01            1.00      8.9±0.10ms        ? ?/sec    1.02      9.1±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1             1.00      4.4±0.11ms        ? ?/sec    1.01      4.4±0.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8             1.00      3.5±0.03ms        ? ?/sec    1.00      3.5±0.03ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001         1.00    252.2±2.03ms        ? ?/sec    1.13    285.7±2.30ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01          1.00     10.4±0.09ms        ? ?/sec    1.02     10.5±0.08ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1           1.00      4.9±0.11ms        ? ?/sec    1.01      4.9±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8           1.00      4.7±0.02ms        ? ?/sec    1.01      4.7±0.02ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001           1.01     67.6±2.03ms        ? ?/sec    1.00     67.2±0.53ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01            1.00     12.9±0.23ms        ? ?/sec    1.00     12.8±0.17ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1             1.00     10.1±0.40ms        ? ?/sec    1.00     10.0±0.23ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8             1.33     10.8±0.27ms        ? ?/sec    1.00      8.1±0.13ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001         1.00     83.6±0.57ms        ? ?/sec    1.05     87.8±0.64ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01          1.02     14.9±0.13ms        ? ?/sec    1.00     14.6±0.13ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1           1.02     10.7±0.27ms        ? ?/sec    1.00     10.4±0.28ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8           1.02     10.3±0.19ms        ? ?/sec    1.00     10.1±0.21ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.001       1.00     60.5±0.53ms        ? ?/sec    1.30     78.5±0.94ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.01        1.00      7.4±0.03ms        ? ?/sec    1.23      9.1±0.03ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.1         1.00      3.2±0.05ms        ? ?/sec    1.31      4.2±0.16ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.8         1.00      2.7±0.01ms        ? ?/sec    1.71      4.6±0.02ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.001     1.00     79.4±0.47ms        ? ?/sec    1.19     94.4±1.03ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.01      1.00     11.0±0.04ms        ? ?/sec    1.22     13.4±0.06ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.1       1.00      4.5±0.07ms        ? ?/sec    1.36      6.2±0.17ms        ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.8       1.00      5.2±0.02ms        ? ?/sec    1.48      7.7±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001      1.00     85.9±0.32ms        ? ?/sec    1.49    127.6±0.56ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01       1.00     11.5±0.03ms        ? ?/sec    1.34     15.4±0.08ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1        1.00      5.3±0.23ms        ? ?/sec    1.42      7.5±0.15ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8        1.00      4.2±0.03ms        ? ?/sec    2.16      9.1±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001    1.00    113.3±1.07ms        ? ?/sec    1.34    152.3±1.56ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01     1.00     15.8±0.09ms        ? ?/sec    1.35     21.3±0.07ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1      1.00      6.7±0.10ms        ? ?/sec    1.54     10.4±0.22ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8      1.00      6.8±0.03ms        ? ?/sec    1.89     12.8±0.04ms        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (7b408ad) to f92ff18 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖: Benchmark completed

🤔 seems there is no concat benchmark for StringView. I may go try and add one later todya

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

I also ran this PR with a new concat benchmark in

And it also shows pretty sweet gains

group                                                                         coalesce_improvements                  main
-----                                                                         ---------------------                  ----
concat utf8_view  max_str_len=128 null_density=0                              1.00     80.3±0.52µs        ? ?/sec    1.12     89.6±0.27µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2                            1.00     86.1±0.48µs        ? ?/sec    1.73    148.9±0.75µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                               1.00     79.9±0.34µs        ? ?/sec    1.24     99.2±0.23µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2                             1.00     98.4±0.30µs        ? ?/sec    1.48    145.5±0.52µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0                     1.00     40.2±3.69µs        ? ?/sec    1.80     72.4±0.71µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2                   1.00     52.0±2.95µs        ? ?/sec    1.73     90.0±0.98µs        ? ?/sec

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Dandandan pushed a commit that referenced this pull request Jun 6, 2025
# Which issue does this PR close?

- Related to #7614

# Rationale for this change

We expect #7614 to improve concat
performance, but we don't have any concat benchmarks for StringViewArray

Let's fix that

# What changes are included in this PR?
1. Add a `concat` benchmark for StringViewArray

# Are there any user-facing changes?

No this is an internal benchmarking tool only
@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (a94a096) to 44d7194 diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great -- I kicked off some more tests

.iter()
fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
let (schema, columns, num_rows) = batch.into_parts();
let new_columns: Vec<ArrayRef> = columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it matters but in #7620 I simply modified the existing Vec rather than collecting into a new one to save maybe an allocation

for mut c in columns.iter_mut() {
...
}

Copy link
Contributor Author

@Dandandan Dandandan Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK there is a Rust compiler optimization that will reuse a Vec allocation if using into_iter / collect (on owned Vec) 🤔

Copy link
Contributor Author

@Dandandan Dandandan Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the reasons I think using Vec arrow-rs makes sense, there are many optimizations that are not feasible to implement in arrow-rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think using Vec is a great idea - it is both simpler and as can be seen by this PR often quite a bit faster

})
.collect();

let buffers = if buffer.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only way this could happen is if ideal_buffer_size is zero, right? In that case we could make a more efficient inner loop that simply copied the views without checking length, etc

if ideal_buffer_size == 0 {
 // just reuse views as is...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 🚀

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖: Benchmark completed

Details

group                                                          coalesce_improvements                  main
-----                                                          ---------------------                  ----
concat 1024 arrays boolean 4                                   1.00     27.8±0.03µs        ? ?/sec    1.00     27.8±0.05µs        ? ?/sec
concat 1024 arrays i32 4                                       1.02     12.9±0.03µs        ? ?/sec    1.00     12.6±0.03µs        ? ?/sec
concat 1024 arrays str 4                                       1.01     55.5±0.41µs        ? ?/sec    1.00     54.8±0.25µs        ? ?/sec
concat boolean 1024                                            1.00    430.1±0.99ns        ? ?/sec    1.03    441.0±0.45ns        ? ?/sec
concat boolean 8192 over 100 arrays                            1.00     50.9±0.47µs        ? ?/sec    1.00     50.9±0.12µs        ? ?/sec
concat boolean nulls 1024                                      1.00    760.0±0.89ns        ? ?/sec    1.05    801.7±2.11ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays                      1.00    110.0±0.20µs        ? ?/sec    1.00    109.8±0.27µs        ? ?/sec
concat fixed size lists                                        1.00   744.1±30.52µs        ? ?/sec    1.07   799.4±23.49µs        ? ?/sec
concat i32 1024                                                1.00    437.3±0.71ns        ? ?/sec    1.05    458.3±0.82ns        ? ?/sec
concat i32 8192 over 100 arrays                                1.00    211.9±5.17µs        ? ?/sec    1.05    223.3±7.68µs        ? ?/sec
concat i32 nulls 1024                                          1.00    746.0±2.06ns        ? ?/sec    1.06    792.1±1.96ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                          1.00    268.7±3.93µs        ? ?/sec    1.06    285.8±6.43µs        ? ?/sec
concat str 1024                                                1.08     14.3±1.22µs        ? ?/sec    1.00     13.3±0.96µs        ? ?/sec
concat str 8192 over 100 arrays                                1.00    105.5±0.84ms        ? ?/sec    1.01    106.2±0.77ms        ? ?/sec
concat str nulls 1024                                          1.06      7.4±0.71µs        ? ?/sec    1.00      7.0±0.83µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.00     52.6±0.47ms        ? ?/sec    1.01     53.1±0.35ms        ? ?/sec
concat str_dict 1024                                           1.00      2.9±0.01µs        ? ?/sec    1.06      3.0±0.01µs        ? ?/sec
concat str_dict_sparse 1024                                    1.00      6.8±0.02µs        ? ?/sec    1.03      7.1±0.03µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2           1.00      6.6±0.04µs        ? ?/sec    1.05      7.0±0.06µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0               1.00     78.2±1.70µs        ? ?/sec    1.28    100.3±0.45µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.00     84.2±0.32µs        ? ?/sec    1.84    154.8±1.36µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.00     90.0±0.90µs        ? ?/sec    1.10     99.2±0.34µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.00     96.4±1.09µs        ? ?/sec    1.56    150.5±0.46µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00     46.8±2.89µs        ? ?/sec    1.56     72.9±0.97µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00     53.7±2.96µs        ? ?/sec    1.46     78.5±0.80µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jun 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (a94a096) to 44d7194 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 7, 2025

🤖: Benchmark completed

Details

group                                                                                coalesce_improvements                  main
-----                                                                                ---------------------                  ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.00    300.7±1.94ms        ? ?/sec    1.00    300.1±2.31ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.00      8.9±0.09ms        ? ?/sec    1.01      9.1±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.01      4.3±0.11ms        ? ?/sec    1.00      4.3±0.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.5±0.03ms        ? ?/sec    1.02      3.5±0.02ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.12    294.6±2.72ms        ? ?/sec    1.00    262.6±2.47ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.00     10.1±0.08ms        ? ?/sec    1.04     10.5±0.08ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.00      4.6±0.05ms        ? ?/sec    1.03      4.7±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.01      4.6±0.02ms        ? ?/sec    1.00      4.6±0.02ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.00     66.0±0.55ms        ? ?/sec    1.06     69.7±0.71ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     12.8±0.18ms        ? ?/sec    1.00     12.9±0.11ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.05     10.4±0.34ms        ? ?/sec    1.00      9.9±0.28ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.04     10.6±0.28ms        ? ?/sec    1.00     10.1±0.18ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.00     82.7±0.55ms        ? ?/sec    1.09     90.0±0.45ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.00     14.5±0.14ms        ? ?/sec    1.00     14.6±0.12ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.02     10.4±0.34ms        ? ?/sec    1.00     10.2±0.29ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.05     10.3±0.17ms        ? ?/sec    1.00      9.7±0.25ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.00     66.6±0.35ms        ? ?/sec    1.29     86.1±0.53ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.00      8.7±0.04ms        ? ?/sec    1.22     10.7±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.00      5.0±0.10ms        ? ?/sec    1.21      6.1±0.23ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.00      3.4±0.03ms        ? ?/sec    1.26      4.3±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.00     86.4±0.52ms        ? ?/sec    1.23    106.2±0.52ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00     11.9±0.07ms        ? ?/sec    1.24     14.7±0.09ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.00      5.8±0.16ms        ? ?/sec    1.27      7.3±0.19ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      3.9±0.01ms        ? ?/sec    1.04      4.0±0.02ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.00     58.1±0.23ms        ? ?/sec    1.35     78.6±1.09ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.00      6.9±0.02ms        ? ?/sec    1.33      9.2±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.00      2.8±0.04ms        ? ?/sec    1.43      4.0±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00      2.6±0.01ms        ? ?/sec    1.82      4.7±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.00     70.9±0.27ms        ? ?/sec    1.22     86.5±0.35ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.00     10.3±0.03ms        ? ?/sec    1.31     13.6±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.00      3.8±0.10ms        ? ?/sec    1.37      5.3±0.02ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.00      5.0±0.02ms        ? ?/sec    1.52      7.6±0.03ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.00     84.0±0.30ms        ? ?/sec    1.58    133.1±0.53ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.00     11.3±0.04ms        ? ?/sec    1.38     15.6±0.07ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      5.1±0.16ms        ? ?/sec    1.44      7.4±0.14ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00      4.5±0.02ms        ? ?/sec    2.02      9.2±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.00    112.3±0.67ms        ? ?/sec    1.42    159.2±0.91ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.00     15.8±0.05ms        ? ?/sec    1.42     22.4±0.06ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.00      7.1±0.11ms        ? ?/sec    1.49     10.6±0.15ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      7.2±0.02ms        ? ?/sec    1.79     13.0±0.03ms        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jun 7, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (a94a096) to 44d7194 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 7, 2025

🤖: Benchmark completed

Details

group                                                                         coalesce_improvements                  main
-----                                                                         ---------------------                  ----
filter context decimal128 (kept 1/2)                                          1.14     45.8±6.75µs        ? ?/sec    1.00     40.2±2.02µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.01     49.4±0.41µs        ? ?/sec    1.00     48.9±0.52µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    253.7±0.41ns        ? ?/sec    1.01    255.5±0.32ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     90.5±0.21µs        ? ?/sec    1.00     90.5±0.17µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00     13.4±0.49µs        ? ?/sec    1.03     13.7±0.42µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    476.5±0.74ns        ? ?/sec    1.01    479.2±0.70ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     70.6±0.06µs        ? ?/sec    1.00     70.8±0.11µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     70.7±0.14µs        ? ?/sec    1.00     70.6±0.05µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     70.7±0.08µs        ? ?/sec    1.00     70.7±0.08µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     70.7±0.08µs        ? ?/sec    1.00     70.7±0.14µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     70.7±0.15µs        ? ?/sec    1.00     70.7±0.10µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.00     70.6±0.07µs        ? ?/sec    1.00     70.7±0.14µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     70.8±0.23µs        ? ?/sec    1.00     70.7±0.28µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     70.7±0.24µs        ? ?/sec    1.00     70.7±0.11µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     70.7±0.10µs        ? ?/sec    1.00     70.7±0.16µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.00     22.7±0.04µs        ? ?/sec    1.00     22.6±0.04µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.05      6.5±0.54µs        ? ?/sec    1.00      6.2±0.30µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.00    254.5±0.41ns        ? ?/sec    1.02    258.8±0.40ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     93.9±0.20µs        ? ?/sec    1.00     93.9±0.31µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.00     13.5±0.45µs        ? ?/sec    1.03     13.9±0.53µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.19    574.3±1.01ns        ? ?/sec    1.00    482.2±0.92ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.05    120.1±7.82µs        ? ?/sec    1.00    114.6±6.77µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     58.0±1.34µs        ? ?/sec    1.00     57.8±1.50µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00    671.3±1.36ns        ? ?/sec    1.10    735.4±4.62ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    109.9±0.60µs        ? ?/sec    1.03    112.9±5.98µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00     58.4±1.25µs        ? ?/sec    1.04     60.5±0.11µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    489.9±1.49ns        ? ?/sec    1.02    498.9±0.58ns        ? ?/sec
filter context string (kept 1/2)                                              1.01   589.9±12.86µs        ? ?/sec    1.00   585.4±14.66µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     23.2±0.08µs        ? ?/sec    1.00     23.3±0.08µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00      7.1±0.32µs        ? ?/sec    1.05      7.5±0.45µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    824.4±3.42ns        ? ?/sec    1.00    821.5±0.79ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     94.6±0.17µs        ? ?/sec    1.00     94.8±0.18µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00     14.2±0.43µs        ? ?/sec    1.05     14.9±0.65µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00   1069.0±3.94ns        ? ?/sec    1.00   1065.4±3.55ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.02   644.3±19.62µs        ? ?/sec    1.00   630.1±13.05µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.01   1148.9±7.77ns        ? ?/sec    1.00   1133.0±1.68ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     18.9±0.03µs        ? ?/sec    1.00     18.9±0.03µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.12      2.0±0.01µs        ? ?/sec    1.00   1812.2±7.70ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.00    244.5±0.59ns        ? ?/sec    1.01    247.5±0.47ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00     89.9±0.12µs        ? ?/sec    1.00     89.9±0.10µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.00      8.7±0.01µs        ? ?/sec    1.00      8.7±0.02µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00    569.2±1.22ns        ? ?/sec    1.00    571.0±1.04ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.02     96.7±0.46µs        ? ?/sec    1.00     95.2±0.45µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.00     53.0±1.68µs        ? ?/sec    1.01     53.7±1.29µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      3.0±0.01µs        ? ?/sec    1.01      3.0±0.01µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    199.9±2.60µs        ? ?/sec    1.00    200.4±0.33µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.00    149.8±0.59µs        ? ?/sec    1.00    150.1±0.73µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.02     73.1±1.52µs        ? ?/sec    1.00     72.1±2.02µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      3.2±0.01µs        ? ?/sec    1.00      3.2±0.01µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.01    153.9±0.93µs        ? ?/sec    1.00    152.4±0.31µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.03     11.5±0.42µs        ? ?/sec    1.00     11.2±0.68µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.01      3.1±0.01µs        ? ?/sec    1.00      3.1±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.00    184.8±9.56µs        ? ?/sec    1.11    204.2±4.76µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    199.9±5.65µs        ? ?/sec    1.07    213.1±8.57µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      3.1±0.01µs        ? ?/sec    1.01      3.2±0.00µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.04     95.4±0.14µs        ? ?/sec    1.00     91.3±0.13µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.01      8.7±0.37µs        ? ?/sec    1.00      8.6±0.38µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.00      3.1±0.01µs        ? ?/sec    1.00      3.1±0.00µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     84.7±0.18µs        ? ?/sec    1.00     84.1±0.24µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.01      2.8±0.01µs        ? ?/sec    1.00      2.8±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.8±0.00µs        ? ?/sec    1.00      2.8±0.00µs        ? ?/sec
filter run array (kept 1/2)                                                   1.00    358.4±0.77µs        ? ?/sec    1.00    357.3±1.07µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.01    310.6±1.92µs        ? ?/sec    1.00    309.0±1.61µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    247.8±1.18µs        ? ?/sec    1.00    246.8±1.07µs        ? ?/sec
filter single record batch                                                    1.00     93.0±0.28µs        ? ?/sec    1.02     94.6±0.21µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.05     95.6±0.57µs        ? ?/sec    1.00     91.3±0.45µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.03      3.9±0.01µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.0±0.00µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec

Comment on lines +291 to +312
let views: Vec<u128> = s
.views()
.as_ref()
.iter()
.cloned()
.map(|v| {
let mut b: ByteView = ByteView::from(v);

if b.length > 12 {
let offset = buffer.len() as u32;
buffer.extend_from_slice(
buffers[b.buffer_index as usize]
.get(b.offset as usize..b.offset as usize + b.length as usize)
.expect("Invalid buffer slice"),
);
b.offset = offset;
b.buffer_index = 0; // Set buffer index to 0, as we only have one buffer
}

b.into()
})
.collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious it is better if implement like this:

let mut views = Vec::with_capacity(s.views().len());
for &v in s.views().as_ref() {
    let mut b: ByteView = ByteView::from(v);
    if b.length > 12 {
        ...
    }
    views.push(b.into());
}

vector pre-allocates and no clone 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that is generally worse - .iter().collect() on rust core types optimize to better code:

the cloned() you see on iter().cloned() will generally not make a difference on primitive types for the generated code.

@Dandandan Dandandan merged commit 7739a83 into apache:main Jun 7, 2025
26 checks passed
@Dandandan
Copy link
Contributor Author

Thanks for the reviews @alamb - let's continue making it faster!

Dandandan added a commit to Dandandan/arrow-rs that referenced this pull request Jun 7, 2025
alamb pushed a commit that referenced this pull request Jun 7, 2025
#7623)

This reverts commit 7739a83.

# Which issue does this PR close?



# Rationale for this change
I found this errors in DataFusion (see
apache/datafusion#16249 (comment)),
so let's revert it and find the error.


# What changes are included in this PR?


# Are there any user-facing changes?
Dandandan added a commit to Dandandan/arrow-rs that referenced this pull request Jun 7, 2025
alamb pushed a commit that referenced this pull request Jun 8, 2025
#7625)

… (#7614)" (#7623)"

This reverts commit da461c8.

This adds a test and fix for the wrong index issue.
I also verified the change for DataFusion (and benchmarks show notable
improvements).

# Which issue does this PR close?


Closes #NNN.

# Rationale for this change


# What changes are included in this PR?



# Are there any user-facing changes?
@alamb
Copy link
Contributor

alamb commented Jun 12, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_improvements (289aeda) to 44d7194 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_improvements
Results will be posted here when complete

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of coalesce and concat for views

3 participants