Make `push_batch_with_filter` up to 3x faster for primitive types #8951

Dandandan · 2025-12-04T14:28:18Z

Which issue does this PR close?

Closes #NNN.

Rationale for this change

filter: primitive, 8192, nulls: 0, selectivity: 0.001
                        time:   [20.430 ms 20.678 ms 21.105 ms]
                        change: [−65.000% −64.516% −63.806%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

filter: primitive, 8192, nulls: 0, selectivity: 0.01
                        time:   [3.3275 ms 3.3451 ms 3.3665 ms]
                        change: [−49.062% −48.663% −48.260%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high severe

Benchmarking filter: primitive, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
filter: primitive, 8192, nulls: 0, selectivity: 0.1
                        time:   [1.4759 ms 1.4887 ms 1.5105 ms]
                        change: [−26.613% −23.553% −15.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking filter: primitive, 8192, nulls: 0, selectivity: 0.8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.9s, enable flat sampling, or reduce sample count to 60.
filter: primitive, 8192, nulls: 0, selectivity: 0.8
                        time:   [1.3569 ms 1.3626 ms 1.3702 ms]
                        change: [−47.225% −46.850% −46.451%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.001
                        time:   [23.231 ms 23.295 ms 23.376 ms]
                        change: [−69.694% −69.516% −69.351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.01
                        time:   [5.4033 ms 5.4201 ms 5.4424 ms]
                        change: [−49.860% −49.590% −49.325%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.1
                        time:   [3.6111 ms 3.6270 ms 3.6475 ms]
                        change: [−27.778% −26.284% −25.286%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.8
                        time:   [3.6298 ms 3.7206 ms 3.8600 ms]
                        change: [−26.637% −24.714% −21.997%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
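
For reading the `change:` lines above: criterion reports the relative change in time, so a change of −50% means the new code takes half as long, i.e. it is 2x faster. A minimal sketch of that conversion (the helper name `speedup_from_change` is made up for illustration and is not part of criterion or arrow-rs):

```rust
/// Convert a criterion "change" percentage (negative means faster)
/// into a speedup factor: old_time / new_time.
fn speedup_from_change(change_percent: f64) -> f64 {
    // New time as a fraction of the old time, e.g. -64.516% -> 0.35484.
    let remaining_fraction = 1.0 + change_percent / 100.0;
    1.0 / remaining_fraction
}
```

With the midpoint estimates above, −64.516% works out to roughly 2.8x and −69.516% to roughly 3.3x, which is where the "up to 3x" in the title comes from.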

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Make filtered coalescing faster for primitive

Dandandan · 2025-12-04T16:00:15Z

@alamb you are probably interested in this

alamb · 2025-12-04T16:01:44Z

YAAAAASSS -- this is exactly the type of thing I was hoping for with BatchCoalescer. I will check this out shortly

Dandandan · 2025-12-04T16:05:47Z

arrow-select/src/coalesce.rs

-        let filtered_batch = filter_record_batch(&batch, filter)?;
-        self.push_batch(filtered_batch)
+        // We only support primitive types for now; fall back to filter_record_batch for other types
+        // Also, skip the optimization when the filter is not very selective


Not sure whether it is always better to take `biggest_coalesce_batch_size` into account here.
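
A rough sketch of the idea behind the removed lines in this diff, assuming the fast path copies selected values straight into the coalescer's in-progress buffer instead of first materializing a filtered batch. The thread does not show the new code, so this is not the implementation in `arrow-select/src/coalesce.rs`; it uses plain `Vec<i32>` and `&[bool]` in place of Arrow arrays, and both function names are hypothetical:

```rust
/// Two-step shape (like `filter_record_batch` followed by `push_batch`):
/// build an intermediate filtered copy, then append it to the output.
fn push_filtered_two_step(out: &mut Vec<i32>, values: &[i32], filter: &[bool]) {
    let filtered: Vec<i32> = values
        .iter()
        .zip(filter)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect();
    out.extend_from_slice(&filtered);
}

/// Fused shape: copy the selected values directly into the output buffer,
/// skipping the intermediate allocation and the second copy.
fn push_filtered_fused(out: &mut Vec<i32>, values: &[i32], filter: &[bool]) {
    out.extend(
        values
            .iter()
            .zip(filter)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v),
    );
}
```

Both functions produce the same output; the second one just avoids the temporary. The saving matters most when many small filtered results are coalesced into one large batch.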

alamb · 2025-12-04T16:51:52Z

run benchmark filter_kernels

alamb · 2025-12-04T16:52:00Z

show benchmark queue

alamb-ghbot · 2025-12-04T16:52:03Z

🤖 Hi @alamb, you asked to view the benchmark queue (#8951 (comment)).

Job	User	Benchmarks	Comment
`arrow-8933-3613162300.sh`	alamb	default	`https://github.com/apache/arrow-rs/pull/8933#issuecomment-3613162300`
`arrow-8933-3613131981.sh`	alamb	filter_kernels	`https://github.com/apache/arrow-rs/pull/8933#issuecomment-3613131981`
`arrow-8951-3613212415.sh`	alamb	filter_kernels	`https://github.com/apache/arrow-rs/pull/8951#issuecomment-3613212415`

Dandandan · 2025-12-04T17:04:14Z

Hm, it seems this contains a bug, which probably throws the benchmark results off as well (will take a look tomorrow).

alamb-ghbot · 2025-12-04T17:27:06Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (0872a9b) to ed9efe7 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

alamb-ghbot · 2025-12-04T17:51:25Z

🤖: Benchmark completed

Details

group                                                                         coalesce_batches_filter                main
-----                                                                         -----------------------                ----
filter context decimal128 (kept 1/2)                                          1.36     57.5±5.45µs        ? ?/sec    1.00     42.1±1.93µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     55.7±4.51µs        ? ?/sec    1.09     60.5±0.29µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    242.4±0.35ns        ? ?/sec    1.06    256.0±1.60ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     77.7±1.20µs        ? ?/sec    1.00     78.0±2.52µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00      9.9±0.32µs        ? ?/sec    1.01     10.1±0.30µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    444.2±7.59ns        ? ?/sec    1.06   469.4±13.36ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     60.7±1.16µs        ? ?/sec    1.00     60.7±0.37µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     60.7±0.36µs        ? ?/sec    1.00     60.7±0.56µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     60.6±0.26µs        ? ?/sec    1.00     60.8±1.05µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     60.8±1.45µs        ? ?/sec    1.00     60.7±1.02µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     60.7±0.71µs        ? ?/sec    1.00     60.8±1.22µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.01     61.2±3.05µs        ? ?/sec    1.00     60.8±0.90µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     60.7±0.46µs        ? ?/sec    1.00     60.8±0.46µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     61.0±2.06µs        ? ?/sec    1.00     60.7±0.55µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     60.8±1.25µs        ? ?/sec    1.00     60.8±1.00µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.01     16.6±0.28µs        ? ?/sec    1.00     16.5±0.30µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.04      6.5±0.20µs        ? ?/sec    1.00      6.2±0.17µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.00    236.0±5.78ns        ? ?/sec    1.05    246.9±1.45ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     77.8±2.17µs        ? ?/sec    1.00     77.9±0.80µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.00     10.1±0.52µs        ? ?/sec    1.04     10.5±0.18µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00    446.9±4.94ns        ? ?/sec    1.06    471.6±6.49ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.00    109.0±3.21µs        ? ?/sec    1.11    120.7±3.20µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     53.9±2.45µs        ? ?/sec    1.03     55.3±2.41µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00   654.9±19.57ns        ? ?/sec    1.04   677.9±18.99ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    104.2±1.47µs        ? ?/sec    1.08    112.2±3.44µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.02     55.5±1.25µs        ? ?/sec    1.00     54.5±0.23µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    464.2±2.70ns        ? ?/sec    1.06    491.4±7.75ns        ? ?/sec
filter context string (kept 1/2)                                              1.03   599.4±17.30µs        ? ?/sec    1.00    582.1±5.14µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     17.0±0.13µs        ? ?/sec    1.02     17.3±0.27µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00      7.0±0.34µs        ? ?/sec    1.02      7.2±0.27µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.02    847.3±9.58ns        ? ?/sec    1.00    829.8±3.84ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     78.8±1.05µs        ? ?/sec    1.00     78.9±2.34µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00     10.7±0.41µs        ? ?/sec    1.01     10.8±0.35µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.01  1076.9±14.42ns        ? ?/sec    1.00  1067.4±30.14ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00   703.0±13.80µs        ? ?/sec    1.00   703.8±19.93µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00  1016.7±52.17ns        ? ?/sec    1.02  1036.2±34.58ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     14.9±0.05µs        ? ?/sec    1.00     15.0±0.14µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.00  1829.3±23.69ns        ? ?/sec    1.11      2.0±0.01µs        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.00    231.0±5.30ns        ? ?/sec    1.03    238.8±0.83ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00     75.9±0.20µs        ? ?/sec    1.00     76.1±0.78µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.00      5.1±0.08µs        ? ?/sec    1.05      5.4±0.06µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00   441.3±12.39ns        ? ?/sec    1.06    467.4±2.32ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.00     49.5±0.83µs        ? ?/sec    1.18     58.6±2.81µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.17     61.3±2.70µs        ? ?/sec    1.00     52.6±1.25µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      2.9±0.09µs        ? ?/sec    1.13      3.2±0.08µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.07    166.6±7.99µs        ? ?/sec    1.00    156.4±2.84µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.12    141.6±1.19µs        ? ?/sec    1.00    126.0±3.73µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.11     76.6±1.07µs        ? ?/sec    1.00     68.7±1.04µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      2.7±0.09µs        ? ?/sec    1.29      3.5±0.10µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.17    141.8±2.30µs        ? ?/sec    1.00    121.1±0.87µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00     10.8±0.16µs        ? ?/sec    1.05     11.3±0.33µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      2.6±0.08µs        ? ?/sec    1.28      3.3±0.02µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.05    189.3±7.05µs        ? ?/sec    1.00    181.1±9.22µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    255.5±8.77µs        ? ?/sec    1.03    264.3±6.26µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      2.6±0.03µs        ? ?/sec    1.27      3.3±0.10µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.25     53.8±0.68µs        ? ?/sec    1.00     43.2±0.31µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.05      8.9±0.48µs        ? ?/sec    1.00      8.4±0.32µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.24      2.9±0.06µs        ? ?/sec    1.00      2.4±0.03µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     54.8±2.99µs        ? ?/sec    1.00     54.5±1.51µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.18      3.1±0.14µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.7±0.00µs        ? ?/sec    1.00      2.7±0.02µs        ? ?/sec
filter run array (kept 1/2)                                                   1.03   436.4±17.42µs        ? ?/sec    1.00    422.5±4.27µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.01    452.6±7.45µs        ? ?/sec    1.00   449.3±12.94µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.01   336.4±10.57µs        ? ?/sec    1.00    334.5±2.82µs        ? ?/sec
filter single record batch                                                    1.23     54.3±2.92µs        ? ?/sec    1.00     44.2±0.07µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     45.5±0.99µs        ? ?/sec    1.00     45.7±0.44µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.05      4.0±0.11µs        ? ?/sec    1.00      3.8±0.04µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.0±0.05µs        ? ?/sec    1.12      3.3±0.11µs        ? ?/sec

Dandandan · 2025-12-04T20:29:02Z

run benchmark coalesce_kernels

alamb-ghbot · 2025-12-04T20:29:06Z

🤖 Hi @Dandandan, thanks for the request (#8951 (comment)).

scrape_comments.py only supports whitelisted benchmarks.

Standard: (none)
Criterion: arrow_reader, concatenate_kernels, filter_kernels

Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...
Unsupported benchmarks: coalesce_kernels.

Dandandan · 2025-12-04T20:47:08Z

@alamb I think it's ok now - I called AI (Opus 4.5) for some help on the find_nth_set_bit_position function.

Mainly needs some polish and seeing if we can improve the filter: primitive, 8192, nulls: 0.1, selectivity: 0.8 case.
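
For readers unfamiliar with the operation, here is a minimal sketch of what "find the position of the n-th set bit" can look like. It is not the `find_nth_set_bit_position` from this PR (that code is not shown in the thread); the name `nth_set_bit_position_sketch`, the `&[u64]` input, and the 0-based `n` are assumptions made for illustration. Bits are numbered least-significant first within each word, as in Arrow bitmaps.

```rust
/// Returns the bit index of the `n`-th (0-based) set bit across `words`,
/// or `None` if there are not that many set bits.
fn nth_set_bit_position_sketch(words: &[u64], mut n: usize) -> Option<usize> {
    for (word_idx, &word) in words.iter().enumerate() {
        let ones = word.count_ones() as usize;
        if n >= ones {
            // The target bit is not in this word: skip it in one step.
            n -= ones;
            continue;
        }
        // The target is in this word: clear the `n` lowest set bits,
        // after which the lowest remaining set bit is the one we want.
        let mut w = word;
        for _ in 0..n {
            w &= w - 1;
        }
        return Some(word_idx * 64 + w.trailing_zeros() as usize);
    }
    None
}
```

One plausible use in a coalescer (again an assumption, not confirmed here) is finding where to split a filter so that exactly as many selected rows as still fit in the current output batch are taken.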

alamb · 2025-12-05T14:35:54Z

run benchmark coalesce_kernels

I added this to the allowed benchmarks

alamb · 2025-12-05T14:36:00Z

run benchmark coalesce_kernels

alamb-ghbot · 2025-12-05T14:36:06Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (dcf4864) to ed9efe7 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

alamb-ghbot · 2025-12-05T14:55:52Z

🤖: Benchmark completed

Details

group                                                                                coalesce_batches_filter                main
-----                                                                                -----------------------                ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.01    261.9±3.23ms        ? ?/sec    1.00    259.4±2.06ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.00      8.6±0.14ms        ? ?/sec    1.01      8.7±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.00      4.1±0.06ms        ? ?/sec    1.01      4.1±0.09ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.5±0.01ms        ? ?/sec    1.02      3.5±0.02ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.00    245.6±2.39ms        ? ?/sec    1.27    312.5±3.08ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.01      9.4±0.09ms        ? ?/sec    1.00      9.4±0.07ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.00      4.5±0.08ms        ? ?/sec    1.02      4.6±0.08ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      4.6±0.03ms        ? ?/sec    1.01      4.6±0.02ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.01     59.6±1.58ms        ? ?/sec    1.00     59.2±0.34ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     11.6±0.18ms        ? ?/sec    1.00     11.6±0.18ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.01      9.3±0.18ms        ? ?/sec    1.00      9.2±0.09ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      8.2±0.22ms        ? ?/sec    1.28     10.4±0.24ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.01     70.3±0.25ms        ? ?/sec    1.00     69.9±0.25ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.01     12.9±0.14ms        ? ?/sec    1.00     12.8±0.06ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.00      9.8±0.05ms        ? ?/sec    1.06     10.4±0.16ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00     10.0±0.25ms        ? ?/sec    1.02     10.1±0.20ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.05     50.7±0.30ms        ? ?/sec    1.00     48.1±0.17ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.03      6.2±0.06ms        ? ?/sec    1.00      6.0±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.00      4.5±0.11ms        ? ?/sec    1.00      4.5±0.15ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.02      3.1±0.03ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.04     60.3±0.24ms        ? ?/sec    1.00     58.1±0.25ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.03      8.2±0.03ms        ? ?/sec    1.00      7.9±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.00      5.6±0.13ms        ? ?/sec    1.07      6.0±0.11ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      3.9±0.02ms        ? ?/sec    1.01      3.9±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.03     43.5±0.56ms        ? ?/sec    1.00     42.5±0.09ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.05      4.9±0.22ms        ? ?/sec    1.00      4.7±0.01ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.03      2.4±0.05ms        ? ?/sec    1.00      2.3±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00  1466.4±10.31µs        ? ?/sec    1.05   1537.3±9.34µs        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.02     53.3±0.16ms        ? ?/sec    1.00     52.1±0.13ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.01      7.2±0.03ms        ? ?/sec    1.00      7.1±0.02ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.00      3.7±0.03ms        ? ?/sec    1.06      3.9±0.07ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.01      3.9±0.02ms        ? ?/sec    1.00      3.9±0.01ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.001                                1.00     54.1±1.62ms        ? ?/sec    1.80     97.2±0.21ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.01                                 1.00      5.9±0.03ms        ? ?/sec    1.57      9.3±0.02ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.1                                  1.00      3.2±0.09ms        ? ?/sec    1.17      3.7±0.05ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.8                                  1.00      2.7±0.01ms        ? ?/sec    1.14      3.1±0.02ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.001                              1.00     60.9±0.09ms        ? ?/sec    2.06    125.4±0.26ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.01                               1.00     10.9±0.04ms        ? ?/sec    1.38     15.1±0.06ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1                                1.16      8.6±0.20ms        ? ?/sec    1.00      7.4±0.36ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8                                1.62     14.7±0.04ms        ? ?/sec    1.00      9.1±0.04ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.04     68.2±0.48ms        ? ?/sec    1.00     65.7±1.26ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.09      7.9±0.04ms        ? ?/sec    1.00      7.3±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      3.6±0.17ms        ? ?/sec    1.08      3.9±0.21ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00   1400.0±6.27µs        ? ?/sec    1.02   1421.7±6.47µs        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.10     92.0±0.23ms        ? ?/sec    1.00     83.6±0.13ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.06     11.6±0.05ms        ? ?/sec    1.00     11.0±0.05ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.00      5.1±0.08ms        ? ?/sec    1.10      5.7±0.33ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      3.8±0.01ms        ? ?/sec    1.01      3.8±0.01ms        ? ?/sec

Dandandan · 2025-12-05T15:07:27Z

filter: primitive, 8192, nulls: 0, selectivity: 0.001                                1.00     54.1±1.62ms        ? ?/sec    1.80     97.2±0.21ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.01                                 1.00      5.9±0.03ms        ? ?/sec    1.57      9.3±0.02ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.1                                  1.00      3.2±0.09ms        ? ?/sec    1.17      3.7±0.05ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.8                                  1.00      2.7±0.01ms        ? ?/sec    1.14      3.1±0.02ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.001                              1.00     60.9±0.09ms        ? ?/sec    2.06    125.4±0.26ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.01                               1.00     10.9±0.04ms        ? ?/sec    1.38     15.1±0.06ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1                                1.16      8.6±0.20ms        ? ?/sec    1.00      7.4±0.36ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8                                1.62     14.7±0.04ms        ? ?/sec    1.00      9.1±0.04ms        ? ?/sec

Pretty good I would say... probably have to look a bit more at the null-handling speed

alamb · 2025-12-05T15:18:30Z

> Pretty good I would say... probably have to look a bit more at the null-handling speed

I feel there is a bunch of null-handling performance to be had via the work in Improvements to BooleanBufferBuilder / BooleanBuilder #8561

I'll try and review this PR more carefully later today

Dandandan · 2025-12-06T08:27:29Z

run benchmark coalesce_kernels filter_kernels

Dandandan · 2025-12-06T08:30:54Z

@alamb it seems we can tune the filter threshold value; a value >= 0.9 will probably give nice speedups, and we might even go further based on some benchmark results
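
To make the threshold discussion concrete, here is a small sketch of a threshold check. Everything in it is an assumption for illustration: the thread does not say how the threshold is defined, so this version takes it to be the largest fraction of kept rows for which the fast path is still used, and the names `selected_fraction` and `use_fast_path` are hypothetical.

```rust
/// Fraction of rows the filter keeps (0.0 for an empty filter).
fn selected_fraction(filter: &[bool]) -> f64 {
    if filter.is_empty() {
        return 0.0;
    }
    let kept = filter.iter().filter(|keep| **keep).count();
    kept as f64 / filter.len() as f64
}

/// Assumed rule: take the fast path only for supported (primitive) columns
/// and only while the kept fraction does not exceed `threshold`.
fn use_fast_path(filter: &[bool], all_primitive: bool, threshold: f64) -> bool {
    all_primitive && selected_fraction(filter) <= threshold
}
```

Under this reading, a threshold of 0.9 would keep the fast path for the `selectivity: 0.8` benchmark cases and fall back to `filter_record_batch` only when nearly every row is kept.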

Dandandan · 2025-12-06T08:31:06Z

run benchmark coalesce_kernels

Dandandan · 2025-12-06T08:48:19Z

It is now faster in all cases on my machine 🚀

Make filtered coalescing faster for primitive / byte types

6ecd42b

Make filtered coalescing faster for primitive

github-actions bot added the arrow Changes to the arrow crate label Dec 4, 2025

Make filtered coalescing faster for primitive types

a8df36f

Dandandan changed the title ~~Make filtered coalescing faster for primitive types~~ Make push_batch_with_filter faster for primitive types Dec 4, 2025

Dandandan added 3 commits December 4, 2025 16:41

Faster api

f20702b

Faster api

124b4e3

Faster api

79bd847

Dandandan changed the title ~~Make push_batch_with_filter faster for primitive types~~ Make push_batch_with_filter faster for primitive types: up to 10x faster Dec 4, 2025

Dandandan changed the title ~~Make push_batch_with_filter faster for primitive types: up to 10x faster~~ Make push_batch_with_filter up to 10x faster for primitive types Dec 4, 2025

Faster api

b2fc66f

Cleanup

0872a9b


Dandandan marked this pull request as draft December 4, 2025 17:08

Fix?

b7b3f18

Dandandan changed the title ~~Make push_batch_with_filter up to 10x faster for primitive types~~ Make push_batch_with_filter up to 2x faster for primitive types Dec 4, 2025

optimize

7758889

Dandandan changed the title ~~Make push_batch_with_filter up to 2x faster for primitive types~~ Make push_batch_with_filter up to 3x faster for primitive types Dec 4, 2025

Dandandan mentioned this pull request Dec 4, 2025

Add coalesce_kernels to allowed list alamb/datafusion-benchmarking#2

Closed

perf

dcf4864

alamb mentioned this pull request Dec 5, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-12-01 apache/datafusion#19016

Open

42 tasks

Dandandan added 2 commits December 5, 2025 16:34

comment

87626d1

Increase filter threshold

7c46a72

Dandandan added 3 commits December 6, 2025 09:36

Adapt comment

6ee1f04

More speed

c39a455

Fmt

dc0c45e

Don't collect

d2b5d29

Dandandan mentioned this pull request Dec 6, 2025

Support BatchCoalescer::push_batch_with_indices #8957

Open
