WIP: Avoid allocations when applying filters to RowSelection #7541

alamb · 2025-05-23T17:20:29Z

Which issue does this PR close?

Related to #7458

Rationale for this change

It has been observed that and_then is slow. Let's try and optimize it by avoiding an extra copy

What changes are included in this PR?

Refactor code into explicit iterators so we can build the final output RowSelection directly from filters, rather than instantiating the intermediate selectors first.

Are there any user-facing changes?

No (other than hopefully faster performance)

alamb · 2025-05-23T17:21:07Z

parquet/src/arrow/arrow_reader/read_plan.rs

        self.selection = match self.selection.take() {
-            Some(selection) => Some(selection.and_then(&raw)),
-            None => Some(raw),
+            Some(selection) => Some(selection.apply_filters(&filters)),


The whole point of this PR is to call apply_filters here and avoid creating raw when there is an existing filter.

I will run some benchmarks and see

alamb · 2025-05-23T17:24:12Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_allocations (0d1074f) to e9df239 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_allocations
Results will be posted here when complete

alamb · 2025-05-23T17:24:42Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_allocations (0d1074f) to e9df239 diff
BENCH_NAME=row_selector
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench row_selector
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_allocations
Results will be posted here when complete

alamb · 2025-05-23T17:26:33Z

🤖: Benchmark completed

Details

group           alamb_less_allocations                 main
-----           ----------------------                 ----
and_then        1.00    963.4±3.25µs        ? ?/sec    1.11   1069.7±4.44µs        ? ?/sec
from_filters    1.48   910.3±30.01µs        ? ?/sec    1.00    613.5±9.90µs        ? ?/sec
intersection    1.05      2.0±0.00ms        ? ?/sec    1.00   1912.0±4.44µs        ? ?/sec
union           1.08      2.2±0.01ms        ? ?/sec    1.00      2.0±0.00ms        ? ?/sec

alamb · 2025-05-23T17:29:35Z

Hmm, the benchmark fails like this:

thread 'main' panicked at parquet/benches/arrow_reader_clickbench.rs:826:9:
assertion `left == right` failed: Expected 3312 rows, but got 98 in Q1
  left: 98
 right: 3312
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p parquet --bench arrow_reader_clickbench`

So there is some bug I need to address

Avoid allocations when applying filters to RowSelection

0d1074f

github-actions bot added the parquet Changes to the parquet crate label May 23, 2025

alamb commented May 23, 2025

View reviewed changes

fix docs

ff78e8c

alamb closed this Jun 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: Avoid allocations when applying filters to RowSelection #7541

WIP: Avoid allocations when applying filters to RowSelection #7541

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

Uh oh!

WIP: Avoid allocations when applying filters to RowSelection #7541

WIP: Avoid allocations when applying filters to RowSelection #7541

Uh oh!

Conversation

alamb commented May 23, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

alamb May 23, 2025

Choose a reason for hiding this comment

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

alamb commented May 23, 2025

Uh oh!

Uh oh!