Skip to content

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 23, 2025

Which issue does this PR close?

Related to #7458

Rationale for this change

It has been observed that and_then is slow. Let's try and optimize it by avoiding an extra copy

What changes are included in this PR?

  1. Refactor code into explicit iterators so we can build the final output RowSelection directly from filters, rather than instantiating the intermediate selectors first.

Are there any user-facing changes?

No (other than hopefully faster performance)

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 23, 2025
self.selection = match self.selection.take() {
Some(selection) => Some(selection.and_then(&raw)),
None => Some(raw),
Some(selection) => Some(selection.apply_filters(&filters)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this PR is to call apply_filters here and avoid creating raw when there is an existing filter.

I will run some benchmarks and see

@alamb
Copy link
Contributor Author

alamb commented May 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_allocations (0d1074f) to e9df239 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_allocations
Results will be posted here when complete

@alamb
Copy link
Contributor Author

alamb commented May 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/less_allocations (0d1074f) to e9df239 diff
BENCH_NAME=row_selector
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench row_selector
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_less_allocations
Results will be posted here when complete

@alamb
Copy link
Contributor Author

alamb commented May 23, 2025

🤖: Benchmark completed

Details

group           alamb_less_allocations                 main
-----           ----------------------                 ----
and_then        1.00    963.4±3.25µs        ? ?/sec    1.11   1069.7±4.44µs        ? ?/sec
from_filters    1.48   910.3±30.01µs        ? ?/sec    1.00    613.5±9.90µs        ? ?/sec
intersection    1.05      2.0±0.00ms        ? ?/sec    1.00   1912.0±4.44µs        ? ?/sec
union           1.08      2.2±0.01ms        ? ?/sec    1.00      2.0±0.00ms        ? ?/sec

@alamb
Copy link
Contributor Author

alamb commented May 23, 2025

Hmm, the benchmark fails like this:

thread 'main' panicked at parquet/benches/arrow_reader_clickbench.rs:826:9:
assertion `left == right` failed: Expected 3312 rows, but got 98 in Q1
  left: 98
 right: 3312
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p parquet --bench arrow_reader_clickbench`

So there is some bug I need to address

@alamb alamb closed this Jun 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant