perf: reduce time_elapsed_opening overhead#5879
Conversation
|
I think that for many modern setups, 1Mb and 64KB are pretty close in terms of latency, maybe the takeaway here is to expose that as a config for the DataFusion integration? |
Codecov Report❌ Patch coverage is
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Seems like running benchmarks from forks is still painful, I'm working on fixing it so we can get some numbers here. |
|
FYI: 'Rust (semver checks)' was failing due to cargo-semver-checks enabling all features, which pulls in the est-harness feature. That feature exports stest_reuse #[template] #[export]-generated hashed macro_rules! names (e.g. search_sorted_conformance_), and semver-checks flags them as removed/renamed vs the baseline crate. This PR updates the semver job to use eature-group: default-features so the semver surface matches the supported public API and avoids those false positives. Because this touches .github/workflows, GitHub marks the latest CI run as action_required until a maintainer approves running workflows for this fork PR. |
|
we're working on the semver check elsewhere, it's not a required check anyway, more of a general indication |
|
Thanks! I'll turn the initial read size into a config as suggested. Really appreciate you handling the benchmark infra for forks—looking forward to your feedback tomorrow! |
.github/workflows/ci.yml
Outdated
| with: | ||
| # Avoid enabling test-only feature flags (e.g. `test-harness`) that export unstable | ||
| # procedural-macro-generated items and create false-positive semver diffs. | ||
| feature-group: default-features |
There was a problem hiding this comment.
Lets remove this from this PR, we might refactor that whole macro or use an upstream fix
|
You can see the benchmark results in the run summary, seems to make more of a difference for DuckDB but I'll take that, I expect it to also make a difference for DataFusion or at least for that metric, I'll try it out soon. |
|
A small local test I ran showed very nice improvements to |
…egression test Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
1dcf3ae to
4856cf1
Compare
|
There's a noticeable improvement in several of our benchmarks because of this which is nice! Though I did notice regressions in Clickbench Q6 and Q23 that someone should probably investigate? |
This PR tries to reduce `time_elapsed_opening` for Vortex scans (observed vs Parquet). Changes: - `vortex-file`: shrink default footer initial read from 1MiB to `MAX_POSTSCRIPT_SIZE + EOF_SIZE` (~64KiB) and add a regression test. - `vortex-scan`: make `ScanBuilder::into_stream()` lazy (defer `prepare()` / split registration until first poll) and add a unit test to ensure stream construction has no split-planning side effects. - `vortex-datafusion`: expose the footer initial read size as a format option (`footer_initial_read_size_bytes`) and plumb it into `VortexOpenOptions::with_initial_read_size`. Notes: - Scan planning errors now surface on first poll instead of during `into_stream()` construction. - If the footer/schema/layout don’t fit in the initial window, `read_footer` will issue additional reads as before. Tests: - `cargo +nightly fmt --all --check` - `cargo clippy -p vortex-datafusion --all-targets --all-features -- -D warnings` - `cargo test --locked -p vortex-file -p vortex-scan -p vortex-io` - `cargo test --locked -p vortex-datafusion` Related: #4677 --------- Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com> Co-authored-by: godnight10061 <godnight10061@users.noreply.github.com>
This PR tries to reduce
time_elapsed_openingfor Vortex scans (observed vs Parquet).Changes:
vortex-file: shrink default footer initial read from 1MiB toMAX_POSTSCRIPT_SIZE + EOF_SIZE(~64KiB) and add a regression test.vortex-scan: makeScanBuilder::into_stream()lazy (deferprepare()/ split registration until first poll) and add a unit test to ensure stream construction has no split-planning side effects.vortex-datafusion: expose the footer initial read size as a format option (footer_initial_read_size_bytes) and plumb it intoVortexOpenOptions::with_initial_read_size.Notes:
into_stream()construction.read_footerwill issue additional reads as before.Tests:
cargo +nightly fmt --all --checkcargo clippy -p vortex-datafusion --all-targets --all-features -- -D warningscargo test --locked -p vortex-file -p vortex-scan -p vortex-iocargo test --locked -p vortex-datafusionRelated: #4677