perf: reduce time_elapsed_opening overhead by godnight10061 · Pull Request #5879 · vortex-data/vortex

godnight10061 · 2026-01-07T16:58:09Z

This PR tries to reduce time_elapsed_opening for Vortex scans (observed vs Parquet).

Changes:

vortex-file: shrink default footer initial read from 1MiB to MAX_POSTSCRIPT_SIZE + EOF_SIZE (~64KiB) and add a regression test.
vortex-scan: make ScanBuilder::into_stream() lazy (defer prepare() / split registration until first poll) and add a unit test to ensure stream construction has no split-planning side effects.
vortex-datafusion: expose the footer initial read size as a format option (footer_initial_read_size_bytes) and plumb it into VortexOpenOptions::with_initial_read_size.
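
To illustrate the lazy into_stream() change, here is a minimal std-only sketch, not the actual vortex-scan code. It uses a plain Iterator in place of an async Stream, so "first poll" becomes "first call to next()". LazyScan, LazyStream, fail_planning, and prepare_calls are hypothetical names introduced for illustration; only the idea of deferring prepare() until the first poll comes from this PR.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a scan builder (not the real `ScanBuilder`).
struct LazyScan {
    splits: Vec<u32>,
    fail_planning: bool,
    /// Counts how often planning ran, so laziness can be observed from outside.
    prepare_calls: Rc<Cell<usize>>,
}

impl LazyScan {
    /// Stand-in for the expensive planning step (`prepare()` / split registration).
    fn prepare(&self) -> Result<Vec<u32>, String> {
        self.prepare_calls.set(self.prepare_calls.get() + 1);
        if self.fail_planning {
            Err("planning failed".to_string())
        } else {
            Ok(self.splits.clone())
        }
    }

    /// Construction only captures the builder; no planning happens here.
    fn into_stream(self) -> LazyStream {
        LazyStream::Unprepared(self)
    }
}

/// State machine: planning runs on the transition out of `Unprepared`.
enum LazyStream {
    Unprepared(LazyScan),
    Ready(std::vec::IntoIter<u32>),
    Done,
}

impl Iterator for LazyStream {
    type Item = Result<u32, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Temporarily take ownership of the current state, leaving `Done` behind.
        match std::mem::replace(self, LazyStream::Done) {
            LazyStream::Unprepared(scan) => match scan.prepare() {
                Ok(splits) => {
                    let mut it = splits.into_iter();
                    let first = it.next().map(Ok);
                    *self = LazyStream::Ready(it);
                    first
                }
                // Planning errors surface as the first item; the state stays `Done`.
                Err(e) => Some(Err(e)),
            },
            LazyStream::Ready(mut it) => {
                let item = it.next().map(Ok);
                *self = LazyStream::Ready(it);
                item
            }
            LazyStream::Done => None,
        }
    }
}
```

In this sketch, constructing the stream leaves prepare_calls at zero, and a planning failure is returned as the first item rather than at construction time.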

Notes:

Scan planning errors now surface on first poll instead of during into_stream() construction.
If the footer/schema/layout don’t fit in the initial window, read_footer will issue additional reads as before.
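
The initial-window-plus-fallback behavior can be sketched roughly as follows. This is a toy model over an in-memory byte slice, not the vortex-file implementation: the file layout (footer bytes followed by a 4-byte little-endian length), the constant values, and the returned read count are all assumptions made for illustration.

```rust
/// Assumed values for illustration; the real constants are defined in `vortex-file`.
const EOF_SIZE: usize = 8;
const MAX_POSTSCRIPT_SIZE: usize = 65_527;
/// Roughly 64KiB, matching the new default described in this PR.
const INITIAL_READ_SIZE: usize = MAX_POSTSCRIPT_SIZE + EOF_SIZE;

/// Toy layout (an assumption, not the Vortex format):
/// [ data ... | footer bytes | footer_len: u32 little-endian ]
///
/// Returns the footer bytes and the number of reads that were needed.
fn read_footer(file: &[u8], initial_read_size: usize) -> Option<(Vec<u8>, usize)> {
    const LEN_FIELD: usize = 4;
    let file_len = file.len();
    if file_len < LEN_FIELD {
        return None;
    }

    // First read: a window covering the tail of the file.
    let window_len = initial_read_size.max(LEN_FIELD).min(file_len);
    let window_start = file_len - window_len;
    let window = &file[window_start..];

    // The last four bytes of the window hold the footer length.
    let n = window.len();
    let footer_len =
        u32::from_le_bytes([window[n - 4], window[n - 3], window[n - 2], window[n - 1]]) as usize;

    let footer_end = file_len - LEN_FIELD;
    // A length larger than the file is treated as corrupt.
    let footer_start = footer_end.checked_sub(footer_len)?;

    if footer_start >= window_start {
        // The footer fits inside the initial window: one read is enough.
        let s = footer_start - window_start;
        Some((window[s..s + footer_len].to_vec(), 1))
    } else {
        // The footer starts before the window: issue an additional read.
        Some((file[footer_start..footer_end].to_vec(), 2))
    }
}
```

A smaller window reduces the bytes fetched up front, at the cost of a second read whenever the footer is larger than the window; that trade-off is what a configurable initial read size lets callers tune.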

Tests:

cargo +nightly fmt --all --check
cargo clippy -p vortex-datafusion --all-targets --all-features -- -D warnings
cargo test --locked -p vortex-file -p vortex-scan -p vortex-io
cargo test --locked -p vortex-datafusion

Related: #4677

AdamGS · 2026-01-07T17:06:21Z

I think that for many modern setups, 1MiB and 64KiB are pretty close in terms of latency, so maybe the takeaway here is to expose that as a config for the DataFusion integration?
The deferred scan stream looks very promising; I'll give it a deeper look tomorrow.

codecov · 2026-01-07T17:18:13Z

Codecov Report

❌ Patch coverage is 70.96774% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.91%. Comparing base (be977c0) to head (57cc0af).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
vortex-scan/src/scan_builder.rs	42.10%	44 Missing ⚠️
vortex-file/src/open.rs	98.18%	1 Missing ⚠️

AdamGS · 2026-01-07T18:52:13Z

Seems like running benchmarks from forks is still painful. I'm working on fixing it so we can get some numbers here.

godnight10061 · 2026-01-08T02:05:45Z

FYI: 'Rust (semver checks)' was failing due to cargo-semver-checks enabling all features, which pulls in the test-harness feature. That feature exports rstest_reuse #[template] #[export]-generated hashed macro_rules! names (e.g. search_sorted_conformance_), and semver-checks flags them as removed/renamed vs the baseline crate.

This PR updates the semver job to use feature-group: default-features so the semver surface matches the supported public API and avoids those false positives.

Because this touches .github/workflows, GitHub marks the latest CI run as action_required until a maintainer approves running workflows for this fork PR.

AdamGS · 2026-01-08T09:21:01Z

We're working on the semver check elsewhere. It's not a required check anyway, more of a general indication.

godnight10061 · 2026-01-08T09:55:20Z

Thanks! I'll turn the initial read size into a config as suggested. Really appreciate you handling the benchmark infra for forks—looking forward to your feedback tomorrow!

AdamGS · 2026-01-08T11:01:27Z

.github/workflows/ci.yml

+        with:
+          # Avoid enabling test-only feature flags (e.g. `test-harness`) that export unstable
+          # procedural-macro-generated items and create false-positive semver diffs.
+          feature-group: default-features


Let's remove this from this PR; we might refactor that whole macro or use an upstream fix.

AdamGS · 2026-01-08T15:13:02Z

You can see the benchmark results in the run summary. It seems to make more of a difference for DuckDB, but I'll take that. I expect it to also make a difference for DataFusion, or at least for that metric; I'll try it out soon.

AdamGS · 2026-01-08T15:27:23Z

A small local test I ran showed very nice improvements to time_elapsed_opening which is cool.

AdamGS

LGTM!

connortsui20 · 2026-01-08T22:42:49Z

There's a noticeable improvement in several of our benchmarks because of this which is nice! Though I did notice regressions in Clickbench Q6 and Q23 that someone should probably investigate?

AdamGS assigned AdamGS and unassigned AdamGS Jan 7, 2026

AdamGS self-requested a review January 7, 2026 17:06

joseph-isaacs added the action/benchmark label (Trigger full benchmarks to run on this PR) Jan 7, 2026

AdamGS added the changelog/performance (A performance improvement) and action/benchmark-sql (Trigger SQL benchmarks to run on this PR) labels and removed the action/benchmark label Jan 7, 2026

joseph-isaacs added the action/benchmark-sql label and removed the action/benchmark-sql label Jan 8, 2026

AdamGS reviewed Jan 8, 2026

AdamGS added the action/benchmark-sql label and removed the action/benchmark-sql label Jan 8, 2026

godnight10061 added 6 commits January 9, 2026 00:09

Reduce default INITIAL_READ_SIZE to MAX_POSTSCRIPT_SIZE+EOF and add r…

0bf2dea

…egression test Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>

Make ScanBuilder::into_stream lazy

0376b5c

Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>

Loosen footer read regression test

db9958d

Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>

Fix lint failures in lazy scan and footer read test

1b09efc

Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>

datafusion: make footer initial read size configurable

b662b20

Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>

datafusion: fix clippy dead_code in tests

4856cf1

Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>

godnight10061 force-pushed the fix/4677-initial-read-size branch from 1dcf3ae to 4856cf1 Compare January 8, 2026 16:13

AdamGS approved these changes Jan 8, 2026

Merge branch 'develop' into fix/4677-initial-read-size

57cc0af

AdamGS enabled auto-merge (squash) January 8, 2026 17:12

AdamGS merged commit f9c3c20 into vortex-data:develop Jan 8, 2026
47 of 48 checks passed

connortsui20 mentioned this pull request Jan 8, 2026

Clickbench Q6 and Q23 regressions #5899

Closed

godnight10061 deleted the fix/4677-initial-read-size branch January 16, 2026 13:27

AdamGS mentioned this pull request Jan 21, 2026

time_elapsed_opening spends too much time #4677

Closed

AdamGS merged 7 commits into vortex-data:develop from godnight10061:fix/4677-initial-read-size
