
Speedup statistics_from_parquet_metadata #20004

Merged
Dandandan merged 3 commits into apache:main from Dandandan:perf_statistics_from_parquet_metadata
Jan 26, 2026

Conversation


@Dandandan Dandandan commented Jan 26, 2026

Which issue does this PR close?

PR:

SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;

Query 1 iteration 0 took 30.5 ms and returned 1 rows
Query 1 avg time: 30.48 ms

Main:

SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;

Query 1 iteration 0 took 39.6 ms and returned 1 rows
Query 1 avg time: 39.61 ms

Rationale for this change

Improving cold starts.

What changes are included in this PR?

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

@Dandandan Dandandan requested a review from adriangb January 26, 2026 10:30
@Dandandan Dandandan marked this pull request as ready for review January 26, 2026 10:30
@Dandandan Dandandan requested a review from alamb January 26, 2026 11:46
@alamb alamb left a comment


Thanks @Dandandan -- this looks good to me

It also makes me wonder if there is more performance to be had in our statistics management -- for example, I notice that there are multiple call sites that convert the max values (when the table is first created and then during query to try pruning)

https://github.com/search?q=repo%3Aapache%2Fdatafusion%20row_group_maxes&type=code


let scalar_array = value.to_scalar().ok()?;
let eq_mask = eq(&scalar_array, &array).ok()?;
let combined_mask = and(&eq_mask, exactness).ok()?;

As an aside, this is the type of pattern where I would really like an API that can reuse the allocation.

This could easily reuse the allocation of eq_mask.

(not this PR)
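The allocation-reuse idea can be sketched in plain Rust. The `and_inplace` helper below is hypothetical (Arrow's real kernels operate on packed bitmaps, and `and(&eq_mask, exactness)` allocates a new output buffer); this only illustrates the shape of the API being wished for:

```rust
/// Hypothetical in-place AND: folds `other` into `dst`'s existing
/// allocation instead of allocating a third buffer the way an
/// `and(&eq_mask, exactness)`-style kernel does.
fn and_inplace(dst: &mut Vec<bool>, other: &[bool]) {
    assert_eq!(dst.len(), other.len(), "masks must have equal length");
    for (d, o) in dst.iter_mut().zip(other) {
        *d = *d && *o;
    }
}

fn main() {
    // eq_mask's buffer is reused to hold the combined mask.
    let mut eq_mask = vec![true, true, false, true];
    let exactness = [true, false, true, true];
    and_inplace(&mut eq_mask, &exactness);
    assert_eq!(eq_mask, vec![true, false, false, true]);
}
```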

&& exactness.null_count() == 0
&& exactness.true_count() == exactness.len()
{
accumulators.is_max_value_exact[logical_schema_index] = Some(true);

It does seem like this fast path will be hit often (it applies whenever all stats are exact, which is the common case).

The PR author replied:

Yes!
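As a sketch of the fast path being discussed: the condition short-circuits the per-value work when the exactness mask contains no nulls and is all true. A minimal stand-alone Rust illustration, where `BoolMask` is a hypothetical stand-in for Arrow's BooleanArray, not the actual DataFusion type:

```rust
// Hypothetical stand-in for an Arrow BooleanArray: a value vector plus
// an optional validity bitmap (None means the array has no nulls).
struct BoolMask {
    values: Vec<bool>,
    validity: Option<Vec<bool>>,
}

impl BoolMask {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map(|v| v.iter().filter(|&&valid| !valid).count())
            .unwrap_or(0)
    }
    fn true_count(&self) -> usize {
        self.values.iter().filter(|&&b| b).count()
    }
}

// Fast path: with no nulls and every entry true, all statistics are
// exact, so the column can be marked exact without a per-value scan.
fn all_exact(exactness: &BoolMask) -> bool {
    exactness.null_count() == 0 && exactness.true_count() == exactness.len()
}

fn main() {
    let exact = BoolMask { values: vec![true, true, true], validity: None };
    let inexact = BoolMask { values: vec![true, false, true], validity: None };
    assert!(all_exact(&exact));
    assert!(!all_exact(&inexact));
}
```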

@alamb alamb added the performance Make DataFusion faster label Jan 26, 2026
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@github-actions github-actions bot added the datasource Changes to the datasource crate label Jan 26, 2026
@Dandandan Dandandan added this pull request to the merge queue Jan 26, 2026
@Dandandan

🚀

Merged via the queue into apache:main with commit 50a3e13 Jan 26, 2026
28 checks passed
Dandandan commented Jan 26, 2026

> Thanks @Dandandan -- this looks good to me
>
> It also makes me wonder if there is more performance to be had in our statistics management -- for example, I notice that there are multiple call sites that convert the max values (when the table is first created and then during query to try pruning)
>
> https://github.com/search?q=repo%3Aapache%2Fdatafusion%20row_group_maxes&type=code

I think there is also a lot of overhead in creating Arrays, accumulators, etc. (columnar data structures) for what are essentially only a handful of values...


Labels

datasource (Changes to the datasource crate), performance (Make DataFusion faster)


Development

Successfully merging this pull request may close these issues.

Speedup statistics_from_parquet_metadata (DataFusion side)

2 participants