
Speedup statistics_from_parquet_metadata #20004

Merged
Dandandan merged 3 commits into apache:main from Dandandan:perf_statistics_from_parquet_metadata
Jan 26, 2026

Conversation


@Dandandan Dandandan commented Jan 26, 2026

Which issue does this PR close?

PR:

SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;

Query 1 iteration 0 took 30.5 ms and returned 1 rows
Query 1 avg time: 30.48 ms

Main:

SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;

Query 1 iteration 0 took 39.6 ms and returned 1 rows
Query 1 avg time: 39.61 ms

Rationale for this change

Improving cold starts.

What changes are included in this PR?

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

@Dandandan Dandandan requested a review from adriangb January 26, 2026 10:30
@Dandandan Dandandan marked this pull request as ready for review January 26, 2026 10:30
@Dandandan Dandandan requested a review from alamb January 26, 2026 11:46
@alamb alamb left a comment


Thanks @Dandandan -- this looks good to me

It also makes me wonder if there is more performance to be had in our statistics management -- for example, I notice that there are multiple call sites that convert the max values (when the table is first created and then during query to try pruning)

https://github.com/search?q=repo%3Aapache%2Fdatafusion%20row_group_maxes&type=code


let scalar_array = value.to_scalar().ok()?;
let eq_mask = eq(&scalar_array, &array).ok()?;
let combined_mask = and(&eq_mask, exactness).ok()?;

As an aside, this is the type of pattern where I would really like an API that can reuse the allocation.

This could easily reuse the allocation of eq_mask.

(not this PR)
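The allocation-reuse idea can be sketched in plain Rust. The `and_inplace` helper below is hypothetical (Arrow's real kernels operate on packed bitmaps, and `and(&eq_mask, exactness)` allocates a new output buffer); this only illustrates the shape of the API being wished for:

```rust
/// Hypothetical in-place AND: folds `other` into `dst`'s existing
/// allocation instead of allocating a third buffer the way an
/// `and(&eq_mask, exactness)`-style kernel does.
fn and_inplace(dst: &mut Vec<bool>, other: &[bool]) {
    assert_eq!(dst.len(), other.len(), "masks must have equal length");
    for (d, o) in dst.iter_mut().zip(other) {
        *d = *d && *o;
    }
}

fn main() {
    // eq_mask's buffer is reused to hold the combined mask.
    let mut eq_mask = vec![true, true, false, true];
    let exactness = [true, false, true, true];
    and_inplace(&mut eq_mask, &exactness);
    assert_eq!(eq_mask, vec![true, false, false, true]);
}
```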

&& exactness.null_count() == 0
&& exactness.true_count() == exactness.len()
{
accumulators.is_max_value_exact[logical_schema_index] = Some(true);

It does seem like this fast path will be hit often (it applies whenever all stats are exact, which is the common case).

The PR author replied:

Yes!
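As a sketch of the fast path being discussed: the condition short-circuits the per-value work when the exactness mask contains no nulls and is all true. A minimal stand-alone Rust illustration, where `BoolMask` is a hypothetical stand-in for Arrow's BooleanArray, not the actual DataFusion type:

```rust
// Hypothetical stand-in for an Arrow BooleanArray: a value vector plus
// an optional validity bitmap (None means the array has no nulls).
struct BoolMask {
    values: Vec<bool>,
    validity: Option<Vec<bool>>,
}

impl BoolMask {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map(|v| v.iter().filter(|&&valid| !valid).count())
            .unwrap_or(0)
    }
    fn true_count(&self) -> usize {
        self.values.iter().filter(|&&b| b).count()
    }
}

// Fast path: with no nulls and every entry true, all statistics are
// exact, so the column can be marked exact without a per-value scan.
fn all_exact(exactness: &BoolMask) -> bool {
    exactness.null_count() == 0 && exactness.true_count() == exactness.len()
}

fn main() {
    let exact = BoolMask { values: vec![true, true, true], validity: None };
    let inexact = BoolMask { values: vec![true, false, true], validity: None };
    assert!(all_exact(&exact));
    assert!(!all_exact(&inexact));
}
```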

@alamb alamb added the performance Make DataFusion faster label Jan 26, 2026
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@github-actions github-actions bot added the datasource Changes to the datasource crate label Jan 26, 2026
@Dandandan Dandandan added this pull request to the merge queue Jan 26, 2026
@Dandandan

🚀

Merged via the queue into apache:main with commit 50a3e13 Jan 26, 2026
28 checks passed
Dandandan commented Jan 26, 2026

> Thanks @Dandandan -- this looks good to me
>
> It also makes me wonder if there is more performance to be had in our statistics management -- for example, I notice that there are multiple call sites that convert the max values (when the table is first created and then during query to try pruning)
>
> https://github.com/search?q=repo%3Aapache%2Fdatafusion%20row_group_maxes&type=code

I think there is also a lot of overhead in creating Arrays, accumulators, etc. (columnar data structures) for what are essentially only a handful of values...


Labels

datasource (Changes to the datasource crate), performance (Make DataFusion faster)


Development

Successfully merging this pull request may close these issues.

Speedup statistics_from_parquet_metadata (DataFusion side)

2 participants