Speedup statistics_from_parquet_metadata #20004
alamb
left a comment
Thanks @Dandandan -- this looks good to me
It also makes me wonder if there is more performance to be had in our statistics management -- for example, I notice that there are multiple call sites that convert the max values (when the table is first created and then during query to try pruning)
https://github.com/search?q=repo%3Aapache%2Fdatafusion%20row_group_maxes&type=code
    let scalar_array = value.to_scalar().ok()?;
    let eq_mask = eq(&scalar_array, &array).ok()?;
    let combined_mask = and(&eq_mask, exactness).ok()?;
As an aside, this is the type of pattern where I would really like an API that can reuse the allocation.
This could easily reuse the allocation of eq_mask
(not this PR)
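To make the allocation-reuse idea concrete, here is a minimal std-only sketch. It is not arrow-rs code: `Vec<bool>` stands in for Arrow's bit-packed `BooleanArray`, and `eq_scalar` and `and_in_place` are hypothetical helpers invented for illustration. The point is only that the AND result can be written back into the buffer already owned by `eq_mask` instead of allocating a third mask.

```rust
/// Stand-in for a boolean mask (assumption: arrow-rs uses a bit-packed
/// `BooleanArray`, which this sketch does not model).
type Mask = Vec<bool>;

/// Hypothetical helper: element-wise equality against a scalar.
/// This allocates one new mask, like the `eq` call in the snippet above.
fn eq_scalar(values: &[i64], scalar: i64) -> Mask {
    values.iter().map(|v| *v == scalar).collect()
}

/// Hypothetical in-place AND: takes ownership of `lhs` and overwrites its
/// buffer with `lhs AND rhs`, so no additional allocation is needed.
/// Returns `None` when the two masks differ in length.
fn and_in_place(mut lhs: Mask, rhs: &[bool]) -> Option<Mask> {
    if lhs.len() != rhs.len() {
        return None;
    }
    for (l, r) in lhs.iter_mut().zip(rhs) {
        *l = *l && *r;
    }
    Some(lhs)
}

fn main() {
    let maxes = [7, 3, 7, 7];
    let exactness = [true, true, false, true];

    let eq_mask = eq_scalar(&maxes, 7);
    let buffer_before = eq_mask.as_ptr();

    // The combined mask lives in the buffer that eq_mask allocated.
    let combined = and_in_place(eq_mask, &exactness).expect("equal lengths");
    assert_eq!(combined.as_ptr(), buffer_before);
    println!("{:?}", combined);
}
```

Taking `lhs` by value is what makes the reuse safe here: the caller gives up `eq_mask`, so nothing else can observe the overwritten buffer.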
    && exactness.null_count() == 0
    && exactness.true_count() == exactness.len()
    {
        accumulators.is_max_value_exact[logical_schema_index] = Some(true);
It does seem like this fast path will be hit often (as it is when all stats are exact, a common case).
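A rough std-only sketch of how such a fast path can sit in front of the comparison-based path is shown below. This is a reading of the snippets in this review, not the PR's actual code: `is_max_exact` is a hypothetical function, `Option<bool>` models a nullable exactness flag, and the slow path (the overall max is exact if some row group that attains it reports an exact statistic) is an assumption inferred from the `eq` and `and` calls quoted earlier.

```rust
/// Hypothetical helper: decides whether the overall max across row groups
/// is exact. `None` in `exactness` models a null exactness flag.
fn is_max_exact(maxes: &[i64], exactness: &[Option<bool>]) -> Option<bool> {
    if maxes.is_empty() || maxes.len() != exactness.len() {
        return None;
    }

    // Fast path: no nulls and every flag is true, so the overall max must
    // be exact and the per-value comparison can be skipped entirely.
    let null_count = exactness.iter().filter(|e| e.is_none()).count();
    let true_count = exactness.iter().filter(|e| **e == Some(true)).count();
    if null_count == 0 && true_count == exactness.len() {
        return Some(true);
    }

    // Slow path (assumed semantics): find the overall max, then check
    // whether any row group holding that value has an exact statistic.
    let overall = *maxes.iter().max()?;
    let exact = maxes
        .iter()
        .zip(exactness)
        .any(|(m, e)| *m == overall && *e == Some(true));
    Some(exact)
}

fn main() {
    // All statistics exact: answered by the fast path.
    let all_exact = [Some(true), Some(true), Some(true)];
    assert_eq!(is_max_exact(&[1, 5, 3], &all_exact), Some(true));

    // The row group holding the max (5) is inexact: slow path says false.
    let mixed = [Some(true), Some(false), Some(true)];
    assert_eq!(is_max_exact(&[1, 5, 3], &mixed), Some(false));
}
```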
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
🚀
I think there is also a lot of overhead in creating
Which issue does this PR close?
PR:
Main:
Rationale for this change
Improving cold starts.
What changes are included in this PR?
Are these changes tested?
Existing tests
Are there any user-facing changes?
No