-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Set aggregation hash seed #16165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set aggregation hash seed #16165
Conversation
🤖 |
@@ -57,6 +57,10 @@ mod row_hash; | |||
mod topk; | |||
mod topk_stream; | |||
|
|||
/// Hard-coded seed for aggregations to ensure hash values differ from `RepartitionExec`, avoiding collisions. | |||
const AGGREGATION_HASH_SEED: ahash::RandomState = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ctsk -- I started some benchmark runs just to be sure, but I think this looks good to me
🤖: Benchmark completed Details
|
I am surprised this shows any performance difference. I will rerun and see if I can reproduce |
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
Second performance run looks as good / better so let's merge this in! |
Thanks again @ctsk |
This PR hard-codes the seed for the hash aggregation. The main benefit compared to the previously runtime-determined seed is that after applying this PR, partial aggregation and final aggregation will share the same hash function.
I haven't measured it, but in theory, this should make the final aggregation step more efficient, because the partial aggregation will emit the group values in a way that will be clustered in the final aggregation hash table - thus causing a benefitial memory access pattern when building the final aggregation.
I expect it speeds up large-cardinality aggregations that don't trigger the skipping of the partial aggregation step a tiny bit.