Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse hash #11708

Closed
wants to merge 24 commits into from
Closed

Reuse hash #11708

wants to merge 24 commits into from

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jul 29, 2024

Which issue does this PR close?

Closes #11680

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 29, 2024
SELECT "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT "SearchPhrase", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY u DESC LIMIT 10;
SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no change compared to main

@jayzhan211

This comment was marked as outdated.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title [Draft; No change for trivial case, Worse for extreme case] Reuse hash [Draft; Benchmark shows No change] Reuse hash Jul 30, 2024
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 30, 2024

Screenshot 2024-07-30 at 5 54 04 PM

Cost ranked in GroupValuesRows
RowConverter::append
Rows::push
create_hashes

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the core Core DataFusion crate label Jul 30, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title [Draft; Benchmark shows No change] Reuse hash Reuse hash Jul 31, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
hashes_buffer: Default::default(),
rows_buffer,
random_state: Default::default(),
random_state: RandomState::with_seeds(0, 0, 0, 0),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With fixed seed to ensure the hash value is the same for the same group values

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
for (index, hash) in hash_buffer.iter().enumerate() {
indices[(*hash % *partitions as u64) as usize]
let hash_values = batch
.column_by_name("hash_value")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Aug 1, 2024
@Dandandan
Copy link
Contributor

I wonder if we can make "reusing hash" an optional feature?
In some cases reuse means materializing the column (distributed), so sometimes might be better to just recompute the hash.

@jayzhan211
Copy link
Contributor Author

I wonder if we can make "reusing hash" an optional feature? In some cases reuse means materializing the column (distributed), so sometimes might be better to just recompute the hash.

It makes sense to me. And it is possible to make it optional

@alamb
Copy link
Contributor

alamb commented Aug 1, 2024

If it turns out that reusing hash values isn't a good idea, I think we could close the ticket as "won't do"

Another thought, similarly to what @andygrove has been exploring in #5944 (comment) and I am working on in #7957 is to combine what is currently in different operators into a single one
to avoid a copy

Like maybe we could combine repartition and group by 🤔

but that might also be a crazy idea before i have had enough ☕ this morning

@alamb
Copy link
Contributor

alamb commented Aug 1, 2024

Another point is that generating the hash value is most expensive for string columns -- so maybe we could only reuse hash values when there are string columns as part of the groupby exprs

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Aug 1, 2024

Let's find out if combining group + repartition is reasonable, if yes, whether to reuse hash should be trivial since we don't need the overhead of record batch conversion anymore

Copy link

github-actions bot commented Oct 1, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Oct 1, 2024
@github-actions github-actions bot closed this Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of high cardinality grouping by reusing hash values
3 participants