Skip to content

Add xxhash algorithms in SQL and expression api #14367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

Spaarsh
Copy link
Contributor

@Spaarsh Spaarsh commented Jan 30, 2025

Which issue does this PR close?

Rationale for this change

Lack of xxhash (a quick, non-cryptographic hashing technique) functions.

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

The users shall now be able to use xxhash32 and xxhash64 functions.

@github-actions github-actions bot added the functions Changes to functions implementation label Jan 30, 2025
@Omega359
Copy link
Contributor

Omega359 commented Feb 3, 2025

Can you fix the clippy issue and generate the documentation (./dev/update_function_docs.sh) ?


fn process_array<T: Hasher>(array: &dyn Array, mut hasher: T, hash_type: HashType) -> Result<Vec<String>> {
let mut hash_results: Vec<String> = Vec::with_capacity(array.len());
for i in 0..array.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of downcasting the array for every single element in it see if you can do it once up front.

Self {
signature: Signature::uniform(
1,
vec![Utf8View, Utf8, LargeUtf8, Binary, LargeBinary],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types you say you accept here do not match the types handled in process_array or to_string_from_scalar. You should likely handle all these and the int ones you are handling in those.

Copy link
Contributor Author

@Spaarsh Spaarsh Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I'll fix this. I had actually made several changes that had not been pushed. Due do a glitch in my device they got wiped off 😅

I removed the Int types from being accepted as input since I was thinking that allowing any other form of input other than a string wouldn't be ideal and I am anyway casting them to the Utf8 type. And all the hash functions expect their input to be of that form only. Even the crate I am using expects the same. Allowing for multiple types doesn't serve much of a purpose now that I think about it. I would love your thoughts on this!

And should I convert this issue to draft? Since I am still adding the optional seed argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's still a wip, sure, mark as draft with a WIP in the description helps a lot. Some hashes do allow binary input (See md5 for example, so that is normally reasonable.

@Omega359
Copy link
Contributor

Omega359 commented Feb 3, 2025

Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests.

@Spaarsh Spaarsh marked this pull request as draft February 4, 2025 09:55
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 4, 2025
@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 4, 2025

I will add support for Null values and write the tests as well!

@Spaarsh Spaarsh marked this pull request as ready for review February 5, 2025 04:04
@alamb alamb marked this pull request as draft February 5, 2025 11:02
@alamb
Copy link
Contributor

alamb commented Feb 5, 2025

This PR isn't passing CI checks so I am marking it as draft to clear the review queue.

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 5, 2025

Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests.

@Omega359 I found out that the md5 tests are found in functions.slt and expr.slt as well. While there are many other functions that have their own .slt file as well as tests in expr.slt

Since I am planning to add wywash hash functions soon, should I make a separate hash.slt and include a few tests in the expr.slt as well?

Thanks!

@Omega359
Copy link
Contributor

Omega359 commented Feb 5, 2025

Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests.

@Omega359 I found out that the md5 tests are found in functions.slt and expr.slt as well. While there are many other functions that have their own .slt file as well as tests in expr.slt

Since I am planning to add wywash hash functions soon, should I make a separate hash.slt and include a few tests in the expr.slt as well?

The slt tests are known to need a bit of a cleanup in spots. A hash.slt I think is a good idea and we can have a followup PR to move the other hash functions in there for their tests.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 5, 2025
@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 5, 2025

@Omega359 I had not pushed the updated cargo.lock file, hence the tests were failing. I think now this PR is ready for review. The hash.stl file has been added as well. There is a small conflict in the Cargo.lock that needs resolving.

@Omega359
Copy link
Contributor

Omega359 commented Feb 6, 2025

Thanks for continuing to work through this feature! ❤️

I left a number of feedback items. I think the largest concern I have is the downcast_ref code not handling StringViewArray where you likely want to switch to using StringArrayType.

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 7, 2025

I have added the test for Utf8View array for xxhash32 and xxhash64 as well.

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 7, 2025

The tests are failing due to code in the main itself at the time of branch creation. Is this PR ready for review?

…n-API' into 14044/enhancement/add-xxhash-algorithms-in-expression-API
@Spaarsh Spaarsh marked this pull request as ready for review February 7, 2025 18:40
@Omega359
Copy link
Contributor

Omega359 commented Feb 7, 2025

Thanks for applying the updates! I'll do one final review over the weekend then if it's good (I suspect it is!) then I'll ask a Committer to merge it in.

@Omega359
Copy link
Contributor

Omega359 commented Feb 9, 2025

While I can see opportunities for improvements in the code I think they are relatively minor and this PR is suitable for inclusion in DF. Thanks!

@alamb

@alamb alamb changed the title 14044/enhancement/add xxhash algorithms in expression api Add xxhash algorithms in expression api Feb 10, 2025
@alamb alamb changed the title Add xxhash algorithms in expression api Add xxhash algorithms in SQL and expression api Feb 10, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Spaarsh and @Omega359

I don't understand how widespread the xxhash function is. If this is a feature to support delta-rs I think it would be suitable for inclusion.

If it is a function for one particular user's usecase I think we should be much more careful of taking on a new function (and the dependency and maintenance)

I started a conversation here

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 10, 2025

@alamb could you also get a clarification over the inclusion of wyhash functions as well? It was also requested under the same issue here.

@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

@alamb could you also get a clarification over the inclusion of wyhash functions as well? It was also requested under the same issue here.

In my opinion there are an infinite number of potentially useful functions we could add to the DataFusion crate

However, given there are extension APIs the only benefits of adding functions to the core crates vs installing them elsewere are:

  1. pre-integration (no extra work is required downstream)
  2. long term maintenance (there is an active community in DataFusion)

As valuable as those properties are for users of datafusion enabled products / projects, they come at a cost to the core maintainers and our review / maintenance bandwidth is our most limited resource.

@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

So unless there are many users (and ideally maintainers) of new functions, I am hesitant to add new ones

@Omega359
Copy link
Contributor

Note that this seems to also be in the spark function PR - https://github.com/apache/datafusion/pull/14392/files#diff-2bbff2d3a0ce9cec9ed9d6ec6e38ff910875af704b60855f43b47b46c96c5d44

@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

That is all the more reason to perhaps defer to the spark function PR / focus on that

1 similar comment
@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

That is all the more reason to perhaps defer to the spark function PR / focus on that

@alamb alamb marked this pull request as draft February 26, 2025 14:24
Copy link

github-actions bot commented May 1, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 1, 2025
@alamb
Copy link
Contributor

alamb commented May 5, 2025

We recently added the datafusion-spark crate for spark compatible functions. @Spaarsh , would you be willing to port this function over to that crate and update it to be spark compatible?

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label May 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement xxhash algorithms as part of the expression API
3 participants