-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add xxhash algorithms in SQL and expression api #14367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add xxhash algorithms in SQL and expression api #14367
Conversation
Can you fix the clippy issue and generate the documentation (./dev/update_function_docs.sh) ? |
|
||
fn process_array<T: Hasher>(array: &dyn Array, mut hasher: T, hash_type: HashType) -> Result<Vec<String>> { | ||
let mut hash_results: Vec<String> = Vec::with_capacity(array.len()); | ||
for i in 0..array.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of downcasting the array for every single element in it see if you can do it once up front.
Self { | ||
signature: Signature::uniform( | ||
1, | ||
vec![Utf8View, Utf8, LargeUtf8, Binary, LargeBinary], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types you say you accept here do not match the types handled in process_array
or to_string_from_scalar
. You should likely handle all these and the int ones you are handling in those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I'll fix this. I had actually made several changes that had not been pushed. Due do a glitch in my device they got wiped off 😅
I removed the Int types from being accepted as input since I was thinking that allowing any other form of input other than a string wouldn't be ideal and I am anyway casting them to the Utf8 type. And all the hash functions expect their input to be of that form only. Even the crate I am using expects the same. Allowing for multiple types doesn't serve much of a purpose now that I think about it. I would love your thoughts on this!
And should I convert this issue to draft? Since I am still adding the optional seed argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's still a wip, sure, mark as draft with a WIP in the description helps a lot. Some hashes do allow binary input (See md5 for example, so that is normally reasonable.
Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests. |
I will add support for Null values and write the tests as well! |
This PR isn't passing CI checks so I am marking it as draft to clear the review queue. |
@Omega359 I found out that the md5 tests are found in functions.slt and expr.slt as well. While there are many other functions that have their own .slt file as well as tests in expr.slt Since I am planning to add wywash hash functions soon, should I make a separate hash.slt and include a few tests in the expr.slt as well? Thanks! |
The slt tests are known to need a bit of a cleanup in spots. A hash.slt I think is a good idea and we can have a followup PR to move the other hash functions in there for their tests. |
@Omega359 I had not pushed the updated cargo.lock file, hence the tests were failing. I think now this PR is ready for review. The hash.stl file has been added as well. There is a small conflict in the Cargo.lock that needs resolving. |
Thanks for continuing to work through this feature! ❤️ I left a number of feedback items. I think the largest concern I have is the downcast_ref code not handling StringViewArray where you likely want to switch to using StringArrayType. |
I have added the test for Utf8View array for xxhash32 and xxhash64 as well. |
The tests are failing due to code in the main itself at the time of branch creation. Is this PR ready for review? |
…n-API' into 14044/enhancement/add-xxhash-algorithms-in-expression-API
Thanks for applying the updates! I'll do one final review over the weekend then if it's good (I suspect it is!) then I'll ask a Committer to merge it in. |
While I can see opportunities for improvements in the code I think they are relatively minor and this PR is suitable for inclusion in DF. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how widespread the xxhash function is. If this is a feature to support delta-rs I think it would be suitable for inclusion.
If it is a function for one particular user's usecase I think we should be much more careful of taking on a new function (and the dependency and maintenance)
I started a conversation here
In my opinion there are an infinite number of potentially useful functions we could add to the DataFusion crate However, given there are extension APIs the only benefits of adding functions to the core crates vs installing them elsewere are:
As valuable as those properties are for users of datafusion enabled products / projects, they come at a cost to the core maintainers and our review / maintenance bandwidth is our most limited resource. |
So unless there are many users (and ideally maintainers) of new functions, I am hesitant to add new ones |
Note that this seems to also be in the spark function PR - https://github.com/apache/datafusion/pull/14392/files#diff-2bbff2d3a0ce9cec9ed9d6ec6e38ff910875af704b60855f43b47b46c96c5d44 |
That is all the more reason to perhaps defer to the spark function PR / focus on that |
1 similar comment
That is all the more reason to perhaps defer to the spark function PR / focus on that |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
We recently added the datafusion-spark crate for spark compatible functions. @Spaarsh , would you be willing to port this function over to that crate and update it to be spark compatible? |
Which issue does this PR close?
Rationale for this change
Lack of xxhash (a quick, non-cryptographic hashing technique) functions.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
The users shall now be able to use xxhash32 and xxhash64 functions.