-
Notifications
You must be signed in to change notification settings - Fork 1.8k
common: Add hashing support for REE arrays #18981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're aiming to close #16011 with this PR, perhaps we should add an end-to-end test showcasing REE arrays can be aggregated on?
datafusion/common/src/hash_utils.rs
Outdated
| let mut values_hashes = vec![0u64; values_len]; | ||
| create_hashes(&[Arc::clone(values)], random_state, &mut values_hashes)?; | ||
|
|
||
| // Get run ends buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove these verbose, LLM-like comments throughout the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do the same for some of the test comments as well?
Can do! Where do you think is the right place for a test like that? |
|
I have an example, but maybe in the interest of keeping PRs small and easily reviewable, I'll contribute the end-to-end example in a follow up? |
|
Let me know how you feel about the clippy lints, I personally find the suggestions from clippy in this case far harder to read since we're dealing with indexes, so I'd be inclined to add ignores, but let me know what you think! |
I prefer the clippy suggested way. Using indices is more C-style. |
c6669cf to
1e37fce
Compare
|
Adjusted to using mutable iterators. |
| } | ||
|
|
||
| // Downcast ArrayRef to RunArray | ||
| pub fn as_run_array<R: RunEndIndexType>(array: &dyn Array) -> Result<&RunArray<R>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function still needed ?
Its usage has been replaced with downcast_run_array!() and now it is not used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be useful to have anyway, following suit with similar functions for other types above
datafusion/common/src/hash_utils.rs
Outdated
| let mut prev_run_end = 0; | ||
|
|
||
| for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) { | ||
| let run_end = run_ends.values()[i].as_usize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may fail for a sliced RunArray.
Sliced RunArray has the same values but sliced run_ends, so their indices do not match 1:1.
https://docs.rs/arrow-array/latest/src/arrow_array/array/run_array.rs.html#247-253
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could zip the run_ends.values().iter() as part of the for clause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very good point, let me see what I can do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) { | ||
| let run_end = run_ends.values()[i].as_usize(); | ||
|
|
||
| if rehash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other functions check whether the current value is valid/non-null before hashing it when rehash == true. Is this needed here too ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you're right, we need to. Combining hashes modifies the hash even on 0 hashes (which makes sense). So we do need to care.
datafusion/common/src/hash_utils.rs
Outdated
| *hash = combine_hashes(*value_hash, *hash); | ||
| } | ||
| } else { | ||
| for hash in hashes_buffer.iter_mut().take(run_end).skip(prev_run_end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be optimized to hashes_buffer[prev_run_end..run_end].fill(value_hash) (SIMD friendly!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I think it's fine to add the tests in this PR. Perhaps we can add the test as an SLT, assuming |
datafusion/common/src/hash_utils.rs
Outdated
| let run_ends = array.run_ends(); | ||
| let mut prev_run_end = 0; | ||
|
|
||
| for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for (i, value_hash) in values_hashes.iter().enumerate().take(values_len) { | |
| for (i, value_hash) in values_hashes.iter().enumerate() { |
|
In addition to the comments, I also realized that we were also always hashing all values, which is unnecessary, so I changed the implementation to first figure out via the runs which values are relevant and only hash those and then process them accordingly. |
|
Ok I went down a bit of a rabbithole, and I have a working SLT, but I think it's better as a follow up, as we need arrow 57.1.0 as we need apache/arrow-rs#8765 and another patch in datafusion for the arrow cast to work. |
Which issue does this PR close?
Closes #16011
What changes are included in this PR?
Support for hashing of REE arrays
Are these changes tested?
Unit tests are added.
Are there any user-facing changes?
No, strictly additive behavior/new feature.
@alamb @vegarsti