-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor: Introduce utils::hash for StructArray #8552
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @jayzhan211
datafusion/common/src/hash_utils.rs
Outdated
}; | ||
|
||
for i in valid_indices { | ||
let mut values_hashes = vec![0u64; array.len()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you create a new vec for each iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create each values_hashes for each column
. And combine those hashed element in column to one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values_hashes
can be created upfront, outside of the loop.
let column_values = values[i].clone(); | ||
// create hashes for each column | ||
create_hashes(&[column_values], random_state, &mut values_hashes)?; | ||
let hash = &mut hashes_buffer[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not correct, as i
here to a column in the struct and hashes_buffer
to a hash of one value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan. Not fully got your point. I do consider each column as one value (iterate each row in one column with get one final hash hashes_buffer[i]
).
i
is the column index in the struct array and is also the hashes_buffer index.
For example, column1, null, column3, column4. We iterate 0, 2, and 3. And hash each of them to hashes_buffer[0], hashes_buffer[2], and hashes_buffer[3] respectively. A column-wise hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I mistake how the null buffer works in StructArray. It seems to be row-wise, but I consider it as column-wise...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks pretty much better now :)
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
3d7b5d3
to
fc36232
Compare
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
let valid_indices: Vec<usize> = if let Some(nulls) = nulls { | ||
nulls.valid_indices().collect() | ||
} else { | ||
(0..num_columns).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't yet seem correct, valid_indices
contains the rows that are valid, not the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hash each row now, so we iterate the valid row and hash each column at that row to one hash value.
}; | ||
|
||
let mut values_hashes = vec![0u64; array.len()]; | ||
create_hashes(array.columns(), random_state, &mut values_hashes)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to use hashes_buffer
here directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, we should be able to remove the code related to valid_indices
, as this already takes care of nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think so.
values_hashes: [9258723240401068087, 9258723240401068087, 8502738074356456021, 8502738074356456021, 4222447303697976283, 9753707356376286577]
hashes_buffer: [9258723240401091360, 9258723240401091360, 0, 8502738074356479294, 0, 0]
We can see that create_hashes
does not consider if the row is valid or not. Therefore, we need to iterate valid_indices
once .
@@ -327,12 +354,16 @@ pub fn create_hashes<'a>( | |||
array => hash_dictionary(array, random_state, hashes_buffer, rehash)?, | |||
_ => unreachable!() | |||
} | |||
DataType::Struct(_) => { | |||
let array = as_struct_array(array)?; | |||
hash_struct_array(array, random_state, hashes_buffer)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work?
hash_struct_array(array, random_state, hashes_buffer)?; | |
create_hashes(array.columns(), random_state, hashes_buffer)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we consider nulls, this doesn't work
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@Dandandan do you think this PR is ready to merge? |
@alamb @Dandandan Any update on this? |
Thank you @jayzhan211 @waynexia and @Dandandan |
I found the current implementation failed this
Let me checkout the reason |
* hash struct Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * row-wise hash Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * create hashes once Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Which issue does this PR close?
Closes #.
We will need this for #7893
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?