deduplicate variadic buffers in MutableArrayData::extend for ByteView arrays #6808
    _ => vec![],
let (variadic_data_buffers, buffer_to_idx) = match &data_type {
    DataType::BinaryView | DataType::Utf8View => {
        let mut buffer_to_idx = HashMap::new();
I wonder if building a hashmap / vec would be overly expensive (though we would need to run benchmarks to be sure)
Happy to run benchmarks. Do you have any particular ones in mind, or should I create one with criterion specific to this?
I think the ones in cast are probably a good place to start
@alamb @tustvold I added a string view case to the interleave benchmark and ran it on main, this PR (interleave-deduplicated), and #6779 (interleave-specific-impl).
I believe the penalty introduced by this PR would be mitigated in interleave's case if we also merge #6779. For other cases, it feels like the read / transfer-over-the-wire improvements might outweigh the cost. Happy to hear your thoughts.
Thank you @onursatici -- I hope to find time to review this PR this weekend or early next week
I (again) apologize for the delay in reviewing this PR. We are stretched quite thin as always
In general, I think this PR needs some tests to show it is working as well as ensure we don't break this functionality with some future PR.
Thank you for running the benchmarks. They seem promising and I will give them a more careful look if we proceed with this PR
@alamb no worries, and thank you for having a look. I have now added some tests checking the deduplication and remapping behaviour. Let me know whenever you have time if this looks good. Happy holidays!
Which issue does this PR close?
Closes #.
Rationale for this change
MutableArrayData adds all variadic buffers from input arrays together, potentially duplicating the same buffers in the output array.
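To make the duplication concrete, here is a minimal std-only sketch of the idea, not the code in this PR. `View`, `ViewArray`, `dedup_buffers` and `extend_views` are hypothetical stand-ins: real ByteView arrays pack each view into a `u128` and hold arrow `Buffer`s, and short values are stored inline without referencing a data buffer, which the sketch ignores. It assumes that two inputs share a buffer exactly when they point at the same allocation with the same length.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, simplified view: which data buffer it points into, and where.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    buffer_index: u32,
    offset: u32,
    length: u32,
}

/// Hypothetical stand-in for a view array: views plus shared data buffers.
struct ViewArray {
    views: Vec<View>,
    buffers: Vec<Arc<[u8]>>,
}

/// Collects the buffers of all inputs, keeping one entry per distinct
/// allocation. Returns the deduplicated buffers and, for each input, a table
/// mapping its old buffer index to the index in the deduplicated list.
fn dedup_buffers(inputs: &[&ViewArray]) -> (Vec<Arc<[u8]>>, Vec<Vec<u32>>) {
    let mut out: Vec<Arc<[u8]>> = Vec::new();
    // Key buffers by (start pointer, length) of the underlying allocation.
    let mut seen: HashMap<(*const u8, usize), u32> = HashMap::new();
    let mut remaps = Vec::with_capacity(inputs.len());
    for input in inputs {
        let mut remap = Vec::with_capacity(input.buffers.len());
        for buf in &input.buffers {
            let key = (buf.as_ptr(), buf.len());
            let idx = *seen.entry(key).or_insert_with(|| {
                out.push(Arc::clone(buf));
                (out.len() - 1) as u32
            });
            remap.push(idx);
        }
        remaps.push(remap);
    }
    (out, remaps)
}

/// Appends all views of `input` to `out`, rewriting each buffer index through
/// the remap table produced by `dedup_buffers` for that input.
fn extend_views(out: &mut Vec<View>, input: &ViewArray, remap: &[u32]) {
    out.extend(input.views.iter().map(|v| View {
        buffer_index: remap[v.buffer_index as usize],
        ..*v
    }));
}
```

Without the remap step, concatenating the buffer lists of two inputs that share a buffer would store that buffer twice in the output. The cost of the approach is the hash lookup per input buffer, which is what the benchmark discussion in this conversation is about.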
What changes are included in this PR?
extend now checks whether the same buffer has already been added from another input array, and changes the views being appended to point to the new, deduplicated buffer indices.
Are there any user-facing changes?