-
Notifications
You must be signed in to change notification settings - Fork 1k
perf: add optimized zip implementation for scalars #8653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is useful for `IF <expr> THEN <scalar> ELSE <scalar> END` TODO: - [ ] Need to add comments if missing - [ ] Add benchmark
| let scalars: Vec<T::Native> = predicate | ||
| .iter() | ||
| .map(|b| if b { then_val } else { else_val }) | ||
| .collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably use conditional move
| fn combine_nulls_and_false(predicate: &BooleanArray) -> BooleanBuffer { | ||
| if let Some(nulls) = predicate.nulls().filter(|n| n.null_count() > 0) { | ||
| predicate.values().bitand( | ||
| // nulls are represented as 0 (false) in the values buffer | ||
| nulls.inner(), | ||
| ) | ||
| } else { | ||
| predicate.values().clone() | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure there is already a helper function in arrow for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is prep_null_mask_filter: https://github.com/apache/arrow-rs/blob/a0bbe7faaad6303355c5e9461f91a177e267861f/arrow-select/src/filter.rs#L122-L121
# Which issue does this PR close? N/A # Rationale for this change I have a PR to improve zip perf for scalar but I don't see any benchmarks for it: - #8653 # What changes are included in this PR? created zip benchmarks for scalar and non scalar with different masks # Are these changes tested? N/A # Are there any user-facing changes? Nope
|
@alamb If you wanna run the benchmarks for zip, there are no more optimization left for this PR, only cleanups, tests and comments I saw for scalars major improvements while in array and scalar regression for some reason (maybe the extra check? even though it is a simple comparison. I run it on bare metal to reduce noise as much as possible) I tests it on: $ neofetch
.-/+oossssoo+/-. ubuntu@ip-
`:+ssssssssssssssssss+:` -----------------------
-+ssssssssssssssssssyyssss+- OS: Ubuntu 24.04.3 LTS x86_64
.ossssssssssssssssssdMMMNysssso. Host: c5.metal 1.0
/ssssssssssshdmmNNmmyNMMMMhssssss/ Kernel: 6.14.0-1011-aws
+ssssssssshmydMMMMMMMNddddyssssssss+ Uptime: 3 hours, 46 mins
/sssssssshNMMMyhhyyyyhmNMMMNhssssssss/ Packages: 921 (dpkg), 5 (snap)
.ssssssssdMMMNhsssssssssshNMMMdssssssss. Shell: bash 5.2.21
+sssshhhyNMMNyssssssssssssyNMMMysssssss+ Terminal: /dev/pts/0
ossyNMMMNyMMhsssssssssssssshmmmhssssssso CPU: Intel Xeon Platinum 8275CL (96) @ 3.900GHz
ossyNMMMNyMMhsssssssssssssshmmmhssssssso Memory: 2144MiB / 193025MiB
+sssshhhyNMMNyssssssssssssyNMMMysssssss+
.ssssssssdMMMNhsssssssssshNMMMdssssssss.
/sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
+sssssssssdmydMMMMMMMMddddyssssssss+
/ssssssssssshdmNNNNmyNMMMMhssssss/
.ossssssssssssssssssdMMMNysssso.
-+sssssssssssssssssyyyssss+-
`:+ssssssssssssssssss+:`
.-/+oossssoo+/-.
|
this will be used in: - apache#8653
# Conflicts: # arrow-buffer/src/buffer/mutable.rs
…e-zip-for-scalars
|
Thank you @rluvaton -- I have scheduled benchmarks for this PR and reviewed the dependent ones. Exciting stuff |
|
🤖 |
|
The zip benchmarks are still running... Maybe we should trim them back a bit |
|
🤖: Benchmark completed Details
|
|
Good, it looks like we have massive speedups |
# Which issue does this PR close? N/A # Rationale for this change doing `OffsetBuffer::from_lengths(std::iter::repeat_n(size, value.len()));` does not utilize SIMD (I explain further if you want) See [GodBolt Link](https://godbolt.org/z/PTsfvfjqx) Extracted from: - #8653 After this and the pr below is merged will improve the datafusion scalar to array to use this and make it really really fast: - #8658 # What changes are included in this PR? added new function # Are these changes tested? yes # Are there any user-facing changes? yes
yes, nice work! |
# Which issue does this PR close? N/A # Rationale for this change I want to repeat the same value multiple times in a very fast way which will be used in: - #8653 After this and the pr below is merged will improve the datafusion scalar to array to use this and make it really really fast: - #8656 # What changes are included in this PR? Created a function in `MutableBuffer` to repeat a slice a number of times in a logarithmic way to reduce memcopy calls # Are these changes tested? Yes # Are there any user-facing changes? Yes, and added docs ------- Extracted from: - #8653 Benchmark results on local machine | Slice Length | Repetitions (n) | repeat_slice_n_times | extend_from_slice loop | Speedup | |--------------|-----------------|----------------------|------------------------|---------| | 3 | 3 | 47.092 ns | 41.910 ns | 0.89x | | 3 | 64 | 63.548 ns | 222.29 ns | 3.50x | | 3 | 1024 | 105.57 ns | 3.031 µs | 28.7x | | 3 | 8192 | 405.71 ns | 24.170 µs | 59.6x | | 20 | 3 | 48.437 ns | 46.437 ns | 0.96x | | 20 | 64 | 74.993 ns | 319.04 ns | 4.25x | | 20 | 1024 | 350.94 ns | 4.437 µs | 12.6x | | 20 | 8192 | 2.440 µs | 35.524 µs | 14.6x | | 100 | 3 | 50.369 ns | 47.568 ns | 0.94x | | 100 | 64 | 119.70 ns | 165.37 ns | 1.38x | | 100 | 1024 | 1.734 µs | 2.623 µs | 1.51x | | 100 | 8192 | 10.615 µs | 19.750 µs | 1.86x | these are the results: <details> <summary>Result</summary> ``` MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=3 time: [46.719 ns 47.092 ns 47.453 ns] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=3 time: [41.833 ns 41.910 ns 41.996 ns] Found 11 outliers among 100 measurements (11.00%) 9 (9.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=64 time: [62.935 ns 63.548 ns 64.183 ns] Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=64 time: [221.75 ns 222.29 ns 222.86 ns] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=1024 time: [105.15 ns 105.57 ns 106.01 ns] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=1024 time: [3.0240 µs 3.0308 µs 3.0395 µs] Found 11 outliers among 100 measurements (11.00%) 2 (2.00%) low mild 5 (5.00%) high mild 4 (4.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=8192 time: [401.57 ns 405.71 ns 409.94 ns] Found 6 outliers among 100 measurements (6.00%) 6 (6.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=8192 time: [24.124 µs 24.170 µs 24.222 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=3 time: [48.287 ns 48.437 ns 48.606 ns] Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=3 time: [46.289 ns 46.437 ns 46.611 ns] Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=64 time: [74.625 ns 74.993 ns 75.395 ns] Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=64 time: [318.20 ns 319.04 ns 319.98 ns] Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=1024 time: [346.66 ns 350.94 ns 355.17 ns] Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 2 (2.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=1024 time: [4.4251 µs 4.4369 µs 4.4506 µs] Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 2 (2.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=8192 time: [2.4336 µs 2.4401 µs 2.4465 µs] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=8192 time: [35.466 µs 35.524 µs 35.589 µs] Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=3 time: [50.209 ns 50.369 ns 50.530 ns] Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=3 time: [47.439 ns 47.568 ns 47.701 ns] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=64 time: [117.77 ns 119.70 ns 122.00 ns] Found 12 outliers among 100 measurements (12.00%) 7 (7.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=64 time: [164.88 ns 165.37 ns 166.07 ns] Found 6 outliers among 100 measurements (6.00%) 5 (5.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=1024 time: [1.7278 µs 1.7335 µs 1.7398 µs] Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 5 (5.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=1024 time: [2.6176 µs 2.6232 µs 2.6305 µs] Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) high mild 4 (4.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=8192 time: [10.583 µs 10.615 µs 10.649 µs] Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=8192 time: [19.471 µs 19.750 µs 20.185 µs] Found 9 outliers among 100 measurements (9.00%) 2 (2.00%) high mild 7 (7.00%) high severe ``` </details>
# Conflicts: # arrow-buffer/src/buffer/mutable.rs
|
I am struggling to find enough contiguous focus time to review these PRs. They are on my radar, I just can't review them as fast as I want to Hopefully other people will be able to help review too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rluvaton -- this is (also) great 🚀
I think the only thing really needed is additional test coverage for the fallback impl and the special cases in BytesScalarImpl
Adding an implementation for ByteView types (Utf8View and BinaryView) will likely also improve performance a lot, but we can file a follow on ticket to track doing so -- this is better than what is currently on main
| /// - either Datum is not a scalar (or has more than 1 element) | ||
| /// | ||
| pub fn try_new(truthy: &dyn Datum, falsy: &dyn Datum) -> Result<Self, ArrowError> { | ||
| let (truthy, truthy_is_scalar) = truthy.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could potentially avoid the redundant call to truthy.get() and falsy.get() by returning Result<Option<Self>, ArrowError> (returning None if either argument was non scalar)
| fn combine_nulls_and_false(predicate: &BooleanArray) -> BooleanBuffer { | ||
| if let Some(nulls) = predicate.nulls().filter(|n| n.null_count() > 0) { | ||
| predicate.values().bitand( | ||
| // nulls are represented as 0 (false) in the values buffer | ||
| nulls.inner(), | ||
| ) | ||
| } else { | ||
| predicate.values().clone() | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is prep_null_mask_filter: https://github.com/apache/arrow-rs/blob/a0bbe7faaad6303355c5e9461f91a177e267861f/arrow-select/src/filter.rs#L122-L121
|
|
||
| let zip_impl = downcast_primitive! { | ||
| truthy.data_type() => (primitive_size_helper), | ||
| DataType::Utf8 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A natural extension of this work would be to add a special case for Datatype::Utf8View and DataType::BinaryView (as a follow on PR)
That would likely be super fast for many cases as it could simply copy views around and pre-compute the value buffer.
I'll file a follow on ticket
| } | ||
|
|
||
| impl<T: ArrowPrimitiveType> PrimitiveScalarImpl<T> { | ||
| fn get_scalar_and_null_buffer_for_single_non_nullable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused about this naming for a while -- eventually I see it means something like
| fn get_scalar_and_null_buffer_for_single_non_nullable( | |
| /// return an output array that has | |
| /// `value` in all locations where predicate is true | |
| /// `null` otherwise | |
| fn get_scalar_and_null_buffer_for_single_non_nullable( |
| predicate: BooleanBuffer, | ||
| value: T::Native, | ||
| ) -> (Vec<T::Native>, Option<NullBuffer>) { | ||
| let result_len = predicate.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed a special case for all nulls in the bytes impl
let number_of_true = predicate.count_set_bits();Is there a reason you didn't include the same case in the primitive builder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it might not be worth it.
For bytes we need to know the number of set and unset bits so we can preallocate the values buffer for all the cases
|
|
||
| let true_repeat_count = end - start; | ||
| // fill with truthy values | ||
| mutable.repeat_slice_n_times(truthy_val, true_repeat_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having to copy the same iterator so many times is quite unfortunate and is what Utf8View is designed to avoid -- you can have a single copy of the string and then copy them around
This is explained in blog form here if you are not familiar with them: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
| } | ||
|
|
||
| fn get_bytes_and_offset_for_all_same_value( | ||
| predicate: &BooleanBuffer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it a little confusing at first that predicate was passed in but only its length is used. Maybe passing in the len would make it clearer that the callsite doesn't need to negate the predicate as is needed in get_scalar_and_null_buffer_for_single_non_nullable
| } | ||
| } | ||
|
|
||
| impl<T: ByteArrayType> BytesScalarImpl<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason this is in its own impl block (not in the same as above?)
| fn create_output(&self, input: &BooleanArray) -> Result<ArrayRef, ArrowError>; | ||
| } | ||
|
|
||
| #[derive(Debug, PartialEq)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full report:
report.zip
| } | ||
| } | ||
|
|
||
| fn get_scalar_and_null_buffer_for_single_non_nullable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.


Waiting for the PRs below to be merged first:
zipkernel benchmarks #8654 - zip benchmarksThis PR include the following other PRs (unless merged) to make the review easier, so please make sure to review them first
repeat_slice_n_timestoMutableBuffer#8658 - extracted from thisWhich issue does this PR close?
N/A
Rationale for this change
Making zip really fast for scalars
This is useful for
IF <expr> THEN <literal> ELSE <literal> ENDWhat changes are included in this PR?
Created couple of implementation for zipping scalar, for primitive, bytes and fallback
Are these changes tested?
existing tests
Are there any user-facing changes?
new struct
ScalarZipperTODO: