Conversation
Need to address the issues (there might be code that does not expect the extra padding).
run benchmark boolean_kernels

🤖: Benchmark completed
arrow-buffer/src/buffer/boolean.rs (outdated diff):

```rust
return result;
let (prefix, aligned_u64s, suffix) =
    unsafe { aligned_start.as_ref().align_to::<u64>() };
if prefix.is_empty() && suffix.is_empty() {
```
Handling aligned + suffix could maybe be a bit better on x86 (I couldn't measure it on an Apple M2; I believe there is no performance difference there).
Handling both prefix + suffix was slightly slower than the unaligned version.
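A minimal sketch (hypothetical, not the PR's code) of the "aligned body + suffix" variant discussed here: process aligned u64 words directly and fold the trailing bytes into one zero-padded word.

```rust
/// Hypothetical sketch of the "aligned body + suffix" fast path.
fn unary_aligned_with_suffix(bytes: &[u8], op: impl Fn(u64) -> u64) -> Vec<u64> {
    // SAFETY: u64 has no invalid bit patterns, so reinterpreting
    // aligned bytes as u64 words is sound.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u64>() };
    assert!(prefix.is_empty(), "this sketch assumes a u64-aligned start");
    // Process the aligned body word by word.
    let mut out: Vec<u64> = words.iter().map(|w| op(*w)).collect();
    if !suffix.is_empty() {
        // Zero-pad the trailing partial word before applying `op`.
        let mut last = [0u8; 8];
        last[..suffix.len()].copy_from_slice(suffix);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}
```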
run benchmark boolean_kernels

🤖: Benchmark completed
FYI @alamb I think it's as good as it can be now.
```rust
let aligned_start = &src.as_ref()[aligned_offset / 8..slice_end];
let (prefix, aligned_u64s, suffix) = unsafe { aligned_start.as_ref().align_to::<u64>() };
```
I think the previous benchmark results are sometimes noisy because the underlying buffer is not always aligned to 64 bits (not 100% sure, but it would explain it), so the slower (unaligned) path is taken.
Now the slower path is not much slower. We probably also want to make sure the array creation path aligns to u64 in most cases, and that kernels preserve the alignment.
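As an aside, whether the fast path is taken comes down to the start address of the slice; a hypothetical helper (not arrow-rs code) to check this:

```rust
/// Hypothetical helper: a slice can only take the fully aligned path
/// when its start address is aligned to u64.
fn is_u64_aligned(bytes: &[u8]) -> bool {
    bytes.as_ptr() as usize % std::mem::align_of::<u64>() == 0
}
```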
@jhorstmann perhaps you want to take a look?
alamb left a comment:
This is very clever @Dandandan -- thank you
I don't understand the changes to the binary operations, and I do wonder if the "not creating aligned output" change is a concern.
arrow-buffer/src/buffer/boolean.rs (outdated diff):

```rust
    bit_offset: 0,
    bit_len: self.bit_len,
}
BooleanBuffer::from_bitwise_binary_op(
```
This change seems unrelated to the improvements in bitwise binary op and is perhaps the source of the reported ~50% slowdown of `and`?

```
group   main                        optimize_from_bitwise_unary_op
-----   ----                        ------------------------------
and     1.00  209.3±2.38ns ? ?/sec  1.48  310.3±1.11ns ? ?/sec
```
Hm, not sure if the change is due to this, but I agree the changes look unneeded for this PR.
I think the results for `and` might be noisy, just like the earlier results for `not` were noisy: it sometimes hits the aligned case and sometimes not (depending on whether the buffer happens to be allocated aligned).
(See the same perf in a run earlier:)

```
and     1.01  209.9±3.67ns ? ?/sec  1.00  208.4±0.58ns ? ?/sec
```

Also, the implementation of `buffer_bin_and` is currently as follows (showing the difference should indeed be due to noise):

```rust
BooleanBuffer::from_bitwise_binary_op(
    left,
    left_offset_in_bits,
    right,
    right_offset_in_bits,
    len_in_bits,
    |a, b| a & b,
)
.into_inner()
```
(We can make the same change for the binary case; I think the speedup there might be as much as ~5x.)
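A minimal sketch (hypothetical, not arrow-rs code) of what the same aligned fast path could look like for the binary case:

```rust
/// Hypothetical sketch: apply the align_to fast path to a binary op
/// over two byte slices of equal length.
fn binary_op_sketch(left: &[u8], right: &[u8], op: impl Fn(u64, u64) -> u64) -> Vec<u64> {
    debug_assert_eq!(left.len(), right.len());
    // SAFETY: u64 has no invalid bit patterns.
    let (lp, lw, ls) = unsafe { left.align_to::<u64>() };
    let (rp, rw, rs) = unsafe { right.align_to::<u64>() };
    if lp.is_empty() && ls.is_empty() && rp.is_empty() && rs.is_empty() {
        // Fast path: both inputs are whole, aligned u64 words.
        lw.iter().zip(rw).map(|(l, r)| op(*l, *r)).collect()
    } else {
        // Slow path: assemble words from (possibly unaligned) bytes.
        left.chunks(8)
            .zip(right.chunks(8))
            .map(|(lc, rc)| {
                let mut lb = [0u8; 8];
                lb[..lc.len()].copy_from_slice(lc);
                let mut rb = [0u8; 8];
                rb[..rc.len()].copy_from_slice(rc);
                op(u64::from_le_bytes(lb), u64::from_le_bytes(rb))
            })
            .collect()
    }
}
```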
```rust
}
BooleanBuffer::from_bits(self.as_slice(), offset, len).into_inner()
let chunks = self.bit_chunks(offset, len);
```
This change also seems unrelated -- perhaps we can pull it into its own PR.

This is required to make it work / make the tests pass, as `into_inner` throws away the bit offset and length.
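For illustration, a small example of that pitfall (the values here are made up):

```rust
use arrow_buffer::{BooleanBuffer, Buffer};

fn main() {
    let buffer = Buffer::from(vec![0b1010_1010u8]);
    // A view of 4 bits starting at bit offset 3.
    let bools = BooleanBuffer::new(buffer, 3, 4);
    // `into_inner` returns the raw byte buffer; the bit offset (3)
    // and bit length (4) are discarded here.
    let raw = bools.into_inner();
    assert_eq!(raw.len(), 1); // length in bytes, not bits
}
```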
```rust
let remainder = chunks.remainder();
let iter = chunks.map(|c| u64::from_le_bytes(c.try_into().unwrap()));
let vec_u64s: Vec<u64> = if remainder.is_empty() {
    iter.map(&mut op).collect()
```
In theory the remainder should never be empty, right? Otherwise the aligned path above would be hit.

Hm, I think the buffer itself (the address at offset 0) could still not be aligned to 64 bits and have a prefix in the path above (and thus end up on this path). The entire buffer from beginning to end could still be a multiple of 64 bits, in which case the remainder is empty.
```rust
let result_u64s: Vec<u64> = aligned_u64s.iter().map(|l| op(*l)).collect();
let buffer = Buffer::from(result_u64s);
Some(BooleanBuffer::new(buffer, 0, len_in_bits))
BooleanBuffer::new(vec_u64s.into(), offset_in_bits % 64, len_in_bits)
```
This is a key difference in the two approaches -- the current code on main will produce an output buffer that is aligned (offset is 0), but this code will produce an output buffer that is not aligned (same as the input).
That is probably why the benchmark results can be so much better in this case -- because the output is different (though still correct).
This is probably ok, but I wanted to point it out as a potential side effect.

Yes, that's indeed the main reason (not bit-shifting to create an offset of 0 is a ~3.5x speedup). The other part (~15% or so) is aligning to 8 bytes instead of 1 byte as much as possible, so the fast path can be taken as often as possible.
I also found that the combination of collect/`from_trusted_len_iterator` with either iterator is slow, due to a missing implementation of `fold` / not being able to use it in `from_trusted_len_iterator` (which probably still makes sense to PR), but with `chunks_exact` that isn't required.
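A minimal sketch (hypothetical, not the PR's exact code) of the `chunks_exact` pattern: map whole u64 words, then fold any trailing bytes into one zero-padded word.

```rust
/// Hypothetical sketch of the `chunks_exact` + `remainder` pattern.
fn map_words(bytes: &[u8], op: impl Fn(u64) -> u64) -> Vec<u64> {
    let chunks = bytes.chunks_exact(8);
    // `remainder` borrows from `bytes`, so it stays valid after the
    // iterator is consumed below.
    let remainder = chunks.remainder();
    let mut out: Vec<u64> = chunks
        .map(|c| op(u64::from_le_bytes(c.try_into().unwrap())))
        .collect();
    if !remainder.is_empty() {
        // Zero-pad the final partial word before applying `op`.
        let mut last = [0u8; 8];
        last[..remainder.len()].copy_from_slice(remainder);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}
```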
```rust
    bit_len: len_in_bits,
}
}
// align to byte boundaries
```
Ah, I made the other path too unlikely by "aligning" the input to 64 bits; let's add a case for this.
Which issue does this PR close?

Closes #9364 (`from_bitwise_unary_op`).

Rationale for this change

This is way better on non-byte offsets (not_sliced_1). It also rounds down to 64 bits instead of to byte boundaries, so the aligned path is more likely to be taken (not_slice_24).

What changes are included in this PR?

Uses `chunks_exact` (stable since Rust 1.31).

Are these changes tested?

Are there any user-facing changes?

The inner buffer isn't truncated to the number of bytes, but to 64 bits: a small change. However, given that `BooleanArray` is represented based on the offset and number of bits into an inner buffer, this should not affect most users.
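For illustration (hypothetical helper, not the PR's exact code), rounding the starting bit offset down to a 64-bit boundary could look like:

```rust
/// Hypothetical sketch: round the starting bit offset down to a
/// 64-bit boundary, so the slice handed to `align_to::<u64>` is more
/// likely to start on an aligned word.
fn round_down_to_u64_boundary(offset_in_bits: usize) -> (usize, usize) {
    let aligned_offset = offset_in_bits & !63; // largest multiple of 64 bits
    let bit_offset = offset_in_bits - aligned_offset; // remaining 0..64 bits
    (aligned_offset, bit_offset)
}
```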