
Optimize from_bitwise_unary_op #9297

Open
Dandandan wants to merge 21 commits into apache:main from Dandandan:optimize_from_bitwise_unary_op

Conversation


@Dandandan Dandandan commented Jan 29, 2026

Which issue does this PR close?

Rationale for this change

This is much faster for non-byte-aligned offsets (not_sliced_1). It also rounds the offset down to 64 bits instead of to a byte boundary, so the aligned path is taken more often (not_slice_24):

main                                                    optimize_from_bitwise_unary_op
not_sliced_1     3.57    621.1±3.75ns        ? ?/sec    1.00    174.2±0.37ns        ? ?/sec
not_slice_24     1.13    194.2±0.64ns        ? ?/sec    1.00    172.2±0.78ns        ? ?/sec

What changes are included in this PR?

  • Change the code to use the 64-bit aligned (or aligned + suffix) path as much as possible (see the sketch after this list)
  • Speed up the non-aligned path using chunks_exact (stable since Rust 1.31)
  • Avoid truncation, which removes the need to use the suffix later
  • Update code that used the inner buffer and assumed truncation
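
To make the first bullet concrete, here is a minimal sketch (with hypothetical names, not the PR's actual code) of rounding a bit offset down to a 64-bit boundary so that align_to::<u64> sees no prefix:

// Sketch: round the starting bit offset down to a 64-bit boundary so the
// byte slice handed to `align_to::<u64>` starts on a whole word.
fn align_start(offset_in_bits: usize) -> (usize, usize) {
    // First byte of the u64 word containing `offset_in_bits`.
    let aligned_offset = offset_in_bits & !63; // round down to a multiple of 64
    let start_byte = aligned_offset / 8;
    // Bits still to skip inside the aligned region (always < 64).
    let bit_offset = offset_in_bits % 64;
    (start_byte, bit_offset)
}

fn main() {
    // A slice starting at bit 70 now reads from byte 8 (bit 64) with offset 6.
    assert_eq!(align_start(70), (8, 6));
    // A byte-aligned but not word-aligned start (bit 24) reads from byte 0.
    assert_eq!(align_start(24), (0, 24));
}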

Are these changes tested?

Are there any user-facing changes?

The inner buffer is no longer truncated to the exact number of bytes, but to a multiple of 64 bits, which is a small change.
However, given that BooleanArray is represented by a bit offset and a number of bits into its inner buffer, the extra padding should not be observable.
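
For illustration, the changed truncation can be thought of as rounding the kept length up to whole u64 words rather than whole bytes (a sketch, not the PR's code):

// Sketch: the values buffer is kept at a multiple of 8 bytes (one u64)
// instead of being trimmed to the exact byte length.
fn buffer_len_bytes(len_in_bits: usize) -> usize {
    // Round up to whole u64 words, then convert words to bytes.
    len_in_bits.div_ceil(64) * 8
}

fn main() {
    assert_eq!(buffer_len_bytes(1), 8); // previously 1 byte
    assert_eq!(buffer_len_bytes(65), 16); // previously 9 bytes
}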

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 29, 2026
@Dandandan Dandandan marked this pull request as draft January 29, 2026 20:19

Dandandan commented Jan 29, 2026

Need to address the issues (there might be code that does not expect the extra padding).
We could perhaps reintroduce the truncation.

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (585b9f8) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    208.7±5.12ns        ? ?/sec    1.01    211.5±5.11ns        ? ?/sec
and_sliced_1     1.01  1104.4±41.22ns        ? ?/sec    1.00   1095.5±2.44ns        ? ?/sec
and_sliced_24    1.00    245.8±1.88ns        ? ?/sec    1.36    335.4±0.46ns        ? ?/sec
not              1.01    145.9±0.42ns        ? ?/sec    1.00    144.8±0.28ns        ? ?/sec
not_slice_24     1.01    195.0±2.04ns        ? ?/sec    1.00    193.6±2.00ns        ? ?/sec
not_sliced_1     3.41    621.0±6.17ns        ? ?/sec    1.00    182.2±0.19ns        ? ?/sec
or               1.00    197.8±4.69ns        ? ?/sec    1.01    199.7±0.28ns        ? ?/sec
or_sliced_1      1.00  1101.3±19.05ns        ? ?/sec    1.03   1136.4±1.73ns        ? ?/sec
or_sliced_24     1.00    247.0±1.67ns        ? ?/sec    1.16    285.8±2.09ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (ccc9fe2) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.01    208.9±2.37ns        ? ?/sec    1.00    207.5±0.38ns        ? ?/sec
and_sliced_1     1.00   1095.8±1.65ns        ? ?/sec    1.00   1096.6±6.02ns        ? ?/sec
and_sliced_24    1.00    245.7±3.72ns        ? ?/sec    1.37    335.8±1.56ns        ? ?/sec
not              1.03    146.8±2.26ns        ? ?/sec    1.00    142.0±0.71ns        ? ?/sec
not_slice_24     1.04    195.6±2.39ns        ? ?/sec    1.00    188.3±0.33ns        ? ?/sec
not_sliced_1     3.48    620.1±2.75ns        ? ?/sec    1.00    178.0±5.03ns        ? ?/sec
or               1.00    197.3±0.53ns        ? ?/sec    1.01    198.8±2.65ns        ? ?/sec
or_sliced_1      1.00   1096.2±1.38ns        ? ?/sec    1.04   1135.9±3.96ns        ? ?/sec
or_sliced_24     1.00    246.7±0.50ns        ? ?/sec    1.17    289.1±3.16ns        ? ?/sec

@Dandandan Dandandan marked this pull request as ready for review February 5, 2026 19:40
@Dandandan Dandandan requested a review from alamb February 5, 2026 19:51
return result;
let (prefix, aligned_u64s, suffix) =
    unsafe { aligned_start.as_ref().align_to::<u64>() };
if prefix.is_empty() && suffix.is_empty() {
Contributor Author

Handling aligned + suffix could maybe be a bit faster on x86 (I couldn't measure a difference on an Apple M2, so I believe there is none there).
Handling both prefix + suffix was slightly slower than the unaligned version.
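
For readers following along, a self-contained sketch of this dispatch (illustrative, not the PR's code) looks like:

// Take the word-at-a-time path only when `align_to::<u64>` finds neither a
// prefix nor a suffix; otherwise fall back to a byte-wise loop.
fn count_ones(bytes: &[u8]) -> u32 {
    // SAFETY: reinterpreting the middle slice as u64 words is valid for u8 input.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u64>() };
    if prefix.is_empty() && suffix.is_empty() {
        // Fully aligned: operate on whole u64 words.
        words.iter().map(|w| w.count_ones()).sum()
    } else {
        // Unaligned fallback: operate byte by byte.
        bytes.iter().map(|b| b.count_ones()).sum()
    }
}

fn main() {
    let v = vec![0xFFu8; 16];
    assert_eq!(count_ones(&v), 128); // both paths agree
}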

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (df25192) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.02    212.4±3.72ns        ? ?/sec    1.00    207.2±0.88ns        ? ?/sec
and_sliced_1     1.01   1101.7±5.11ns        ? ?/sec    1.00   1091.6±1.36ns        ? ?/sec
and_sliced_24    1.00    248.1±4.08ns        ? ?/sec    1.34    332.7±1.07ns        ? ?/sec
not              1.04    148.9±3.67ns        ? ?/sec    1.00    143.0±0.99ns        ? ?/sec
not_slice_24     1.03    197.0±4.09ns        ? ?/sec    1.00    191.6±0.48ns        ? ?/sec
not_sliced_1     3.57    621.1±3.75ns        ? ?/sec    1.00    174.2±0.37ns        ? ?/sec
or               1.00    199.4±3.54ns        ? ?/sec    1.00    199.8±0.71ns        ? ?/sec
or_sliced_1      1.00  1112.4±44.84ns        ? ?/sec    1.02   1139.1±1.96ns        ? ?/sec
or_sliced_24     1.00    251.9±8.50ns        ? ?/sec    1.14    286.5±1.98ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (6e95b3a) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.02    209.7±5.22ns        ? ?/sec    1.00    206.1±0.60ns        ? ?/sec
and_sliced_1     1.00   1096.4±3.50ns        ? ?/sec    1.00  1092.0±20.97ns        ? ?/sec
and_sliced_24    1.00    245.4±1.05ns        ? ?/sec    1.34    329.6±1.54ns        ? ?/sec
not              1.01    146.2±0.59ns        ? ?/sec    1.00    144.6±2.28ns        ? ?/sec
not_slice_24     1.13    194.2±0.64ns        ? ?/sec    1.00    172.2±0.78ns        ? ?/sec
not_sliced_1     3.60    619.4±2.46ns        ? ?/sec    1.00    172.1±0.73ns        ? ?/sec
or               1.00    196.5±1.35ns        ? ?/sec    1.01    197.5±0.76ns        ? ?/sec
or_sliced_1      1.00  1100.6±14.77ns        ? ?/sec    1.04   1139.3±8.90ns        ? ?/sec
or_sliced_24     1.00    247.2±1.04ns        ? ?/sec    1.16    286.2±2.74ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (cf32fcb) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.01    209.9±3.67ns        ? ?/sec    1.00    208.4±0.58ns        ? ?/sec
and_sliced_1     1.01   1098.0±9.70ns        ? ?/sec    1.00   1088.3±8.18ns        ? ?/sec
and_sliced_24    1.00    245.0±1.25ns        ? ?/sec    1.34    329.3±2.20ns        ? ?/sec
not              1.67    239.3±2.81ns        ? ?/sec    1.00    143.1±1.07ns        ? ?/sec
not_slice_24     1.31    227.1±0.56ns        ? ?/sec    1.00    173.2±2.35ns        ? ?/sec
not_sliced_1     3.70    641.0±5.75ns        ? ?/sec    1.00    173.4±4.10ns        ? ?/sec
or               1.15    229.2±4.99ns        ? ?/sec    1.00    199.6±1.35ns        ? ?/sec
or_sliced_1      1.00  1123.3±11.82ns        ? ?/sec    1.02  1141.6±16.14ns        ? ?/sec
or_sliced_24     1.00    282.8±1.55ns        ? ?/sec    1.01    286.4±1.70ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (cf32fcb) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    209.3±2.38ns        ? ?/sec    1.48    310.3±1.11ns        ? ?/sec
and_sliced_1     1.01   1096.9±4.72ns        ? ?/sec    1.00   1088.9±8.76ns        ? ?/sec
and_sliced_24    1.00    244.8±0.94ns        ? ?/sec    1.35    330.2±3.40ns        ? ?/sec
not              1.02    146.4±1.28ns        ? ?/sec    1.00    143.2±1.52ns        ? ?/sec
not_slice_24     1.12    194.0±0.38ns        ? ?/sec    1.00    172.8±0.68ns        ? ?/sec
not_sliced_1     3.58    619.9±7.97ns        ? ?/sec    1.00    173.0±1.64ns        ? ?/sec
or               1.00    196.7±1.83ns        ? ?/sec    1.01    199.4±0.89ns        ? ?/sec
or_sliced_1      1.00   1098.2±3.17ns        ? ?/sec    1.04  1141.9±19.01ns        ? ?/sec
or_sliced_24     1.00    244.3±0.89ns        ? ?/sec    1.18    287.6±2.47ns        ? ?/sec

@Dandandan

FYI @alamb, I think it's as good as it can be now.


let aligned_start = &src.as_ref()[aligned_offset / 8..slice_end];

let (prefix, aligned_u64s, suffix) = unsafe { aligned_start.as_ref().align_to::<u64>() };
@Dandandan Dandandan Feb 6, 2026

I think the previous benchmark results are sometimes noisy because the underlying buffer is not always aligned to 64 bits (not 100% sure, but it would explain it), so the code sometimes takes the slower (unaligned) path.
Now the slower path is not much slower. We probably also want to make sure the array creation path aligns to u64 in most cases, and that kernels preserve the alignment.
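
An illustrative check (not part of the PR): whether an allocation happens to start on an 8-byte boundary decides which path the kernel takes.

// Does this slice start on a u64 boundary?
fn is_u64_aligned(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % std::mem::align_of::<u64>() == 0
}

fn main() {
    let v = vec![0u8; 64];
    println!("u64-aligned: {}", is_u64_aligned(&v));
}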

@Dandandan

@jhorstmann perhaps you want to take a look?

@alamb alamb changed the title from "Optimize from_bitwise_unary_op" to "Optimize from_bitwise_unary_op for byte aligned case" Feb 8, 2026
@alamb alamb left a comment

This is very clever @Dandandan -- thank you

I don't understand the changes to the binary operations, and I do wonder if the "not creating aligned output" change is a concern.

bit_offset: 0,
bit_len: self.bit_len,
}
BooleanBuffer::from_bitwise_binary_op(
Contributor

This change seems unrelated to the improvements in bitwise unary op and is perhaps the source of the ~50% reported slowdown of and?

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    209.3±2.38ns        ? ?/sec    1.48    310.3±1.11ns        ? ?/sec

Contributor Author

Hm, not sure the slowdown is due to this, but I agree the changes look unneeded for this PR.

@Dandandan Dandandan Feb 8, 2026

I think the results for and might be noisy, just like the earlier results for not were noisy: it sometimes hits the aligned case and sometimes not (depending on whether the buffer happens to be allocated aligned or not).

(See the same performance in an earlier run:)

and              1.01    209.9±3.67ns        ? ?/sec    1.00    208.4±0.58ns        ? ?/sec

Also, the implementation of buffer_bin_and is currently as follows (suggesting the difference should indeed be due to noise):

BooleanBuffer::from_bitwise_binary_op(
    left,
    left_offset_in_bits,
    right,
    right_offset_in_bits,
    len_in_bits,
    |a, b| a & b,
)
.into_inner()

Contributor Author

(We can make the same change for the binary case, I think the speedup there might be even ~5x)

}

BooleanBuffer::from_bits(self.as_slice(), offset, len).into_inner()
let chunks = self.bit_chunks(offset, len);
Contributor

This change also seems unrelated -- perhaps we can pull it into its own PR

Contributor Author

This is required to make it work (and the tests pass), as into_inner throws away the bit offset and length.
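
A sketch of why, assuming the usual arrow_buffer BitChunks API (the normalize helper is hypothetical): bit_chunks yields offset-adjusted u64 words plus remainder bits, so the rebuilt buffer starts at bit offset 0, whereas into_inner returns the raw buffer and silently drops any offset and length:

use arrow_buffer::Buffer;

// Hypothetical helper: re-materialize an offset view so it starts at bit 0.
fn normalize(buffer: &Buffer, offset: usize, len: usize) -> Buffer {
    let chunks = buffer.bit_chunks(offset, len);
    // Whole 64-bit words, already shifted to account for the bit offset.
    let mut words: Vec<u64> = chunks.iter().collect();
    if chunks.remainder_len() > 0 {
        // Trailing bits, zero-padded into one final word.
        words.push(chunks.remainder_bits());
    }
    Buffer::from_vec(words)
}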

let remainder = chunks.remainder();
let iter = chunks.map(|c| u64::from_le_bytes(c.try_into().unwrap()));
let vec_u64s: Vec<u64> = if remainder.is_empty() {
iter.map(&mut op).collect()
Contributor

In theory the remainder should never be empty, right? Otherwise the aligned path above would be hit.

@Dandandan Dandandan Feb 8, 2026

Hm, I think the buffer itself (the address at offset 0) could still be unaligned to 64 bits and have a prefix in the path above (thus falling through to this path), while the entire buffer from beginning to end is still a multiple of 64 bits, in which case the remainder is empty.
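
For reference, a sketch of that unaligned path (hypothetical helper, mirroring the chunks_exact approach): when the remainder is non-empty it is zero-padded into one final word:

// Walk the byte slice in exact 8-byte chunks, apply `op` per u64 word, and
// pad any remainder bytes into a final word.
fn unary_words(bytes: &[u8], mut op: impl FnMut(u64) -> u64) -> Vec<u64> {
    let chunks = bytes.chunks_exact(8);
    let remainder = chunks.remainder();
    let mut out: Vec<u64> = chunks
        .map(|c| op(u64::from_le_bytes(c.try_into().unwrap())))
        .collect();
    if !remainder.is_empty() {
        let mut last = [0u8; 8];
        last[..remainder.len()].copy_from_slice(remainder);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}

fn main() {
    // 9 bytes: one exact chunk plus a 1-byte remainder.
    let bytes = [0xFFu8; 9];
    let words = unary_words(&bytes, |w| !w);
    assert_eq!(words.len(), 2);
    assert_eq!(words[0], 0);
    assert_eq!(words[1], !0x00000000000000FFu64);
}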

let result_u64s: Vec<u64> = aligned_u64s.iter().map(|l| op(*l)).collect();
let buffer = Buffer::from(result_u64s);
Some(BooleanBuffer::new(buffer, 0, len_in_bits))
BooleanBuffer::new(vec_u64s.into(), offset_in_bits % 64, len_in_bits)
Contributor

This is a key difference in the two approaches -- the current code on main will produce an output buffer that is aligned (offset is 0), but this code will produce an output buffer that is not aligned (same as the input)

That is probably why the benchmark results can be so much better in this case -- because the output is different (though still correct)

This is probably ok, but I wanted to point it out as a potential side effect.

@Dandandan Dandandan Feb 8, 2026

Yes, that's indeed the main reason (not bit-shifting to create an offset of 0 gives the ~3.5x speedup). The other part (~15% or so) comes from aligning to 8 bytes instead of 1 byte whenever possible, so the fast path is taken as often as possible.

I also found that the combination of collect/from_trusted_len_iterator with either iterator is slow, due to a non-existent implementation of fold (or the inability to use it in from_trusted_len_iterator), which probably still makes sense to PR separately, but with chunks_exact it isn't required.

bit_len: len_in_bits,
}
}
// align to byte boundaries
Contributor

This codepath appears untested by unit tests

cargo llvm-cov --html test -p arrow-buffer
[screenshot: llvm-cov HTML coverage report highlighting the untested codepath]

Contributor Author

Ah, I made the other path too unlikely by "aligning" the input to 64 bits; let's add a case for this.

@Dandandan Dandandan changed the title from "Optimize from_bitwise_unary_op for byte aligned case" to "Optimize from_bitwise_unary_op" Feb 8, 2026

Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Optimize from_bitwise_unary_op

3 participants