
Optimize from_bitwise_unary_op #9297

Open
Dandandan wants to merge 21 commits into apache:main from Dandandan:optimize_from_bitwise_unary_op

Conversation


@Dandandan Dandandan commented Jan 29, 2026

Which issue does this PR close?

Rationale for this change

This is much faster for non-byte-aligned offsets (not_sliced_1). It also rounds the offset down to 64 bits instead of to a byte boundary, so the aligned path is taken more often (not_slice_24):

main                                                    optimize_from_bitwise_unary_op
not_sliced_1     3.57    621.1±3.75ns        ? ?/sec    1.00    174.2±0.37ns        ? ?/sec
not_slice_24     1.13    194.2±0.64ns        ? ?/sec    1.00    172.2±0.78ns        ? ?/sec

What changes are included in this PR?

  • Change the code to use the 64-bit aligned (or aligned + suffix) path as much as possible (see the sketch after this list)
  • Speed up the non-aligned path using chunks_exact (stable since Rust 1.31)
  • Avoid truncation, which removes the need to use the suffix later
  • Update code that used the inner buffer and assumed truncation
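
To make the first bullet concrete, here is a minimal sketch (with hypothetical names, not the PR's actual code) of rounding a bit offset down to a 64-bit boundary so that align_to::<u64> sees no prefix:

// Sketch: round the starting bit offset down to a 64-bit boundary so the
// byte slice handed to `align_to::<u64>` starts on a whole word.
fn align_start(offset_in_bits: usize) -> (usize, usize) {
    // First byte of the u64 word containing `offset_in_bits`.
    let aligned_offset = offset_in_bits & !63; // round down to a multiple of 64
    let start_byte = aligned_offset / 8;
    // Bits still to skip inside the aligned region (always < 64).
    let bit_offset = offset_in_bits % 64;
    (start_byte, bit_offset)
}

fn main() {
    // A slice starting at bit 70 now reads from byte 8 (bit 64) with offset 6.
    assert_eq!(align_start(70), (8, 6));
    // A byte-aligned but not word-aligned start (bit 24) reads from byte 0.
    assert_eq!(align_start(24), (0, 24));
}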

Are these changes tested?

Are there any user-facing changes?

The inner buffer is no longer truncated to the exact number of bytes, but to a multiple of 64 bits, which is a small change.
However, given that BooleanArray is represented by a bit offset and a number of bits into its inner buffer, the extra padding should not be observable.
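
For illustration, the changed truncation can be thought of as rounding the kept length up to whole u64 words rather than whole bytes (a sketch, not the PR's code):

// Sketch: the values buffer is kept at a multiple of 8 bytes (one u64)
// instead of being trimmed to the exact byte length.
fn buffer_len_bytes(len_in_bits: usize) -> usize {
    // Round up to whole u64 words, then convert words to bytes.
    len_in_bits.div_ceil(64) * 8
}

fn main() {
    assert_eq!(buffer_len_bytes(1), 8); // previously 1 byte
    assert_eq!(buffer_len_bytes(65), 16); // previously 9 bytes
}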

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 29, 2026
@Dandandan Dandandan marked this pull request as draft January 29, 2026 20:19

Dandandan commented Jan 29, 2026

Need to address the issues (there might be code that does not expect the extra padding).
We could perhaps reintroduce the truncation.

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (585b9f8) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    208.7±5.12ns        ? ?/sec    1.01    211.5±5.11ns        ? ?/sec
and_sliced_1     1.01  1104.4±41.22ns        ? ?/sec    1.00   1095.5±2.44ns        ? ?/sec
and_sliced_24    1.00    245.8±1.88ns        ? ?/sec    1.36    335.4±0.46ns        ? ?/sec
not              1.01    145.9±0.42ns        ? ?/sec    1.00    144.8±0.28ns        ? ?/sec
not_slice_24     1.01    195.0±2.04ns        ? ?/sec    1.00    193.6±2.00ns        ? ?/sec
not_sliced_1     3.41    621.0±6.17ns        ? ?/sec    1.00    182.2±0.19ns        ? ?/sec
or               1.00    197.8±4.69ns        ? ?/sec    1.01    199.7±0.28ns        ? ?/sec
or_sliced_1      1.00  1101.3±19.05ns        ? ?/sec    1.03   1136.4±1.73ns        ? ?/sec
or_sliced_24     1.00    247.0±1.67ns        ? ?/sec    1.16    285.8±2.09ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (ccc9fe2) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.01    208.9±2.37ns        ? ?/sec    1.00    207.5±0.38ns        ? ?/sec
and_sliced_1     1.00   1095.8±1.65ns        ? ?/sec    1.00   1096.6±6.02ns        ? ?/sec
and_sliced_24    1.00    245.7±3.72ns        ? ?/sec    1.37    335.8±1.56ns        ? ?/sec
not              1.03    146.8±2.26ns        ? ?/sec    1.00    142.0±0.71ns        ? ?/sec
not_slice_24     1.04    195.6±2.39ns        ? ?/sec    1.00    188.3±0.33ns        ? ?/sec
not_sliced_1     3.48    620.1±2.75ns        ? ?/sec    1.00    178.0±5.03ns        ? ?/sec
or               1.00    197.3±0.53ns        ? ?/sec    1.01    198.8±2.65ns        ? ?/sec
or_sliced_1      1.00   1096.2±1.38ns        ? ?/sec    1.04   1135.9±3.96ns        ? ?/sec
or_sliced_24     1.00    246.7±0.50ns        ? ?/sec    1.17    289.1±3.16ns        ? ?/sec

@Dandandan Dandandan marked this pull request as ready for review February 5, 2026 19:40
@Dandandan Dandandan requested a review from alamb February 5, 2026 19:51
return result;
let (prefix, aligned_u64s, suffix) =
    unsafe { aligned_start.as_ref().align_to::<u64>() };
if prefix.is_empty() && suffix.is_empty() {
Contributor Author

Handling aligned + suffix could maybe be a bit faster on x86 (I couldn't measure a difference on an Apple M2, so I believe there is none there).
Handling both prefix + suffix was slightly slower than the unaligned version.
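
For readers following along, a self-contained sketch of this dispatch (illustrative, not the PR's code) looks like:

// Take the word-at-a-time path only when `align_to::<u64>` finds neither a
// prefix nor a suffix; otherwise fall back to a byte-wise loop.
fn count_ones(bytes: &[u8]) -> u32 {
    // SAFETY: reinterpreting the middle slice as u64 words is valid for u8 input.
    let (prefix, words, suffix) = unsafe { bytes.align_to::<u64>() };
    if prefix.is_empty() && suffix.is_empty() {
        // Fully aligned: operate on whole u64 words.
        words.iter().map(|w| w.count_ones()).sum()
    } else {
        // Unaligned fallback: operate byte by byte.
        bytes.iter().map(|b| b.count_ones()).sum()
    }
}

fn main() {
    let v = vec![0xFFu8; 16];
    assert_eq!(count_ones(&v), 128); // both paths agree
}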

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (df25192) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.02    212.4±3.72ns        ? ?/sec    1.00    207.2±0.88ns        ? ?/sec
and_sliced_1     1.01   1101.7±5.11ns        ? ?/sec    1.00   1091.6±1.36ns        ? ?/sec
and_sliced_24    1.00    248.1±4.08ns        ? ?/sec    1.34    332.7±1.07ns        ? ?/sec
not              1.04    148.9±3.67ns        ? ?/sec    1.00    143.0±0.99ns        ? ?/sec
not_slice_24     1.03    197.0±4.09ns        ? ?/sec    1.00    191.6±0.48ns        ? ?/sec
not_sliced_1     3.57    621.1±3.75ns        ? ?/sec    1.00    174.2±0.37ns        ? ?/sec
or               1.00    199.4±3.54ns        ? ?/sec    1.00    199.8±0.71ns        ? ?/sec
or_sliced_1      1.00  1112.4±44.84ns        ? ?/sec    1.02   1139.1±1.96ns        ? ?/sec
or_sliced_24     1.00    251.9±8.50ns        ? ?/sec    1.14    286.5±1.98ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (6e95b3a) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.02    209.7±5.22ns        ? ?/sec    1.00    206.1±0.60ns        ? ?/sec
and_sliced_1     1.00   1096.4±3.50ns        ? ?/sec    1.00  1092.0±20.97ns        ? ?/sec
and_sliced_24    1.00    245.4±1.05ns        ? ?/sec    1.34    329.6±1.54ns        ? ?/sec
not              1.01    146.2±0.59ns        ? ?/sec    1.00    144.6±2.28ns        ? ?/sec
not_slice_24     1.13    194.2±0.64ns        ? ?/sec    1.00    172.2±0.78ns        ? ?/sec
not_sliced_1     3.60    619.4±2.46ns        ? ?/sec    1.00    172.1±0.73ns        ? ?/sec
or               1.00    196.5±1.35ns        ? ?/sec    1.01    197.5±0.76ns        ? ?/sec
or_sliced_1      1.00  1100.6±14.77ns        ? ?/sec    1.04   1139.3±8.90ns        ? ?/sec
or_sliced_24     1.00    247.2±1.04ns        ? ?/sec    1.16    286.2±2.74ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (cf32fcb) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.01    209.9±3.67ns        ? ?/sec    1.00    208.4±0.58ns        ? ?/sec
and_sliced_1     1.01   1098.0±9.70ns        ? ?/sec    1.00   1088.3±8.18ns        ? ?/sec
and_sliced_24    1.00    245.0±1.25ns        ? ?/sec    1.34    329.3±2.20ns        ? ?/sec
not              1.67    239.3±2.81ns        ? ?/sec    1.00    143.1±1.07ns        ? ?/sec
not_slice_24     1.31    227.1±0.56ns        ? ?/sec    1.00    173.2±2.35ns        ? ?/sec
not_sliced_1     3.70    641.0±5.75ns        ? ?/sec    1.00    173.4±4.10ns        ? ?/sec
or               1.15    229.2±4.99ns        ? ?/sec    1.00    199.6±1.35ns        ? ?/sec
or_sliced_1      1.00  1123.3±11.82ns        ? ?/sec    1.02  1141.6±16.14ns        ? ?/sec
or_sliced_24     1.00    282.8±1.55ns        ? ?/sec    1.01    286.4±1.70ns        ? ?/sec

@Dandandan

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize_from_bitwise_unary_op (cf32fcb) to bd76edd diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize_from_bitwise_unary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    209.3±2.38ns        ? ?/sec    1.48    310.3±1.11ns        ? ?/sec
and_sliced_1     1.01   1096.9±4.72ns        ? ?/sec    1.00   1088.9±8.76ns        ? ?/sec
and_sliced_24    1.00    244.8±0.94ns        ? ?/sec    1.35    330.2±3.40ns        ? ?/sec
not              1.02    146.4±1.28ns        ? ?/sec    1.00    143.2±1.52ns        ? ?/sec
not_slice_24     1.12    194.0±0.38ns        ? ?/sec    1.00    172.8±0.68ns        ? ?/sec
not_sliced_1     3.58    619.9±7.97ns        ? ?/sec    1.00    173.0±1.64ns        ? ?/sec
or               1.00    196.7±1.83ns        ? ?/sec    1.01    199.4±0.89ns        ? ?/sec
or_sliced_1      1.00   1098.2±3.17ns        ? ?/sec    1.04  1141.9±19.01ns        ? ?/sec
or_sliced_24     1.00    244.3±0.89ns        ? ?/sec    1.18    287.6±2.47ns        ? ?/sec

@Dandandan

FYI @alamb, I think it's as good as it can be now.


let aligned_start = &src.as_ref()[aligned_offset / 8..slice_end];

let (prefix, aligned_u64s, suffix) = unsafe { aligned_start.as_ref().align_to::<u64>() };
@Dandandan Dandandan Feb 6, 2026

I think the previous benchmark results are sometimes noisy because the underlying buffer is not always aligned to 64 bits (not 100% sure, but it would explain it), so the code sometimes takes the slower (unaligned) path.
Now the slower path is not much slower. We probably also want to make sure the array creation path aligns to u64 in most cases, and that kernels preserve the alignment.
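
An illustrative check (not part of the PR): whether an allocation happens to start on an 8-byte boundary decides which path the kernel takes.

// Does this slice start on a u64 boundary?
fn is_u64_aligned(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % std::mem::align_of::<u64>() == 0
}

fn main() {
    let v = vec![0u8; 64];
    println!("u64-aligned: {}", is_u64_aligned(&v));
}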

@Dandandan

@jhorstmann perhaps you want to take a look?

@alamb alamb changed the title from "Optimize from_bitwise_unary_op" to "Optimize from_bitwise_unary_op for byte aligned case" Feb 8, 2026
@alamb alamb left a comment

This is very clever @Dandandan -- thank you

I don't understand the changes to the binary operations, and I do wonder if the "not creating aligned output" change is a concern.

bit_offset: 0,
bit_len: self.bit_len,
}
BooleanBuffer::from_bitwise_binary_op(
Contributor

This change seems unrelated to the improvements in bitwise unary op and is perhaps the source of the ~50% reported slowdown of and?

group            main                                   optimize_from_bitwise_unary_op
-----            ----                                   ------------------------------
and              1.00    209.3±2.38ns        ? ?/sec    1.48    310.3±1.11ns        ? ?/sec

Contributor Author

Hm, not sure the slowdown is due to this, but I agree the changes look unneeded for this PR.

@Dandandan Dandandan Feb 8, 2026

I think the results for and might be noisy, just like the earlier results for not were noisy: it sometimes hits the aligned case and sometimes not (depending on whether the buffer happens to be allocated aligned or not).

(See the same performance in an earlier run:)

and              1.01    209.9±3.67ns        ? ?/sec    1.00    208.4±0.58ns        ? ?/sec

Also, the implementation of buffer_bin_and is currently as follows (suggesting the difference should indeed be due to noise):

BooleanBuffer::from_bitwise_binary_op(
    left,
    left_offset_in_bits,
    right,
    right_offset_in_bits,
    len_in_bits,
    |a, b| a & b,
)
.into_inner()

Contributor Author

(We can make the same change for the binary case, I think the speedup there might be even ~5x)

}

BooleanBuffer::from_bits(self.as_slice(), offset, len).into_inner()
let chunks = self.bit_chunks(offset, len);
Contributor

This change also seems unrelated -- perhaps we can pull it into its own PR

Contributor Author

This is required to make it work (and the tests pass), as into_inner throws away the bit offset and length.
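
A sketch of why, assuming the usual arrow_buffer BitChunks API (the normalize helper is hypothetical): bit_chunks yields offset-adjusted u64 words plus remainder bits, so the rebuilt buffer starts at bit offset 0, whereas into_inner returns the raw buffer and silently drops any offset and length:

use arrow_buffer::Buffer;

// Hypothetical helper: re-materialize an offset view so it starts at bit 0.
fn normalize(buffer: &Buffer, offset: usize, len: usize) -> Buffer {
    let chunks = buffer.bit_chunks(offset, len);
    // Whole 64-bit words, already shifted to account for the bit offset.
    let mut words: Vec<u64> = chunks.iter().collect();
    if chunks.remainder_len() > 0 {
        // Trailing bits, zero-padded into one final word.
        words.push(chunks.remainder_bits());
    }
    Buffer::from_vec(words)
}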

let remainder = chunks.remainder();
let iter = chunks.map(|c| u64::from_le_bytes(c.try_into().unwrap()));
let vec_u64s: Vec<u64> = if remainder.is_empty() {
iter.map(&mut op).collect()
Contributor

In theory the remainder should never be empty, right? Otherwise the aligned path above would be hit.

@Dandandan Dandandan Feb 8, 2026

Hm, I think the buffer itself (the address at offset 0) could still be unaligned to 64 bits and have a prefix in the path above (thus falling through to this path), while the entire buffer from beginning to end is still a multiple of 64 bits, in which case the remainder is empty.
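
For reference, a sketch of that unaligned path (hypothetical helper, mirroring the chunks_exact approach): when the remainder is non-empty it is zero-padded into one final word:

// Walk the byte slice in exact 8-byte chunks, apply `op` per u64 word, and
// pad any remainder bytes into a final word.
fn unary_words(bytes: &[u8], mut op: impl FnMut(u64) -> u64) -> Vec<u64> {
    let chunks = bytes.chunks_exact(8);
    let remainder = chunks.remainder();
    let mut out: Vec<u64> = chunks
        .map(|c| op(u64::from_le_bytes(c.try_into().unwrap())))
        .collect();
    if !remainder.is_empty() {
        let mut last = [0u8; 8];
        last[..remainder.len()].copy_from_slice(remainder);
        out.push(op(u64::from_le_bytes(last)));
    }
    out
}

fn main() {
    // 9 bytes: one exact chunk plus a 1-byte remainder.
    let bytes = [0xFFu8; 9];
    let words = unary_words(&bytes, |w| !w);
    assert_eq!(words.len(), 2);
    assert_eq!(words[0], 0);
    assert_eq!(words[1], !0x00000000000000FFu64);
}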

let result_u64s: Vec<u64> = aligned_u64s.iter().map(|l| op(*l)).collect();
let buffer = Buffer::from(result_u64s);
Some(BooleanBuffer::new(buffer, 0, len_in_bits))
BooleanBuffer::new(vec_u64s.into(), offset_in_bits % 64, len_in_bits)
Contributor

This is a key difference in the two approaches -- the current code on main will produce an output buffer that is aligned (offset is 0), but this code will produce an output buffer that is not aligned (same as the input)

That is probably why the benchmark results can be so much better in this case -- because the output is different (though still correct)

This is probably ok, but I wanted to point it out as a potential side effect.

@Dandandan Dandandan Feb 8, 2026

Yes, that's indeed the main reason (not bit-shifting to create an offset of 0 gives the ~3.5x speedup). The other part (~15% or so) comes from aligning to 8 bytes instead of 1 byte whenever possible, so the fast path is taken as often as possible.

I also found that the combination of collect/from_trusted_len_iterator with either iterator is slow, due to a non-existent implementation of fold (or the inability to use it in from_trusted_len_iterator), which probably still makes sense to PR separately, but with chunks_exact it isn't required.

bit_len: len_in_bits,
}
}
// align to byte boundaries
Contributor

This codepath appears untested by unit tests

cargo llvm-cov --html test -p arrow-buffer
[screenshot: llvm-cov HTML coverage report highlighting the untested codepath]

Contributor Author

Ah, I made the other path too unlikely by "aligning" the input to 64 bits; let's add a case for this.

@Dandandan Dandandan changed the title from "Optimize from_bitwise_unary_op for byte aligned case" to "Optimize from_bitwise_unary_op" Feb 8, 2026

Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Optimize from_bitwise_unary_op

3 participants