Speed up binary kernels by XXX%, add BooleanBuffer::from_bitwise_binary #9022
Conversation
Force-pushed from 04045d2 to bedcfd6.
🤖
🤖: Benchmark completed (Details)

run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)

run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)
run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)

run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)
```rust
    F: FnMut(u64, u64) -> u64,
{
    // Safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
```
Do you know how much of a difference the full u64 alignment really makes on current target architectures? I think unaligned loads on x86 are not really slower; they may only be penalized when they cross a cache line, which should happen at most every 8th load. With that assumption, using slice::as_chunks and u64::from_le_bytes on each chunk might be simpler, and would then only require handling a suffix.
Staying with align_to and bailing on non-empty prefixes should also be fine; I would expect buffers to be aligned most of the time, but having a byte length that is a multiple of 8 might be less likely.
Other than this nit the code looks good to me 👍
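(For reference, a minimal sketch of the suggested `as_chunks` shape — an illustration, not code from this PR. `bitwise_binary` and `combine` are hypothetical names, and `slice::as_chunks` requires Rust 1.88+:)

```rust
// Sketch only: combine 8-byte chunks of two packed bitmaps with a bitwise op.
fn bitwise_binary(left: &[u8], right: &[u8], mut combine: impl FnMut(u64, u64) -> u64) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let (left_chunks, left_rem) = left.as_chunks::<8>();
    let (right_chunks, right_rem) = right.as_chunks::<8>();
    let mut out = Vec::with_capacity(left.len());
    for (l, r) in left_chunks.iter().zip(right_chunks) {
        // Unaligned 8-byte loads; cheap on x86 unless they straddle a cache line
        let v = combine(u64::from_le_bytes(*l), u64::from_le_bytes(*r));
        out.extend_from_slice(&v.to_le_bytes());
    }
    // Only a suffix remains (no prefix); apply the op on zero-extended bytes
    for (l, r) in left_rem.iter().zip(right_rem) {
        out.push(combine(*l as u64, *r as u64) as u8);
    }
    out
}
```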
TLDR is I don't really know how much of a difference this makes, and I am struggling to measure it.

Right now, the benchmarks seem super sensitive to any change. For example, in the most recent version of this PR I simply moved the code and added the check for alignment, and that resulted in consistently reproducible slowdowns on some of the sliced benchmarks:
```text
and             1.00   211.5±1.92ns   ? ?/sec    1.30    275.5±4.61ns   ? ?/sec
and_sliced_1    1.00  1095.6±3.62ns   ? ?/sec    1.12  1228.6±12.32ns   ? ?/sec
and_sliced_24   1.39   337.9±3.92ns   ? ?/sec    1.00    243.4±1.74ns   ? ?/sec
```
> With that assumption, using slice::as_chunks and u64::from_le_bytes on each chunk might be simpler, and would then only require handling a suffix.
I tried a version like this and will see how it performs:

```rust
/// Like [`Self::from_bitwise_binary_op`] but optimized for the case where the
/// inputs are fully u64 (word) aligned; unaligned inputs are handled with
/// 8-byte chunks and unaligned loads
fn try_from_aligned_bitwise_binary_op<F>(
    left: &[u8],
    right: &[u8],
    len_in_bits: usize,
    op: &mut F,
) -> Option<Self>
where
    F: FnMut(u64, u64) -> u64,
{
    // trim to length
    let len_in_bytes = ceil(len_in_bits, 8);
    let left = &left[0..len_in_bytes];
    let right = &right[0..len_in_bytes];
    // Safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    let (right_prefix, right_u64s, right_suffix) = unsafe { right.align_to::<u64>() };
    if left_prefix.is_empty()
        && right_prefix.is_empty()
        && left_suffix.is_empty()
        && right_suffix.is_empty()
    {
        // the buffers are word (64 bit) aligned, so use optimized Vec code
        let result_u64s = left_u64s
            .iter()
            .zip(right_u64s.iter())
            .map(|(l, r)| op(*l, *r))
            .collect::<Vec<u64>>();
        Some(BooleanBuffer::new(Buffer::from(result_u64s), 0, len_in_bits))
    } else {
        // not word aligned: process 8-byte chunks via unaligned loads
        let (left_slices, left_remainder) = left.as_chunks::<8>();
        let (right_slices, right_remainder) = right.as_chunks::<8>();
        debug_assert_eq!(left_slices.len(), right_slices.len());
        debug_assert_eq!(left_remainder.len(), right_remainder.len());
        let mut mutable_result =
            MutableBuffer::with_capacity(left_slices.len() * 8 + left_remainder.len());
        mutable_result.extend_from_iter(
            left_slices
                .iter()
                .zip(right_slices.iter())
                .map(|(l, r)| op(u64::from_le_bytes(*l), u64::from_le_bytes(*r))),
        );
        if !left_remainder.is_empty() {
            // zero-pad the trailing partial chunk to 8 bytes, apply the op,
            // and keep only the bytes that correspond to real input
            let rem = op(
                u64::from_le_bytes({
                    let mut bytes = [0u8; 8];
                    bytes[..left_remainder.len()].copy_from_slice(left_remainder);
                    bytes
                }),
                u64::from_le_bytes({
                    let mut bytes = [0u8; 8];
                    bytes[..right_remainder.len()].copy_from_slice(right_remainder);
                    bytes
                }),
            );
            mutable_result.extend_from_slice(&rem.to_le_bytes()[..left_remainder.len()]);
        }
        Some(BooleanBuffer {
            buffer: Buffer::from(mutable_result),
            offset: 0,
            len: len_in_bits,
        })
    }
}
```

Sadly, it seems like we are not going to be able to use `as_chunks`:
```text
error: current MSRV (Minimum Supported Rust Version) is 1.85.0 but this item is stable since `1.88.0`
   --> arrow-buffer/src/buffer/boolean.rs:335:54
    |
335 |             let (left_slices, left_remainder) = left.as_chunks::<8>();
    |                                                      ^^^^^^^^^^^^^^^^
    |
    = note: you may want to conditionally increase the MSRV considered by Clippy using the `clippy::msrv` attribute
    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#incompatible_msrv
    = note: `-D clippy::incompatible-msrv` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::incompatible_msrv)]`
```
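(A minimal sketch of an MSRV-compatible alternative — an illustration, not code from this PR. `slice::chunks_exact` has been stable since Rust 1.31 and yields the same chunk/remainder split, at the cost of a `try_into` per chunk; the function and parameter names here are hypothetical:)

```rust
// Sketch only: same shape as the as_chunks version, but MSRV-friendly.
fn bitwise_binary_msrv(left: &[u8], right: &[u8], mut op: impl FnMut(u64, u64) -> u64) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    let mut left_it = left.chunks_exact(8);
    let mut right_it = right.chunks_exact(8);
    for (l, r) in (&mut left_it).zip(&mut right_it) {
        // try_into converts the &[u8] chunk to [u8; 8]; it cannot fail here
        let l = u64::from_le_bytes(l.try_into().unwrap());
        let r = u64::from_le_bytes(r.try_into().unwrap());
        out.extend_from_slice(&op(l, r).to_le_bytes());
    }
    // chunks_exact exposes the trailing partial chunk directly
    for (l, r) in left_it.remainder().iter().zip(right_it.remainder()) {
        out.push(op(*l as u64, *r as u64) as u8);
    }
    out
}
```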
run benchmark boolean_kernels

🤖
I may try to rejigger the boolean kernel benchmarks to include multiple allocations / sizes to try to get a better sense of this code.
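(A sketch of what that reshaped benchmark could look like — illustrative only, not this PR's actual benchmark code. Sweeping lengths and slice offsets means a single allocation's alignment can't dominate the results:)

```rust
use arrow::array::BooleanArray;
use arrow::compute::and;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_and(c: &mut Criterion) {
    // Vary both total length and slice offset so each measurement covers
    // several alignments rather than one lucky (or unlucky) allocation
    for len in [1 << 10, 1 << 14, 1 << 20] {
        let left = BooleanArray::from(vec![true; len + 64]);
        let right = BooleanArray::from(vec![false; len + 64]);
        for offset in [0usize, 1, 24] {
            let l = left.slice(offset, len);
            let r = right.slice(offset, len);
            c.bench_function(&format!("and_len_{len}_offset_{offset}"), |b| {
                b.iter(|| and(&l, &r).unwrap())
            });
        }
    }
}

criterion_group!(benches, bench_and);
criterion_main!(benches);
```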
🤖: Benchmark completed (Details)
Ok, given that totally specializing the slicing didn't help, I am now 100% convinced that the difference in performance reported for and_sliced_24 is due to alignment or some other crazy thing. What I plan to do now is:
I made a PR to try to make the benchmarks more reproducible:
Which issue does this PR close?
- `Buffer::from_bitwise_unary` and `Buffer::from_bitwise_binary` me… #8854

This is the next step after:
- Speed up `not` kernel by 50%, add `BooleanBuffer::from_bitwise_unary` #8996

Rationale for this change
Also, it is hard to find the code to create new Buffers by applying bitwise unary operations.
What changes are included in this PR?
- `BooleanBuffer::from_bitwise_binary`
- `bitwise_bin_op_helper`
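(A hypothetical usage sketch: the final public signature of `from_bitwise_binary` is not shown in this thread, and is assumed here to take two packed-bitmap byte slices, a length in bits, and a bitwise op, mirroring the private helper above:)

```rust
use arrow_buffer::BooleanBuffer;

// Hypothetical: this signature is assumed, not confirmed by this thread
let left: &[u8] = &[0b1010_1010, 0b1111_0000];
let right: &[u8] = &[0b1100_1100, 0b0000_1111];
let anded = BooleanBuffer::from_bitwise_binary(left, right, 16, |l, r| l & r);
assert_eq!(anded.len(), 16);
```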
Are these changes tested?
Yes, new tests are added.
Performance results show XXX performance improvements.
Are there any user-facing changes?