Conversation

@alamb (Contributor) commented Dec 20, 2025

Which issue does this PR close?

This is the next step after

Rationale for this change

  • The current implementation of the binary kernels has an unnecessary extra allocation when operating on sliced data.
  • We can also help Rust / LLVM generate better code by processing u64 words at a time when the buffer is already u64 aligned (see WIP: special case bitwise ops when buffers are u64 aligned #8807)

Also, it is hard to find the code to create new Buffers by applying bitwise unary operations.
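For reference, the word-at-a-time idea looks roughly like the following. This is a minimal, self-contained sketch (not the code in this PR), using bitwise OR as the example operation: `align_to` reinterprets the byte buffers as u64 words, and the fast path is taken only when both inputs are fully aligned.

```rust
/// Minimal sketch of word-at-a-time processing (illustrative only, not the PR's code).
fn bitwise_or(left: &[u8], right: &[u8]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    // SAFETY: every byte pattern is a valid u64, so reinterpreting bytes as u64s is sound
    let (l_pre, l_words, l_suf) = unsafe { left.align_to::<u64>() };
    let (r_pre, r_words, r_suf) = unsafe { right.align_to::<u64>() };
    if l_pre.is_empty() && r_pre.is_empty() && l_suf.is_empty() && r_suf.is_empty() {
        // Both buffers are u64 aligned: operate on whole words, which LLVM
        // can vectorize much more aggressively than a byte-at-a-time loop.
        l_words
            .iter()
            .zip(r_words)
            .flat_map(|(l, r)| (l | r).to_ne_bytes())
            .collect()
    } else {
        // Fallback: process one byte at a time
        left.iter().zip(right).map(|(l, r)| l | r).collect()
    }
}
```

The actual kernels additionally have to account for bit offsets introduced by slicing, which is where the extra allocation in the current implementation comes from.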

What changes are included in this PR?

  • Introduce optimized BooleanBuffer::from_bitwise_binary
  • Deprecate bitwise_bin_op_helper

Are these changes tested?

Yes, new tests are added.

Performance results show XXX improvements.

Are there any user-facing changes?

@github-actions bot added the arrow (Changes to the arrow crate) label Dec 20, 2025
@alamb force-pushed the alamb/from_bitwise_binary_op branch from 04045d2 to bedcfd6 on December 20, 2025 21:18
@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/from_bitwise_binary_op (edfc085) to b318293 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_from_bitwise_binary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            alamb_from_bitwise_binary_op            main
-----            ----------------------------            ----
and              1.00    209.4±4.96ns        ? ?/sec     1.30    272.5±1.15ns        ? ?/sec
and_sliced_1     1.14  1246.2±46.41ns        ? ?/sec     1.00  1097.4±16.34ns        ? ?/sec
and_sliced_24    1.40    337.0±2.76ns        ? ?/sec     1.00    240.1±1.03ns        ? ?/sec
not              1.00    146.3±1.41ns        ? ?/sec     1.00    145.7±0.62ns        ? ?/sec
not_slice_24     1.21    235.0±1.69ns        ? ?/sec     1.00    194.7±2.19ns        ? ?/sec
not_sliced_1     1.01   637.3±11.13ns        ? ?/sec     1.00   628.2±52.10ns        ? ?/sec
or               1.00    200.8±0.72ns        ? ?/sec     1.27    254.1±3.82ns        ? ?/sec
or_sliced_1      1.19  1459.4±238.94ns        ? ?/sec    1.00   1227.2±7.17ns        ? ?/sec
or_sliced_24     1.17    284.2±4.66ns        ? ?/sec     1.00    243.3±1.10ns        ? ?/sec

@apache deleted a comment from alamb-ghbot Dec 20, 2025
@apache deleted a comment from alamb-ghbot Dec 20, 2025
@alamb (Contributor, Author) commented Dec 20, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/from_bitwise_binary_op (edfc085) to b318293 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_from_bitwise_binary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            alamb_from_bitwise_binary_op           main
-----            ----------------------------           ----
and              1.00    208.6±3.27ns        ? ?/sec    1.31    272.5±1.45ns        ? ?/sec
and_sliced_1     1.13  1242.2±20.53ns        ? ?/sec    1.00  1098.4±15.61ns        ? ?/sec
and_sliced_24    1.40    335.6±1.64ns        ? ?/sec    1.00    240.2±0.64ns        ? ?/sec
not              1.00    146.5±2.52ns        ? ?/sec    1.00    146.7±6.04ns        ? ?/sec
not_slice_24     1.21    235.6±6.02ns        ? ?/sec    1.00    194.6±1.80ns        ? ?/sec
not_sliced_1     1.02   636.8±12.25ns        ? ?/sec    1.00    621.6±9.92ns        ? ?/sec
or               1.00    201.0±0.64ns        ? ?/sec    1.26    253.5±0.61ns        ? ?/sec
or_sliced_1      1.15  1410.3±20.10ns        ? ?/sec    1.00  1231.2±42.98ns        ? ?/sec
or_sliced_24     1.16    283.3±2.77ns        ? ?/sec    1.00    243.3±1.87ns        ? ?/sec

@alamb (Contributor, Author) commented Dec 20, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/from_bitwise_binary_op (be3a64f) to b318293 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_from_bitwise_binary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            alamb_from_bitwise_binary_op           main
-----            ----------------------------           ----
and              1.00    211.1±3.73ns        ? ?/sec    1.30    273.9±6.37ns        ? ?/sec
and_sliced_1     1.15   1270.1±7.09ns        ? ?/sec    1.00  1100.2±15.18ns        ? ?/sec
and_sliced_24    1.41    342.8±2.71ns        ? ?/sec    1.00   243.3±10.39ns        ? ?/sec
not              1.00    146.1±2.86ns        ? ?/sec    1.00    146.1±1.33ns        ? ?/sec
not_slice_24     1.22    238.5±4.66ns        ? ?/sec    1.00    195.6±3.54ns        ? ?/sec
not_sliced_1     1.03    639.3±8.10ns        ? ?/sec    1.00    620.5±1.10ns        ? ?/sec
or               1.00    200.5±4.42ns        ? ?/sec    1.28    256.6±2.84ns        ? ?/sec
or_sliced_1      1.07  1312.0±10.19ns        ? ?/sec    1.00   1229.3±8.35ns        ? ?/sec
or_sliced_24     1.19    291.5±3.72ns        ? ?/sec    1.00    245.0±3.92ns        ? ?/sec

@alamb changed the title from "Speed up binary not kernels by XXX%, add BooleanBuffer::from_bitwise_binary" to "Speed up binary kernels by XXX%, add BooleanBuffer::from_bitwise_binary" Dec 21, 2025
@alamb (Contributor, Author) commented Dec 21, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/from_bitwise_binary_op (2b7f80d) to b318293 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_from_bitwise_binary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            alamb_from_bitwise_binary_op           main
-----            ----------------------------           ----
and              1.00    241.3±2.31ns        ? ?/sec    1.14    275.2±1.60ns        ? ?/sec
and_sliced_1     1.03  1273.2±20.36ns        ? ?/sec    1.00  1241.0±103.60ns        ? ?/sec
and_sliced_24    1.41    342.9±3.59ns        ? ?/sec    1.00    243.8±2.40ns        ? ?/sec
not              1.01    147.2±2.85ns        ? ?/sec    1.00    146.3±0.74ns        ? ?/sec
not_slice_24     1.24    240.1±1.70ns        ? ?/sec    1.00    193.6±1.75ns        ? ?/sec
not_sliced_1     1.03    639.6±4.34ns        ? ?/sec    1.00    621.0±9.94ns        ? ?/sec
or               1.00    202.0±2.89ns        ? ?/sec    1.23    249.2±1.59ns        ? ?/sec
or_sliced_1      1.19  1315.0±16.95ns        ? ?/sec    1.00  1101.9±20.25ns        ? ?/sec
or_sliced_24     1.20    292.0±2.20ns        ? ?/sec    1.00   243.6±21.65ns        ? ?/sec

@alamb (Contributor, Author) commented Dec 21, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/from_bitwise_binary_op (3fa4683) to b318293 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_from_bitwise_binary_op
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group            alamb_from_bitwise_binary_op           main
-----            ----------------------------           ----
and              1.00    211.5±1.92ns        ? ?/sec    1.30    275.5±4.61ns        ? ?/sec
and_sliced_1     1.00   1095.6±3.62ns        ? ?/sec    1.12  1228.6±12.32ns        ? ?/sec
and_sliced_24    1.39    337.9±3.92ns        ? ?/sec    1.00    243.4±1.74ns        ? ?/sec
not              1.00    146.0±1.14ns        ? ?/sec    1.00    146.0±0.79ns        ? ?/sec
not_slice_24     1.21    233.3±1.16ns        ? ?/sec    1.00    193.2±0.92ns        ? ?/sec
not_sliced_1     1.02    633.1±2.22ns        ? ?/sec    1.00    619.7±6.82ns        ? ?/sec
or               1.00    200.2±0.67ns        ? ?/sec    1.25    249.6±2.64ns        ? ?/sec
or_sliced_1      1.03   1134.5±3.95ns        ? ?/sec    1.00   1099.2±9.42ns        ? ?/sec
or_sliced_24     1.20    287.0±1.10ns        ? ?/sec    1.00    239.1±0.89ns        ? ?/sec

F: FnMut(u64, u64) -> u64,
{
// Safety: all valid bytes are valid u64s
let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
Contributor

Do you know how much of a difference the full u64 alignment really makes on current target architectures? I think unaligned loads on x86 are not really slower. Maybe only if they cross a cache line, which should happen every 8th load. With that assumption, using slice::as_chunks and u64::from_le_bytes on each chunk might be simpler, and would then only require handling a suffix.

Staying with align_to and bailing on non-empty prefixes should also be fine, I would expect buffers to be aligned most of the time, but having a byte length that is a multiple of 8 might be less likely.

Other than this nit the code looks good to me 👍

@alamb (Contributor, Author)

TL;DR: I don't really know how much of a difference this makes, and I am struggling to measure the difference.

Right now, the benchmarks seem super sensitive to any change. For example, in the most recent version of this PR I simply moved the code and added the check for alignment, and this results in consistently reproducible slowdowns on some of the sliced benchmarks:

and              1.00    211.5±1.92ns        ? ?/sec    1.30    275.5±4.61ns        ? ?/sec
and_sliced_1     1.00   1095.6±3.62ns        ? ?/sec    1.12  1228.6±12.32ns        ? ?/sec
and_sliced_24    1.39    337.9±3.92ns        ? ?/sec    1.00    243.4±1.74ns        ? ?/sec

With that assumption, using slice::as_chunks and u64::from_le_bytes on each chunk might be simpler, and would then only require handling a suffix.

I tried a version like this and will see how it performs:

    /// Like [`Self::from_bitwise_binary_op`] but optimized for the case where the
    /// inputs are aligned to byte boundaries
    ///
    /// Returns `None` if the inputs are not fully u64 aligned
    fn try_from_aligned_bitwise_binary_op<F>(
        left: &[u8],
        right: &[u8],
        len_in_bits: usize,
        op: &mut F,
    ) -> Option<Self>
    where
        F: FnMut(u64, u64) -> u64,
    {
        // trim to length
        let len_in_bytes = ceil(len_in_bits, 8);
        let left = &left[0..len_in_bytes];
        let right = &right[0..len_in_bytes];
        // Safety: all valid bytes are valid u64s
        let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
        let (right_prefix, right_u64s, right_suffix) = unsafe { right.align_to::<u64>() };
        if left_prefix.is_empty()
            && right_prefix.is_empty()
            && left_suffix.is_empty()
            && right_suffix.is_empty()
        {
            // the buffers are word (64 bit) aligned, so use optimized Vec code.
            let result_u64s = left_u64s
                .iter()
                .zip(right_u64s.iter())
                .map(|(l, r)| op(*l, *r))
                .collect::<Vec<u64>>();
            Some(BooleanBuffer::new(
                Buffer::from(result_u64s),
                0,
                len_in_bits,
            ))
        } else {
            let (left_slices, left_remainder) = left.as_chunks::<8>();
            let (right_slices, right_remainder) = right.as_chunks::<8>();
            debug_assert_eq!(left_slices.len(), right_slices.len());
            debug_assert_eq!(left_remainder.len(), right_remainder.len());
            let mut mutable_result = MutableBuffer::with_capacity(left_slices.len() * 8 + left_remainder.len());
            mutable_result.extend_from_iter(left_slices.iter().zip(right_slices.iter())
                .map(|(l,r)| {
                    op(u64::from_le_bytes(*l),
                       u64::from_le_bytes(*r))
            }));
            if !left_remainder.is_empty() {
                let rem = op(
                    u64::from_le_bytes({
                        let mut bytes = [0u8; 8];
                        bytes[..left_remainder.len()].copy_from_slice(left_remainder);
                        bytes
                    }),
                    u64::from_le_bytes({
                        let mut bytes = [0u8; 8];
                        bytes[..right_remainder.len()].copy_from_slice(right_remainder);
                        bytes
                    }),
                );
                println!("copying {} remainder bytes: {:?}", left_remainder.len(), &rem.to_le_bytes()[..left_remainder.len()]);
                mutable_result.extend_from_slice(&rem.to_le_bytes()[..left_remainder.len()]);
            }
            Some(BooleanBuffer {
                buffer: Buffer::from(mutable_result),
                offset: 0,
                len: len_in_bits,
            })
        }
    }

Sadly, it seems like we are not going to be able to use as_chunks:

```rust
error: current MSRV (Minimum Supported Rust Version) is 1.85.0 but this item is stable since `1.88.0`
   --> arrow-buffer/src/buffer/boolean.rs:335:54
    |
335 |             let (left_slices, left_remainder) = left.as_chunks::<8>();
    |                                                      ^^^^^^^^^^^^^^^^
    |
    = note: you may want to conditionally increase the MSRV considered by Clippy using the `clippy::msrv` attribute
    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#incompatible_msrv
    = note: `-D clippy::incompatible-msrv` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::incompatible_msrv)]`
```
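For what it's worth, a similar shape can be written against the 1.85 MSRV with `chunks_exact`, which has been stable since Rust 1.31. This is only an illustrative sketch (the function name and the choice of AND as the op are mine, not from this PR):

```rust
/// Illustrative only: AND two equal-length byte slices 8 bytes at a time
/// using `chunks_exact` (stable since Rust 1.31) instead of `as_chunks`.
fn bitwise_and_bytes(left: &[u8], right: &[u8]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    let l_chunks = left.chunks_exact(8);
    let r_chunks = right.chunks_exact(8);
    // `remainder()` returns the trailing < 8 bytes without consuming the iterator
    let (l_rem, r_rem) = (l_chunks.remainder(), r_chunks.remainder());
    for (l, r) in l_chunks.zip(r_chunks) {
        // the chunks are exactly 8 bytes, so try_into cannot fail
        let l = u64::from_le_bytes(l.try_into().unwrap());
        let r = u64::from_le_bytes(r.try_into().unwrap());
        out.extend_from_slice(&(l & r).to_le_bytes());
    }
    // widen the remainder into zero-padded u64s, apply the op, and keep only
    // the bytes that were actually present
    let (mut lb, mut rb) = ([0u8; 8], [0u8; 8]);
    lb[..l_rem.len()].copy_from_slice(l_rem);
    rb[..r_rem.len()].copy_from_slice(r_rem);
    let tail = u64::from_le_bytes(lb) & u64::from_le_bytes(rb);
    out.extend_from_slice(&tail.to_le_bytes()[..l_rem.len()]);
    out
}
```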

@alamb (Contributor, Author) commented Dec 22, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/from_bitwise_binary_op (03d060b) to b318293 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_from_bitwise_binary_op
Results will be posted here when complete

@alamb (Contributor, Author) commented Dec 22, 2025

I may try to rejigger the boolean kernel benchmarks to include multiple allocations / sizes to try to get a better sense of this code.

@alamb-ghbot

🤖: Benchmark completed

Details

group            alamb_from_bitwise_binary_op           main
-----            ----------------------------           ----
and              1.00    208.0±5.91ns        ? ?/sec    1.34   278.8±10.07ns        ? ?/sec
and_sliced_1     1.00   1100.2±6.53ns        ? ?/sec    1.12   1226.9±6.11ns        ? ?/sec
and_sliced_24    1.40    340.9±2.49ns        ? ?/sec    1.00    243.7±2.13ns        ? ?/sec
not              1.00    145.6±1.12ns        ? ?/sec    1.00    145.2±1.27ns        ? ?/sec
not_slice_24     1.22    235.2±4.16ns        ? ?/sec    1.00    193.1±1.44ns        ? ?/sec
not_sliced_1     1.02   637.6±12.78ns        ? ?/sec    1.00   623.7±29.39ns        ? ?/sec
or               1.00    199.2±3.09ns        ? ?/sec    1.26    250.4±5.14ns        ? ?/sec
or_sliced_1      1.04   1137.6±9.72ns        ? ?/sec    1.00   1097.7±8.11ns        ? ?/sec
or_sliced_24     1.21    289.7±5.66ns        ? ?/sec    1.00    239.7±2.13ns        ? ?/sec

@alamb (Contributor, Author) commented Dec 22, 2025

OK, given that totally specializing the slicing didn't help, I am now 100% convinced that the difference in performance reported for _sliced_24 is due to alignment or some other crazy thing.

What I plan to do now is:

  1. Try to update the benchmarks to (1) allocate the input arrays before any other allocations happen and (2) include multiple different alignments (allocating in a few different ways; see the sketch below)
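For example (a hypothetical helper, not part of this PR or the existing benchmark code), inputs at several bit alignments can be produced by over-allocating the backing buffers up front and then slicing:

```rust
use arrow_buffer::{BooleanBuffer, Buffer};

/// Build a (left, right) pair of boolean buffers whose data starts
/// `offset_bits` bits into freshly allocated backing memory.
fn make_inputs(len: usize, offset_bits: usize) -> (BooleanBuffer, BooleanBuffer) {
    let bytes = (len + offset_bits + 7) / 8;
    // allocate the backing buffers first, before anything else the benchmark does
    let left = BooleanBuffer::new(Buffer::from_vec(vec![0b0101_0101u8; bytes]), 0, len + offset_bits);
    let right = BooleanBuffer::new(Buffer::from_vec(vec![0b0011_0011u8; bytes]), 0, len + offset_bits);
    (left.slice(offset_bits, len), right.slice(offset_bits, len))
}

fn main() {
    for offset_bits in [0, 1, 24, 63] {
        let (l, r) = make_inputs(4096, offset_bits);
        // the existing BitAnd impl on &BooleanBuffer exercises the binary kernel path
        let anded = &l & &r;
        assert_eq!(anded.len(), 4096);
    }
}
```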

@alamb (Contributor, Author) commented Dec 23, 2025

I made a PR to try to make the benchmarks more reproducible:
