Speed up binary kernels by XXX%, add BooleanBuffer::from_bitwise_binary #9022
Conversation
Force-pushed from 04045d2 to bedcfd6.
🤖
🤖: Benchmark completed (Details)

run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)

run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)
run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)

run benchmark boolean_kernels

🤖
🤖: Benchmark completed (Details)
```rust
    F: FnMut(u64, u64) -> u64,
{
    // Safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
```
Do you know how much of a difference the full u64 alignment really makes on current target architectures? I think unaligned loads on x86 are not really slower; they may only be penalized when they cross a cache line, which should happen at most every 8th load. With that assumption, using slice::as_chunks and u64::from_le_bytes on each chunk might be simpler, and would then only require handling a suffix.
Staying with align_to and bailing on non-empty prefixes should also be fine; I would expect buffers to be aligned most of the time, but having a byte length that is a multiple of 8 might be less likely.
Other than this nit the code looks good to me 👍
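(For reference, a minimal sketch of the suggested `as_chunks` shape — an illustration, not code from this PR. `bitwise_binary` and `combine` are hypothetical names, and `slice::as_chunks` requires Rust 1.88+:)

```rust
// Sketch only: combine 8-byte chunks of two packed bitmaps with a bitwise op.
fn bitwise_binary(left: &[u8], right: &[u8], mut combine: impl FnMut(u64, u64) -> u64) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let (left_chunks, left_rem) = left.as_chunks::<8>();
    let (right_chunks, right_rem) = right.as_chunks::<8>();
    let mut out = Vec::with_capacity(left.len());
    for (l, r) in left_chunks.iter().zip(right_chunks) {
        // Unaligned 8-byte loads; cheap on x86 unless they straddle a cache line
        let v = combine(u64::from_le_bytes(*l), u64::from_le_bytes(*r));
        out.extend_from_slice(&v.to_le_bytes());
    }
    // Only a suffix remains (no prefix); apply the op on zero-extended bytes
    for (l, r) in left_rem.iter().zip(right_rem) {
        out.push(combine(*l as u64, *r as u64) as u8);
    }
    out
}
```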
TLDR is I don't really know how much of a difference this makes, and I am struggling to measure it.

Right now, the benchmarks seem super sensitive to any change. For example, in the most recent version of this PR I simply moved the code and added the check for alignment, and that resulted in consistently reproducible slowdowns on some of the sliced benchmarks:
```text
and             1.00   211.5±1.92ns   ? ?/sec    1.30    275.5±4.61ns   ? ?/sec
and_sliced_1    1.00  1095.6±3.62ns   ? ?/sec    1.12  1228.6±12.32ns   ? ?/sec
and_sliced_24   1.39   337.9±3.92ns   ? ?/sec    1.00    243.4±1.74ns   ? ?/sec
```
> With that assumption, using slice::as_chunks and u64::from_le_bytes on each chunk might be simpler, and would then only require handling a suffix.
I tried a version like this and will see how it performs:

```rust
/// Like [`Self::from_bitwise_binary_op`] but optimized for the case where the
/// inputs are fully u64 (word) aligned; unaligned inputs are handled with
/// 8-byte chunks and unaligned loads
fn try_from_aligned_bitwise_binary_op<F>(
    left: &[u8],
    right: &[u8],
    len_in_bits: usize,
    op: &mut F,
) -> Option<Self>
where
    F: FnMut(u64, u64) -> u64,
{
    // trim to length
    let len_in_bytes = ceil(len_in_bits, 8);
    let left = &left[0..len_in_bytes];
    let right = &right[0..len_in_bytes];
    // Safety: all valid bytes are valid u64s
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    let (right_prefix, right_u64s, right_suffix) = unsafe { right.align_to::<u64>() };
    if left_prefix.is_empty()
        && right_prefix.is_empty()
        && left_suffix.is_empty()
        && right_suffix.is_empty()
    {
        // the buffers are word (64 bit) aligned, so use optimized Vec code
        let result_u64s = left_u64s
            .iter()
            .zip(right_u64s.iter())
            .map(|(l, r)| op(*l, *r))
            .collect::<Vec<u64>>();
        Some(BooleanBuffer::new(Buffer::from(result_u64s), 0, len_in_bits))
    } else {
        // not word aligned: process 8-byte chunks via unaligned loads
        let (left_slices, left_remainder) = left.as_chunks::<8>();
        let (right_slices, right_remainder) = right.as_chunks::<8>();
        debug_assert_eq!(left_slices.len(), right_slices.len());
        debug_assert_eq!(left_remainder.len(), right_remainder.len());
        let mut mutable_result =
            MutableBuffer::with_capacity(left_slices.len() * 8 + left_remainder.len());
        mutable_result.extend_from_iter(
            left_slices
                .iter()
                .zip(right_slices.iter())
                .map(|(l, r)| op(u64::from_le_bytes(*l), u64::from_le_bytes(*r))),
        );
        if !left_remainder.is_empty() {
            // zero-pad the trailing partial chunk to 8 bytes, apply the op,
            // and keep only the bytes that correspond to real input
            let rem = op(
                u64::from_le_bytes({
                    let mut bytes = [0u8; 8];
                    bytes[..left_remainder.len()].copy_from_slice(left_remainder);
                    bytes
                }),
                u64::from_le_bytes({
                    let mut bytes = [0u8; 8];
                    bytes[..right_remainder.len()].copy_from_slice(right_remainder);
                    bytes
                }),
            );
            mutable_result.extend_from_slice(&rem.to_le_bytes()[..left_remainder.len()]);
        }
        Some(BooleanBuffer {
            buffer: Buffer::from(mutable_result),
            offset: 0,
            len: len_in_bits,
        })
    }
}
```

Sadly, it seems like we are not going to be able to use `as_chunks`:
```text
error: current MSRV (Minimum Supported Rust Version) is 1.85.0 but this item is stable since `1.88.0`
   --> arrow-buffer/src/buffer/boolean.rs:335:54
    |
335 |             let (left_slices, left_remainder) = left.as_chunks::<8>();
    |                                                      ^^^^^^^^^^^^^^^^
    |
    = note: you may want to conditionally increase the MSRV considered by Clippy using the `clippy::msrv` attribute
    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#incompatible_msrv
    = note: `-D clippy::incompatible-msrv` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::incompatible_msrv)]`
```
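(A minimal sketch of an MSRV-compatible alternative — an illustration, not code from this PR. `slice::chunks_exact` has been stable since Rust 1.31 and yields the same chunk/remainder split, at the cost of a `try_into` per chunk; the function and parameter names here are hypothetical:)

```rust
// Sketch only: same shape as the as_chunks version, but MSRV-friendly.
fn bitwise_binary_msrv(left: &[u8], right: &[u8], mut op: impl FnMut(u64, u64) -> u64) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    let mut out = Vec::with_capacity(left.len());
    let mut left_it = left.chunks_exact(8);
    let mut right_it = right.chunks_exact(8);
    for (l, r) in (&mut left_it).zip(&mut right_it) {
        // try_into converts the &[u8] chunk to [u8; 8]; it cannot fail here
        let l = u64::from_le_bytes(l.try_into().unwrap());
        let r = u64::from_le_bytes(r.try_into().unwrap());
        out.extend_from_slice(&op(l, r).to_le_bytes());
    }
    // chunks_exact exposes the trailing partial chunk directly
    for (l, r) in left_it.remainder().iter().zip(right_it.remainder()) {
        out.push(op(*l as u64, *r as u64) as u8);
    }
    out
}
```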
run benchmark boolean_kernels

🤖
I may try to rejigger the boolean kernel benchmarks to include multiple allocations / sizes to try to get a better sense of this code.
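(A sketch of what that reshaped benchmark could look like — illustrative only, not this PR's actual benchmark code. Sweeping lengths and slice offsets means a single allocation's alignment can't dominate the results:)

```rust
use arrow::array::BooleanArray;
use arrow::compute::and;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_and(c: &mut Criterion) {
    // Vary both total length and slice offset so each measurement covers
    // several alignments rather than one lucky (or unlucky) allocation
    for len in [1 << 10, 1 << 14, 1 << 20] {
        let left = BooleanArray::from(vec![true; len + 64]);
        let right = BooleanArray::from(vec![false; len + 64]);
        for offset in [0usize, 1, 24] {
            let l = left.slice(offset, len);
            let r = right.slice(offset, len);
            c.bench_function(&format!("and_len_{len}_offset_{offset}"), |b| {
                b.iter(|| and(&l, &r).unwrap())
            });
        }
    }
}

criterion_group!(benches, bench_and);
criterion_main!(benches);
```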
🤖: Benchmark completed (Details)
Ok, given that totally specializing the slicing didn't help, I am now 100% convinced that the difference in performance reported for and_sliced_24 is due to alignment or some other crazy thing. What I plan to do now is:
I made a PR to try to make the benchmarks more reproducible:
Which issue does this PR close?
- `Buffer::from_bitwise_unary` and `Buffer::from_bitwise_binary` me… #8854

This is the next step after:
- Speed up `not` kernel by 50%, add `BooleanBuffer::from_bitwise_unary` #8996

Rationale for this change
Also, it is hard to find the code to create new Buffers by applying bitwise unary operations.
What changes are included in this PR?
- `BooleanBuffer::from_bitwise_binary`
- `bitwise_bin_op_helper`
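(A hypothetical usage sketch: the final public signature of `from_bitwise_binary` is not shown in this thread, and is assumed here to take two packed-bitmap byte slices, a length in bits, and a bitwise op, mirroring the private helper above:)

```rust
use arrow_buffer::BooleanBuffer;

// Hypothetical: this signature is assumed, not confirmed by this thread
let left: &[u8] = &[0b1010_1010, 0b1111_0000];
let right: &[u8] = &[0b1100_1100, 0b0000_1111];
let anded = BooleanBuffer::from_bitwise_binary(left, right, 16, |l, r| l & r);
assert_eq!(anded.len(), 16);
```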
Are these changes tested?
Yes, new tests are added.
Performance results show XXX performance improvements.
Are there any user-facing changes?