Skip to content

Conversation

@mkleen
Copy link

@mkleen mkleen commented Dec 13, 2025

Which issue does this PR close?

N/A

Rationale for this change

I have a PR to improve zip perf for Utf8View/BinaryView scalars and I need benchmarks for that.

What changes are included in this PR?

This extends the zip benchmarks by one new Input Generator for StringViews and two more functions to test scalar combinations of different StringViews combinations.

Are these changes tested?

N/A

Are there any user-facing changes?

No

@mkleen mkleen marked this pull request as ready for review December 13, 2025 14:40
@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 13, 2025
@mkleen mkleen force-pushed the zip-string-view-bench branch from 93a5b60 to 6d5422a Compare December 13, 2025 14:43
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mkleen


let null_scalar = input_generator_1.generate_scalar_with_null_value();

let [non_null_scalar_1]: [_; 1] = input_generator_1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this code:

  1. It seems to be making an array (not a scalar)
  2. generate_non_null_scalars returns a vector but then the callsite only ever seems to use a single return value -- why not just return a single array?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, generate_non_null_scalars is also used at https://github.com/apache/arrow-rs/blob/main/arrow/benches/zip_kernels.rs#L156-L157 to retrieve two values.

Copy link
Author

@mkleen mkleen Dec 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer to introduce a generate_non_null_scalar function returning a single Array for all InputGenerators ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a simplified version. I hope this makes more sense to you.

@mkleen mkleen requested a review from alamb December 15, 2025 00:26
@mkleen mkleen force-pushed the zip-string-view-bench branch 6 times, most recently from fc37028 to 5b6d229 Compare December 15, 2025 07:43
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mkleen


let null_scalar = input_generator_1.generate_null();

let non_null_scalar_1 = input_generator_1.generate();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry -- I still don't get this benchmark -- calling input_generator_1.generate() results in a array (not a scalar):

impl GenerateStringView {
    fn name(&self) -> &str {
        self.description.as_str()
    }
    fn generate_null(&self) -> ArrayRef {
        new_null_array(&DataType::Utf8View, 1)
    }

    fn generate(&self) -> ArrayRef {
        Arc::new(create_string_view_array_with_fixed_len(
            1,
            0.0,
            self.str_len,
        ))
    }
}

I would have expected a test something like

let null_scalar = input_generator.generate_null_scalar();
let non_null_scalar = input_generator.generate_scalar(); // make a 1 row array
let array = input_generator.generate_array(); // make a 8192 array

...
And then you could run the combinations 

(array, non_null_scalar),
(non_null_scalar, array),
(array, null_scalar),
(null_scalar, array)


Copy link
Author

@mkleen mkleen Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to create the benchmark for the combination where both sides are scalars, with the following combinations:

(non_null_scalar, non_null_scalar)
(null_scalar, non_null_scalar)
(non_null_scalar, null_scalar)

and different string-view sizes on each side with different distributions of true/false etc. Each value is an array with a single value, which is then wrapped in a Scalar to match this condition:

https://github.com/apache/arrow-rs/blob/main/arrow-select/src/zip.rs#L104

and dispatch to the implementation to zip two scalars.

So you are saying this assumption is not correct and it should be instead (array, non_null_scalar) with an array of the length 8192 ?

Then i need to rework the benchmarks completely.

Copy link
Author

@mkleen mkleen Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this code from the given zip benchmark as foundation:

https://github.com/apache/arrow-rs/blob/main/arrow/benches/zip_kernels.rs#L149-L183

Copy link
Author

@mkleen mkleen Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just double-checked the benchmarks from #8653 (the original optimization) see #8653 (comment). The optimisation only kicks in when both truthy and falsy are scalars, meaning arrays with single rows. The array vs scalar case is not improved.

Copy link
Author

@mkleen mkleen Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the generator returns now scalars instead of arrays directly. Regarding the combinations: I still think these are valid. The optimisation is based on scalar vs scalar values.

@mkleen mkleen requested a review from alamb December 15, 2025 22:41
@mkleen mkleen force-pushed the zip-string-view-bench branch from 53577d7 to f2c7015 Compare December 18, 2025 09:56
@mkleen
Copy link
Author

mkleen commented Dec 18, 2025

Dear @alamb,

I just pushed another version. This just implements the existing InputGenerator trait for StringViews and uses the already existing benchmark function generating the following combinations:

(null_scalar, non_null_scalar)
(non_null_scalar,  null_scalar)
(non_nulls_scalar, non_nulls_scalar)
(array, non_null_scalar)
(non_null_scalar, array)
(array, array)

This does the job to show the improvements of #8963 for the first three combinations in the spirit of #8653

If you prefer to also refactor the benchmarks further with your suggestion regarding the return types of the InputGenerator, I am happy to do so, however this would mean I would need to adapt already existing benchmark code. Let me know.

@mkleen mkleen force-pushed the zip-string-view-bench branch from f2c7015 to 2c558de Compare December 18, 2025 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants