Fix UTF-8 boundary validation for sliced substring #9015

UtkarshSahay123 · 2025-12-18T06:22:01Z

What does this PR do?

This PR fixes UTF-8 boundary validation in substring kernels for sliced
Utf8 and LargeUtf8 arrays.

Previously, UTF-8 boundary checks were performed against the full underlying
buffer, which could lead to incorrect validation when arrays were sliced.
This change ensures boundaries are validated relative to each value.

Why is this change needed?

Substring kernels operate on value-relative offsets. Validating offsets
against the global buffer can incorrectly reject valid boundaries or accept
invalid ones when arrays are sliced. This fix aligns validation with
value-level semantics.

What changes were made?

- Perform UTF-8 boundary validation relative to per-value slices
- Preserve existing behavior for unsliced arrays
- No API changes

Tests

- Existing substring tests cover this behavior
- No new tests were required

mhilton

I don't understand why this change is necessary. Slicing a (Large)StringArray doesn't change the data buffer, so the offsets into the data buffer also do not change. Do you have an example of a StringArray where the existing code produces incorrect results?
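To make the point about slicing concrete, here is a minimal std-only sketch. `StrArray`, `DATA`, `OFFSETS`, and `sliced_values` are hypothetical names for illustration, not arrow-rs types or the code touched by this PR. It assumes a slice is modelled as a narrower window over the same offsets, with the data buffer shared untouched.

```rust
/// Hypothetical model of a variable-length string array: one shared data
/// buffer plus a window of offsets (number of values + 1 entries).
/// This is NOT the arrow-rs implementation, only a sketch of the layout.
struct StrArray<'a> {
    data: &'a [u8],
    offsets: &'a [usize],
}

impl<'a> StrArray<'a> {
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Slicing narrows the offsets window; the data buffer is reused as-is.
    fn slice(&self, start: usize, len: usize) -> StrArray<'a> {
        let offsets: &'a [usize] = self.offsets;
        StrArray { data: self.data, offsets: &offsets[start..=start + len] }
    }

    /// Value `i` is the byte range [offsets[i], offsets[i + 1]) of the buffer.
    fn value(&self, i: usize) -> &'a str {
        let data: &'a [u8] = self.data;
        std::str::from_utf8(&data[self.offsets[i]..self.offsets[i + 1]])
            .expect("each value is assumed to be valid UTF-8")
    }
}

// Three values, "hé", "llo", "wörld", stored back to back (12 bytes).
const DATA: &str = "héllowörld";
static OFFSETS: [usize; 4] = [0, 3, 6, 12];

/// Collects the values visible through a slice of the model array.
fn sliced_values(start: usize, len: usize) -> Vec<&'static str> {
    let full = StrArray { data: DATA.as_bytes(), offsets: &OFFSETS };
    let sliced = full.slice(start, len);
    (0..sliced.len()).map(|i| sliced.value(i)).collect()
}

fn main() {
    let full = StrArray { data: DATA.as_bytes(), offsets: &OFFSETS };
    let sliced = full.slice(1, 2);
    // Same buffer, and the surviving offsets are still absolute positions.
    assert!(std::ptr::eq(full.data, sliced.data));
    assert_eq!(sliced.offsets.to_vec(), vec![3usize, 6, 12]);
    println!("{:?}", sliced_values(1, 2));
}
```

In this model the slice still addresses its values with the original absolute offsets, which is the behavior described in the comment above.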

UtkarshSahay123 · 2025-12-18T09:03:51Z

Thanks for the clarification, that helps. You’re absolutely right that slicing a (Large)StringArray does not change the underlying data buffer, and that the stored offsets remain relative to that buffer.

The concern here is not that slicing mutates offsets, but that substring operates on value-relative ranges derived from the offset pairs, while the existing UTF-8 boundary validation reasons about the full buffer. This means the validation is effectively checking a stronger condition than required: that the offset is a UTF-8 boundary in the entire buffer, rather than within the value’s byte range.

Conceptually, substring only needs to ensure that the computed offset is a valid UTF-8 boundary relative to the value slice `[offsets[i], offsets[i+1])`. Validating against the full buffer can reject offsets that are value-aligned but not globally meaningful outside that range.

That said, I agree that this distinction is subtle, and without a concrete example where the current implementation produces incorrect results, the change may not be justified. I will try to construct a minimal reproducer or add a targeted test demonstrating this behavior. If I’m unable to do so, I’m happy to drop or revise the change.

Thanks again for taking the time to review this; I appreciate the guidance.

Utkarsh Sahay
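The two notions of "boundary" discussed here can be compared directly. The sketch below is std-only and hypothetical: `is_boundary_in_buffer`, `is_boundary_in_value`, and `checks_agree` are illustrative names, not functions from arrow-rs. It assumes the buffer as a whole is valid UTF-8 and that every offset falls on a code-point boundary. Under those assumptions the two checks appear to agree for every position inside a value, which is why a concrete reproducer would be needed to show a difference.

```rust
/// True if absolute byte position `pos` in `buf` is the start of a UTF-8
/// code point or the end of the buffer. Continuation bytes are 0b10xx_xxxx.
fn is_boundary_in_buffer(buf: &[u8], pos: usize) -> bool {
    pos == buf.len() || (pos < buf.len() && (buf[pos] & 0xC0) != 0x80)
}

/// The same question asked relative to a single value `[start, end)`.
fn is_boundary_in_value(buf: &[u8], start: usize, end: usize, rel: usize) -> bool {
    match std::str::from_utf8(&buf[start..end]) {
        Ok(value) => value.is_char_boundary(rel),
        Err(_) => false, // the value itself is not valid UTF-8
    }
}

/// Compares both checks at every byte position of every value described
/// by consecutive pairs in `offsets`.
fn checks_agree(buf: &[u8], offsets: &[usize]) -> bool {
    offsets.windows(2).all(|w| {
        let (start, end) = (w[0], w[1]);
        (0..=end - start).all(|rel| {
            is_boundary_in_buffer(buf, start + rel)
                == is_boundary_in_value(buf, start, end, rel)
        })
    })
}

fn main() {
    // Values "hé", "llo", "wörld" stored back to back in one buffer.
    let buf = "héllowörld".as_bytes();
    // Full array, then a slice that keeps only the last two values.
    assert!(checks_agree(buf, &[0, 3, 6, 12]));
    assert!(checks_agree(buf, &[3, 6, 12]));
    println!("buffer-level and value-level checks agree on this input");
}
```

A test that justifies the change would need an input where `checks_agree` returns `false`, for example one that breaks an assumption stated above.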

alamb · 2025-12-19T20:09:31Z

> That said, I agree that this distinction is subtle, and without a concrete
> example where the current implementation produces incorrect results, the
> change may not be justified. I will try to construct a minimal reproducer or
> add a targeted test demonstrating this behavior. If I’m unable to do so, I’m
> happy to drop or revise the change.

This sounds like a good plan

alamb · 2025-12-19T20:09:51Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

UtkarshSahay123 · 2025-12-20T16:37:16Z

Thanks for the note! I’ve marked the PR as ready for review now. Please let me know if any further changes are needed.

alamb · 2025-12-23T14:33:11Z

I don't think we can accept this PR without a test demonstrating what issue it is fixing

Fix UTF-8 boundary validation for sliced substring

f83541c

github-actions bot added the arrow Changes to the arrow crate label Dec 18, 2025

mhilton reviewed Dec 18, 2025

alamb marked this pull request as draft December 19, 2025 20:09

UtkarshSahay123 marked this pull request as ready for review December 20, 2025 16:36

Jefffrey marked this pull request as draft December 23, 2025 15:43

UtkarshSahay123 marked this pull request as ready for review December 24, 2025 17:05
