-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix UTF-8 boundary validation for sliced substring #9015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix UTF-8 boundary validation for sliced substring #9015
Conversation
mhilton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this change is necessary. Slicing a (Large)StringArray doesn't change the data buffer, so the offsets into the data buffer also do not change. Do you have an example of a StringArray where the existing code produces incorrect results?
|
Thanks for the clarification — that helps.
You’re absolutely right that slicing a (Large)StringArray does not change
the
underlying data buffer, and that the stored offsets remain relative to that
buffer.
The concern here is not that slicing mutates offsets, but that substring
operates on value-relative ranges derived from the offset pairs, while the
existing UTF-8 boundary validation reasons about the full buffer. This means
the validation is effectively checking a stronger condition than required:
that the offset is a UTF-8 boundary in the entire buffer, rather than within
the value’s byte range.
Conceptually, substring only needs to ensure that the computed offset is a
valid UTF-8 boundary relative to the value slice `[offsets[i],
offsets[i+1])`.
Validating against the full buffer can reject offsets that are value-aligned
but not globally meaningful outside that range.
That said, I agree that this distinction is subtle, and without a concrete
example where the current implementation produces incorrect results, the
change may not be justified. I will try to construct a minimal reproducer or
add a targeted test demonstrating this behavior. If I’m unable to do so, I’m
happy to drop or revise the change.
Thanks again for taking the time to review this — I appreciate the guidance.
Utkarsh Sahay
…On Thu, 18 Dec 2025, 1:53 pm Martin Hilton, ***@***.***> wrote:
***@***.**** commented on this pull request.
I don't understand why this change is necessary. Slicing a
(Large)StringArray doesn't change the data buffer, so the offsets into
the data buffer also do not change. Do you have an example of a StringArray
where the existing code produces incorrect results?
—
Reply to this email directly, view it on GitHub
<#9015 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AYHU2BMNJILFA3HEP4Q7QTD4CJP7PAVCNFSM6AAAAACPMSF64KVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZTKOJRGM3TSOBWGA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
This sounds like a good plan |
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
|
Thanks for the note! I’ve marked the PR as ready for review now. Please let me know if any further changes are needed. |
|
I don't think we can accept this PR without a test demonstrating what issue it is fixing |
What does this PR do?
This PR fixes UTF-8 boundary validation in substring kernels for sliced
Utf8andLargeUtf8arrays.Previously, UTF-8 boundary checks were performed against the full underlying
buffer, which could lead to incorrect validation when arrays were sliced.
This change ensures boundaries are validated relative to each value.
Why is this change needed?
Substring kernels operate on value-relative offsets. Validating offsets
against the global buffer can incorrectly reject valid boundaries or accept
invalid ones when arrays are sliced. This fix aligns validation with
value-level semantics.
What changes were made?
Tests