Fix row slice bug in Union column decoding with many columns #9000
Conversation
Force-pushed from c64321b to ba7698c
@Jefffrey would you have time to review this?
I'll try to take a look when I have time
@friendlymatthew when I check this PR out locally, it seems to fail to compile in the tests; I assume it's because it depends on #8891? Funny that the CI didn't fail, presumably because it merged in from main? 🤔
Jefffrey left a comment
This is a nice pickup, and the fix makes sense to me.
One minor note:
This PR fixes a bug in the row-to-column conversion for Union types when multiple union columns are present in the same row converter
I think the bug would occur whenever there are multiple columns and a union column that isn't last; something like a [Union, Int64] array input would also fail, so it's not necessarily about multiple union columns.
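To see why any column after a non-advancing decoder breaks, here is a toy sketch (not arrow-rs's actual decoder; all names are hypothetical): each row is a byte slice holding two single-byte columns, and a decoder that forgets to advance the row slice shifts the read window for every column decoded after it.

```rust
// Hypothetical illustration: a column decoder that forgets to advance
// each row's byte slice shifts the read window for later columns.
fn decode_u8_buggy(rows: &mut [&[u8]]) -> Vec<u8> {
    // reads the first byte but does NOT advance the slice
    rows.iter().map(|row| row[0]).collect()
}

fn decode_u8_fixed(rows: &mut [&[u8]]) -> Vec<u8> {
    rows.iter_mut()
        .map(|row| {
            let v = row[0];
            *row = &row[1..]; // advance past the consumed byte
            v
        })
        .collect()
}

fn main() {
    // one row encoding two columns: first column = 10, second column = 20
    let data = [10u8, 20u8];
    let mut rows: Vec<&[u8]> = vec![&data];
    let col1 = decode_u8_buggy(&mut rows);
    let col2 = decode_u8_fixed(&mut rows);
    // because the first decoder never advanced, the second column
    // re-reads the first column's byte instead of 20
    assert_eq!((col1[0], col2[0]), (10, 10));
}
```

The same shift happens regardless of how many union columns there are, which is why a single union followed by an Int64 column is enough to trigger it.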
arrow-row/src/lib.rs
Outdated
// track bytes consumed for rows that belong to this field
for (row_idx, child_row) in field_rows.iter() {
    let remaining_len = sparse_data[*row_idx].len();
    bytes_consumed[*row_idx] = 1 + child_row.len() - remaining_len;
}
Suggested change:
- // track bytes consumed for rows that belong to this field
- for (row_idx, child_row) in field_rows.iter() {
-     let remaining_len = sparse_data[*row_idx].len();
-     bytes_consumed[*row_idx] = 1 + child_row.len() - remaining_len;
- }
+ // ensure we advance past consumed bytes in rows
+ for (row_idx, child_row) in field_rows.iter() {
+     let remaining_len = sparse_data[*row_idx].len();
+     let consumed_length = 1 + child_row.len() - remaining_len;
+     rows[*row_idx] = &rows[*row_idx][consumed_length..];
+ }
Thoughts on inlining it like this, which would remove the need for a separate bytes_consumed vec?
I like it!
Force-pushed from ba7698c to 9a418ec
Which issue does this PR close?
Rationale for this change
This PR fixes a bug in the row-to-column conversion for Union types when multiple union columns are present in the same row converter.
Previously, row slices were not advanced correctly after a union column's data was read. The fix tracks the bytes consumed per row across all union fields, so each row slice is properly advanced before the next column is decoded.
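The advancing pattern the fix relies on can be sketched with a toy decoder (not arrow-rs's actual code; all names are hypothetical): each row carries one variable-width value followed by a fixed-width i64, and the first decoder must advance each row's slice by exactly the bytes it consumed so the second decoder reads from the right offset.

```rust
// Toy sketch of per-row slice advancement: each row is
// [len byte][len bytes of data][8-byte little-endian i64].
fn decode_varlen(rows: &mut [&[u8]]) -> Vec<Vec<u8>> {
    rows.iter_mut()
        .map(|row| {
            // first byte is the length of the variable-width value
            let len = row[0] as usize;
            let value = row[1..1 + len].to_vec();
            // advance this row's slice past the consumed bytes
            *row = &row[1 + len..];
            value
        })
        .collect()
}

fn decode_i64(rows: &mut [&[u8]]) -> Vec<i64> {
    rows.iter_mut()
        .map(|row| {
            let (bytes, rest) = row.split_at(8);
            *row = rest; // advance past the 8 consumed bytes
            i64::from_le_bytes(bytes.try_into().unwrap())
        })
        .collect()
}

fn main() {
    // two rows: ("hi", 7) and ("x", 9)
    let r0 = [&[2u8][..], &b"hi"[..], &7i64.to_le_bytes()[..]].concat();
    let r1 = [&[1u8][..], &b"x"[..], &9i64.to_le_bytes()[..]].concat();
    let mut rows: Vec<&[u8]> = vec![&r0, &r1];
    let vars = decode_varlen(&mut rows);
    let ints = decode_i64(&mut rows);
    assert_eq!(vars, vec![b"hi".to_vec(), b"x".to_vec()]);
    assert_eq!(ints, vec![7, 9]);
}
```

Because each decoder leaves every row's slice pointing just past its own bytes, the decoders compose in any order and for any number of columns, which is the property the fix restores for union fields.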