
Conversation

@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR fixes a bug in the row-to-column conversion for Union types when multiple union columns are present in the same row converter.

Previously, the row slices were not advanced past the bytes consumed while decoding a union column, which prevented subsequent columns from reading their data correctly. The fix tracks the bytes consumed per row across all union fields and advances each row slice accordingly.
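
To illustrate the idea, here is a minimal, hypothetical sketch; it is not the actual arrow-row internals, and ColumnDecoder/decode_row are invented names. Each column decoder reports how many bytes it consumed so the shared row slice can be advanced before the next column is decoded.

```rust
// Hypothetical sketch of decoding one row across columns; names are
// invented and the real arrow-row code is structured differently.
trait ColumnDecoder {
    /// Decode one value from the front of `row`, returning the number of
    /// bytes consumed from the row's encoding.
    fn decode(&self, row: &[u8]) -> usize;
}

fn decode_row(mut row: &[u8], decoders: &[&dyn ColumnDecoder]) {
    for decoder in decoders {
        let consumed = decoder.decode(row);
        // The bug: for union columns this advance was skipped, so the next
        // column re-read the union's bytes and misinterpreted them.
        row = &row[consumed..];
    }
}
```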

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 15, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/track-bytes-during-union-row-decoding branch from c64321b to ba7698c Compare December 15, 2025 18:50
@friendlymatthew
Contributor Author

@Jefffrey would you have time to review this?

@Jefffrey
Contributor

> @Jefffrey would you have time to review this?

I'll try to take a look when I have time.

@Jefffrey
Copy link
Contributor

Jefffrey commented Dec 23, 2025

@friendlymatthew when I check this PR out locally, it seems to fail to compile in the tests; I assume that's because it depends on #8891?

Funny that the CI didn't fail, presumably because it merged in from main? 🤔

@Jefffrey left a comment
Contributor

This is a nice pickup, and the fix makes sense to me.

One minor note:

> This PR fixes a bug in the row-to-column conversion for Union types when multiple union columns are present in the same row converter.

I think the bug occurs whenever there are multiple columns and a union column isn't last; input like [Union, Int64] would also fail, so it isn't specifically about multiple union columns.
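
For example, something like this would hit it (a rough sketch assuming RowConverter's union support from #8891; constructor signatures may differ across arrow-rs versions):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, Int64Array, UnionArray};
use arrow_buffer::ScalarBuffer;
use arrow_row::{RowConverter, SortField};
use arrow_schema::{ArrowError, DataType, Field, UnionFields, UnionMode};

fn round_trip() -> Result<(), ArrowError> {
    // A sparse union column that is *not* the last column...
    let fields = UnionFields::new(
        vec![0_i8],
        vec![Field::new("a", DataType::Int32, false)],
    );
    let union: ArrayRef = Arc::new(UnionArray::try_new(
        fields.clone(),
        ScalarBuffer::from(vec![0_i8, 0, 0]), // type ids, one per row
        None,                                 // no offsets => sparse union
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?);
    // ...followed by a plain Int64 column.
    let ints: ArrayRef = Arc::new(Int64Array::from(vec![10_i64, 20, 30]));

    let converter = RowConverter::new(vec![
        SortField::new(DataType::Union(fields, UnionMode::Sparse)),
        SortField::new(DataType::Int64),
    ])?;
    let rows = converter.convert_columns(&[union, ints])?;
    // Before this fix, decoding panicked: the row slices were never advanced
    // past the union column's bytes, so the Int64 column read garbage.
    let _decoded = converter.convert_rows(rows.iter())?;
    Ok(())
}
```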

Comment on lines 1960 to 1964
```rust
// track bytes consumed for rows that belong to this field
for (row_idx, child_row) in field_rows.iter() {
    let remaining_len = sparse_data[*row_idx].len();
    bytes_consumed[*row_idx] = 1 + child_row.len() - remaining_len;
}
```

Suggested change:

```diff
-// track bytes consumed for rows that belong to this field
-for (row_idx, child_row) in field_rows.iter() {
-    let remaining_len = sparse_data[*row_idx].len();
-    bytes_consumed[*row_idx] = 1 + child_row.len() - remaining_len;
-}
+// ensure we advance past consumed bytes in rows
+for (row_idx, child_row) in field_rows.iter() {
+    let remaining_len = sparse_data[*row_idx].len();
+    let consumed_length = 1 + child_row.len() - remaining_len;
+    rows[*row_idx] = &rows[*row_idx][consumed_length..];
+}
```

Thoughts on inlining it like this? It removes the need for a separate bytes_consumed vec.

@friendlymatthew
Contributor Author

I like it!

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/track-bytes-during-union-row-decoding branch from ba7698c to 9a418ec Compare December 24, 2025 16:40

Labels

arrow Changes to the arrow crate


Development

Successfully merging this pull request may close these issues.

Fix panic when decoding multiple Union columns in RowConverter
