Description
When Spark writes a Parquet file, I believe it always names the inner list field "element" rather than "item":
message spark_schema {
....
OPTIONAL group mylistcolumn (LIST) {
REPEATED group list {
OPTIONAL BYTE_ARRAY element (UTF8);
}
}
...
}
It appears that this crate (or one of its dependencies, perhaps arrow2 itself?) always assumes that the inner field name of a list is "item" rather than "element".
Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])
Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])
I'm guessing this is because of this line of code?
arrow2-convert/arrow2_convert/src/field.rs
Line 214 in 7d9e132
arrow2::datatypes::DataType::List(Box::new(<T as ArrowField>::field("item")))
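To illustrate why a hard-coded inner name could produce the mismatch above, here is a minimal sketch. It does not use arrow2: `DataType` and `Field` are simplified stand-ins I wrote for illustration, and `list_of_int32` and `same_shape` are hypothetical helpers, not part of arrow2 or arrow2-convert. It assumes the schema check is a strict equality that includes the inner field name.

```rust
// Simplified stand-ins for arrow2's types (assumption: not the real definitions).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    is_nullable: bool,
}

// Hypothetical helper: a list of Int32 whose inner field has the given name.
fn list_of_int32(inner_name: &str) -> DataType {
    DataType::List(Box::new(Field {
        name: inner_name.to_string(),
        data_type: DataType::Int32,
        is_nullable: false,
    }))
}

// Hypothetical helper: compare two types while ignoring the inner list field name.
fn same_shape(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::Int32, DataType::Int32) => true,
        (DataType::List(x), DataType::List(y)) => {
            x.is_nullable == y.is_nullable && same_shape(&x.data_type, &y.data_type)
        }
        _ => false,
    }
}

fn main() {
    let expected = list_of_int32("item"); // what the derive would generate
    let actual = list_of_int32("element"); // what a Spark-written file would contain

    // Derived equality compares the inner name too, so these differ.
    println!("strict equality: {}", expected == actual);
    // A name-insensitive comparison treats them as the same list type.
    println!("ignoring inner name: {}", same_shape(&expected, &actual));
}
```

If the real check behaves like the strict comparison here, either a configurable inner name or a name-insensitive comparison would avoid the error; which of the two fits arrow2-convert better is an open question.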
- If this is controlled by arrow2-convert, could it be made customizable, perhaps via an annotation on the struct member?
- Should the default be re-evaluated if parquet-mr / Spark uses "element"?
P.S. Likely not related, but I ran into a very similar error in this other crate as well: timvw/qv#31