enable customizing list inner child element name? #84

@AlJohri

When Spark writes a Parquet file, I believe it always names the inner list field element rather than item:

message spark_schema {
  ....
  OPTIONAL group mylistcolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}

It appears that this crate (or one of its dependencies, perhaps arrow2 itself?) always assumes that the inner field name of a list is item rather than element.

Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])
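To illustrate why the two schemas above fail to match, here is a minimal sketch using simplified stand-in types. `DataType`, `Field`, `list_of`, and `same_shape` are hypothetical and defined only for this example; they are not the real arrow2 types, although they mirror the fields shown in the debug output. The assumption is that the schema check is a strict equality that includes the inner field name.

```rust
// Simplified stand-ins for illustration only; NOT the real arrow2 types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    is_nullable: bool,
}

/// Build a list type whose inner field carries the given name.
fn list_of(inner_name: &str, inner: DataType) -> DataType {
    DataType::List(Box::new(Field {
        name: inner_name.to_string(),
        data_type: inner,
        is_nullable: false,
    }))
}

/// Compare two types while ignoring the inner field name of lists.
/// Nullability and the nested data type must still agree.
fn same_shape(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(x), DataType::List(y)) => {
            x.is_nullable == y.is_nullable && same_shape(&x.data_type, &y.data_type)
        }
        // Non-list types fall back to strict equality.
        _ => a == b,
    }
}
```

With these definitions, `list_of("item", DataType::Int32)` and `list_of("element", DataType::Int32)` compare unequal under `==`, while `same_shape` treats them as equivalent. A name-insensitive comparison like this would be one way to accept Spark's files without changing the default name.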

I'm guessing this is because of this line of code?

arrow2::datatypes::DataType::List(Box::new(<T as ArrowField>::field("item")))

  1. If this is controlled by arrow2-convert, can we perhaps customize this via an annotation on the struct member?
  2. Should the default be re-evaluated if parquet-mr / Spark uses element?
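As a rough sketch of what option 1 might look like, the inner name could be an overridable constant instead of a hard-coded "item". Everything below is hypothetical: `ListInnerName`, `DefaultList`, `SparkList`, and `describe_list` are names invented for this example and are not part of arrow2-convert. A real implementation would presumably set the constant from an attribute on the struct member.

```rust
/// Hypothetical trait: supplies the name used for a list's inner field.
trait ListInnerName {
    /// Defaults to "item", matching the currently hard-coded value.
    const INNER_NAME: &'static str = "item";
}

/// Marker type that keeps the default inner name.
struct DefaultList;
impl ListInnerName for DefaultList {}

/// Marker type that overrides the name to match Spark / parquet-mr output.
struct SparkList;
impl ListInnerName for SparkList {
    const INNER_NAME: &'static str = "element";
}

/// Render a list type description using the configured inner name.
/// `inner_type` is just a label here, e.g. "Int32".
fn describe_list<T: ListInnerName>(inner_type: &str) -> String {
    format!(
        "List(Field {{ name: {:?}, data_type: {} }})",
        T::INNER_NAME,
        inner_type
    )
}
```

Here `describe_list::<DefaultList>("Int32")` keeps the current behaviour, while `describe_list::<SparkList>("Int32")` yields a field named element, so existing users would be unaffected unless they opt in.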

P.S. Likely not related, but I ran into a very similar error in this other crate as well: timvw/qv#31
