Parquet-cli unable to read variant shredding tests 86 and 126? #97

@scovich

While adding support for variant array unshredding to arrow-rs, I discovered that parquet-cli is unable to correctly read the parquet files for cases 86 and 126, both due to the same index out of bounds error:

% parquet cat parquet-testing/shredded_variant/case-086.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file parquet-testing/shredded_variant/case-086.parquet
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
        at org.apache.parquet.cli.Main.run(Main.java:169)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/ryan.johnson/arrow-rs/parquet-testing/shredded_variant/case-086.parquet
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
        at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
        at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
        at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
        at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
        ... 3 more
Caused by: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 0
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
        at java.base/java.util.Objects.checkIndex(Objects.java:385)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)
        at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
        at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
        at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
        at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:72)
        at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:66)
        at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:308)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:141)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:105)
        at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:186)
        at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:105)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:156)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
        ... 9 more

The backtrace for case-126.parquet is identical.
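
For anyone skimming the trace: the innermost frames are `ArrayList.get` reached through `GroupColumnIO.getLast`, with index -1 on a list of length 0. That is the shape of failure you get when code asks for the last element of an empty list via `get(size() - 1)`. Below is a minimal Java sketch of that pattern. The names `lastOf` and `failsOnLast` are hypothetical and exist only for illustration; this is not parquet-java code, and whether an empty child list is the actual cause here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class EmptyLastChildDemo {
    // Hypothetical stand-in for "return the last child of a group".
    static <T> T lastOf(List<T> children) {
        // With an empty list, size() - 1 is -1, so get() throws.
        return children.get(children.size() - 1);
    }

    // Returns true if asking for the last child throws an IndexOutOfBoundsException.
    static boolean failsOnLast(List<String> children) {
        try {
            lastOf(children);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A group with no children reproduces the failure pattern.
        System.out.println(failsOnLast(new ArrayList<>()));
        // A group with children does not.
        System.out.println(failsOnLast(List.of("value", "typed_value")));
    }
}
```

If this is what is happening, the interesting question is which group in the schema ends up with zero children from the reader's point of view.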

Looking at the arrow-rs debug printouts of the arrays, I don't see anything obviously wrong, though.

arrow-rs debug printout

For case-086.parquet, the input data is:

typed_value array: ListArray
[
  StructArray
-- validity:
[
  valid,
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  [0],
  null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
  "comedy",
  null,
  "drama",
]
],
]

And for case-126.parquet, we have:

typed_value array: ListArray
[
  StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Struct([Field { name: "a", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]) }, Field { name: "b", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]) }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "a" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Int32)
PrimitiveArray<Int32>
[
  1,
  2,
]
]
-- child 1: "b" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
  "comedy",
  "drama",
]
]
]
],
  StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  [2, 1, 2, 0, 4, 13, 115, 116, 114],
  [2, 1, 3, 0, 5, 44, 40, 77, 0, 0],
]
-- child 1: "typed_value" (Struct([Field { name: "a", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]) }, Field { name: "b", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]) }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "a" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Int32)
PrimitiveArray<Int32>
[
  3,
  4,
]
]
-- child 1: "b" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]))
StructArray
-- validity:
[
  valid,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
  "action",
  "horror",
]
]
]
],
]
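
To make the non-null `value` bytes above easier to sanity-check, here is a rough Java sketch that decodes just the variant encodings that appear in these printouts: a primitive null, a short string, a date, and an object with 1-byte field ids and offsets. It follows my reading of the Variant binary encoding (low 2 bits of the header byte are the basic type, the upper 6 bits are the value header). `VariantPeek` and `describe` are hypothetical names, not an existing API, and the field ids index into the variant metadata dictionary, which is not shown in the printouts, so they are left as raw numbers.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;

class VariantPeek {
    // Describes the variant value whose header byte sits at data[pos].
    // Only the handful of encodings seen in this issue are handled.
    static String describe(byte[] data, int pos) {
        int header = data[pos] & 0xFF;
        int basicType = header & 0x3;   // low 2 bits
        int valueHeader = header >> 2;  // upper 6 bits
        switch (basicType) {
            case 0: // primitive: valueHeader is the primitive type id
                if (valueHeader == 0) {
                    return "null";
                }
                if (valueHeader == 11) { // date: little-endian int32 days since epoch
                    int days = (data[pos + 1] & 0xFF)
                            | ((data[pos + 2] & 0xFF) << 8)
                            | ((data[pos + 3] & 0xFF) << 16)
                            | ((data[pos + 4] & 0xFF) << 24);
                    return "date " + LocalDate.ofEpochDay(days);
                }
                return "primitive type id " + valueHeader;
            case 1: // short string: valueHeader is the byte length
                return "short string \""
                        + new String(data, pos + 1, valueHeader, StandardCharsets.UTF_8) + "\"";
            case 2: { // object
                if (valueHeader != 0) {
                    throw new IllegalArgumentException(
                            "sketch only supports 1-byte field ids, offsets, and element count");
                }
                int n = data[pos + 1] & 0xFF;      // number of fields
                int idsStart = pos + 2;            // n field ids
                int offsetsStart = idsStart + n;   // n + 1 offsets
                int valuesStart = offsetsStart + n + 1;
                StringBuilder sb = new StringBuilder("object {");
                for (int i = 0; i < n; i++) {
                    int fieldId = data[idsStart + i] & 0xFF;
                    int offset = data[offsetsStart + i] & 0xFF;
                    if (i > 0) {
                        sb.append(", ");
                    }
                    sb.append("field#").append(fieldId).append(": ")
                      .append(describe(data, valuesStart + offset));
                }
                return sb.append("}").toString();
            }
            default:
                return "unsupported basic type " + basicType;
        }
    }

    public static void main(String[] args) {
        // The [0] entry from case-086.
        System.out.println(describe(new byte[] {0}, 0));
        // The two non-null "value" entries from case-126.
        System.out.println(describe(new byte[] {2, 1, 2, 0, 4, 13, 115, 116, 114}, 0));
        System.out.println(describe(new byte[] {2, 1, 3, 0, 5, 44, 40, 77, 0, 0}, 0));
    }
}
```

Read this way, the `[0]` entry in case-086 is a variant null, and each non-null `value` in case-126 is a one-field object (a short string `"str"` and a date, respectively), which looks like a well-formed residual for a partially shredded object.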
