While adding support for variant array unshredding to arrow-rs, I discovered that parquet-cli fails to read the Parquet files for shredded-variant cases 86 and 126, both with the same index-out-of-bounds error:
% parquet cat parquet-testing/shredded_variant/case-086.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file parquet-testing/shredded_variant/case-086.parquet
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
at org.apache.parquet.cli.Main.run(Main.java:169)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/ryan.johnson/arrow-rs/parquet-testing/shredded_variant/case-086.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
... 3 more
Caused by: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 0
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
at java.base/java.util.Objects.checkIndex(Objects.java:385)
at java.base/java.util.ArrayList.get(ArrayList.java:427)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:94)
at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:72)
at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:66)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:308)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:141)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:105)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:186)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:105)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:156)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
... 9 more
The backtrace for case-126.parquet is identical.
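Reading the trace bottom-up, `ArrayList.get` is invoked with index -1 on a length-0 list from inside `GroupColumnIO.getLast`, which (judging by the recursion in the trace) walks to the last child of each group. That access pattern only blows up if some group in the column-IO tree ends up with zero children. A minimal sketch of the failing pattern (hypothetical names, not the actual parquet-mr code):

```python
def get_last(children):
    # Mirrors the Java children.get(children.size() - 1):
    # an empty children list yields index -1, matching
    # "Index -1 out of bounds for length 0" in the trace above.
    index = len(children) - 1
    if index < 0 or index >= len(children):
        raise IndexError(f"Index {index} out of bounds for length {len(children)}")
    return children[index]
```

If that reading is right, the interesting question is why a group with no children exists in the projected schema for these two files in particular.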
Looking at the arrow-rs debug printouts of the arrays, I don't see anything obviously wrong, though.
For case-086.parquet, the input data is:
typed_value array: ListArray
[
StructArray
-- validity:
[
valid,
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
null,
[0],
null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
"comedy",
null,
"drama",
]
],
]
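For reference, per the Variant shredding spec each array element carries a binary `value` and a shredded `typed_value`, with at most one of the two non-null per element; a reader reconstructs the element by taking whichever side is present. A simplified sketch of that per-element merge (not the arrow-rs code):

```python
def merge_element(value, typed_value):
    """Pick the shredded representation when present, otherwise fall
    back to the binary Variant bytes (at most one side is non-null)."""
    if typed_value is not None:
        return typed_value
    return value  # raw Variant bytes, e.g. the [0] element above

# The three case-086 elements above pair up as:
elements = [merge_element(v, tv) for v, tv in
            [(None, "comedy"), (bytes([0]), None), (None, "drama")]]
```

So case-086 looks like a well-formed three-element array where the middle element falls back to its binary encoding.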
And for case-126.parquet, we have:
typed_value array: ListArray
[
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
null,
null,
]
-- child 1: "typed_value" (Struct([Field { name: "a", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]) }, Field { name: "b", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]) }]))
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "a" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]))
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
null,
null,
]
-- child 1: "typed_value" (Int32)
PrimitiveArray<Int32>
[
1,
2,
]
]
-- child 1: "b" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]))
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
null,
null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
"comedy",
"drama",
]
]
]
],
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
[2, 1, 2, 0, 4, 13, 115, 116, 114],
[2, 1, 3, 0, 5, 44, 40, 77, 0, 0],
]
-- child 1: "typed_value" (Struct([Field { name: "a", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]) }, Field { name: "b", data_type: Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]) }]))
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "a" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Int32, nullable: true }]))
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
null,
null,
]
-- child 1: "typed_value" (Int32)
PrimitiveArray<Int32>
[
3,
4,
]
]
-- child 1: "b" (Struct([Field { name: "value", data_type: BinaryView, nullable: true }, Field { name: "typed_value", data_type: Utf8, nullable: true }]))
StructArray
-- validity:
[
valid,
valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
null,
null,
]
-- child 1: "typed_value" (Utf8)
StringArray
[
"action",
"horror",
]
]
]
],
]
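Note that case-126's second list element has both a non-null binary `value` and a non-null `typed_value` struct. If I'm reading the shredding spec right, that is the partially-shredded-object case: `typed_value` holds the shredded fields and `value` holds any leftover fields, with the two field sets required to be disjoint. A sketch of that reconstruction (hypothetical helper; it assumes the leftover fields have already been decoded from the binary `value` column):

```python
def reconstruct_object(shredded_fields, leftover_fields):
    # shredded_fields come from typed_value; leftover_fields are the
    # fields decoded from the binary `value` bytes. The spec requires
    # the two sets to be disjoint, so merge order does not matter.
    assert not (shredded_fields.keys() & leftover_fields.keys())
    return {**shredded_fields, **leftover_fields}
```

So both elements of the second struct still look like valid partially-shredded objects to me, which is why I suspect the failure is in parquet-mr's column-IO setup rather than in the files themselves.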