Open
Description
Describe the enhancement requested
We recently discovered that nested data structures within parquet files, such as a struct of lists, do not benefit from the multi-threading that is enabled by default in pyarrow's parquet reader. However, if the same data is instead stored as a set of top-level list fields, multi-threading works as expected. It would be nice, if possible, to enable multi-threading within nested structures that contain multiple fields. Here are a few code snippets/screenshots for context and reproducibility:
File Generation:
# Code block to generate needed parquet files
from nested_pandas.datasets import generate_data
# Generate a parquet dataset with struct-list format
nf = generate_data(100, 2000, seed=1)[["nested"]]
nf.to_parquet("nested_parquet.parquet")
# Generate a parquet dataset with list-array format
nf["nested"].to_lists().to_parquet("list_parquet.parquet")
Versioning & Storage Context
import pyarrow as pa
# Import the submodule explicitly; `import pyarrow` alone may not load `pa.parquet`
import pyarrow.parquet as pq
pa.__version__
> '22.0.0'
# struct of lists storage as read by pyarrow
pq.read_table("nested_parquet.parquet").field("nested")
> pyarrow.Field<nested: struct<t: list<element: double>, flux: list<element: double>, band: list<element: string>>>
# list storage as read by pyarrow
pq.read_table("list_parquet.parquet").field("t")
> pyarrow.Field<t: list<element: double>>
Single-Thread Timings:
Multi-Thread Timings:
We see that multi-threading improves the read speed for list-arrays, but not for struct-list formatted data.
Component(s)
Parquet