[Python] Enable multi-threaded reads of struct-list stored data in parquet files #48636

@dougbrn

Describe the enhancement requested

We recently discovered that nested data structures within parquet files, such as a struct of lists, do not benefit from the multi-threading that is enabled by default in pyarrow's parquet reader. However, if the same data is instead stored as a set of top-level list fields, multi-threading works as expected. It would be nice, if possible, to enable multi-threading within nested structures that contain multiple fields. Here are a few code snippets/screenshots for context and reproducibility:

File Generation:

# Code block to generate needed parquet files
from nested_pandas.datasets import generate_data

# Generate a parquet dataset with struct-list format
nf = generate_data(100,2000, seed=1)[["nested"]]
nf.to_parquet("nested_parquet.parquet")

# Generate a parquet dataset with list-array format
nf["nested"].to_lists().to_parquet("list_parquet.parquet")

Versioning & Storage Context

import pyarrow as pa
import pyarrow.parquet  # needed so that pa.parquet resolves below
pa.__version__
> '22.0.0'

# struct of lists storage as read by pyarrow
pa.parquet.read_table("nested_parquet.parquet").field("nested")
> pyarrow.Field<nested: struct<t: list<element: double>, flux: list<element: double>, band: list<element: string>>>

# list storage as read by pyarrow
pa.parquet.read_table("list_parquet.parquet").field("t")
> pyarrow.Field<t: list<element: double>>

Single-Thread Timings:

[Image: screenshot of single-thread read timings]

Multi-Thread Timings:

[Image: screenshot of multi-thread read timings]

We see that multi-threading improves the read speed for list-arrays, but not for struct-list formatted data.

Component(s)

Parquet
