@etseidl etseidl commented Dec 16, 2025

Which issue does this PR close?

Rationale for this change

Adds the ability to skip decoding more types of statistics contained in the Parquet column metadata. While this currently does not have a large impact on decode time, it can reduce the amount of memory used by the ParquetMetaData.

What changes are included in this PR?

Adds more options and tests for those options. Also adds size statistics to the metadata bench.
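
To illustrate the memory argument in the rationale, here is a minimal, self-contained sketch. It does not use the parquet crate: `DecodeOptions`, `RawColumnMeta`, `ColumnMeta`, `decode`, and `stats_heap_bytes` are all hypothetical names invented for this example, and the actual option names added by this PR may differ. The idea it shows is that statistics which are skipped are never materialized, so the decoded metadata holds less heap data.

```rust
// Hypothetical sketch only: none of these types belong to the parquet crate.

/// Which optional statistics to skip while decoding (hypothetical flags).
#[derive(Clone, Copy, Default)]
struct DecodeOptions {
    skip_column_stats: bool,
    skip_size_stats: bool,
}

/// Per-column-chunk metadata as it might look before decoding.
struct RawColumnMeta {
    num_values: i64,
    min_max: Option<(Vec<u8>, Vec<u8>)>,
    level_histogram: Option<Vec<i64>>,
}

/// Decoded form that stays in memory alongside the file metadata.
struct ColumnMeta {
    num_values: i64,
    min_max: Option<(Vec<u8>, Vec<u8>)>,
    level_histogram: Option<Vec<i64>>,
}

/// Copy only the statistics the caller asked for; skipped ones stay `None`.
fn decode(raw: &RawColumnMeta, opts: DecodeOptions) -> ColumnMeta {
    ColumnMeta {
        num_values: raw.num_values,
        min_max: if opts.skip_column_stats { None } else { raw.min_max.clone() },
        level_histogram: if opts.skip_size_stats { None } else { raw.level_histogram.clone() },
    }
}

/// Rough heap usage of the optional statistics held by one column.
fn stats_heap_bytes(meta: &ColumnMeta) -> usize {
    let min_max = meta
        .min_max
        .as_ref()
        .map_or(0, |(min, max)| min.len() + max.len());
    let histogram = meta
        .level_histogram
        .as_ref()
        .map_or(0, |h| h.len() * std::mem::size_of::<i64>());
    min_max + histogram
}

fn main() {
    let raw = RawColumnMeta {
        num_values: 1000,
        min_max: Some((vec![0u8; 16], vec![255u8; 16])),
        level_histogram: Some(vec![10, 990]),
    };
    let full = decode(&raw, DecodeOptions::default());
    let lean = decode(
        &raw,
        DecodeOptions { skip_column_stats: true, skip_size_stats: true },
    );
    println!(
        "{} values, full: {} bytes of stats, lean: {} bytes of stats",
        full.num_values,
        stats_heap_bytes(&full),
        stats_heap_bytes(&lean)
    );
}
```

In this toy model the saving is per column chunk, so it would grow with the number of columns and row groups, which is consistent with the wide-schema cases in the metadata bench being the interesting ones.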

Are these changes tested?

Yes

Are there any user-facing changes?

Only adds new options, no breaking changes.

@github-actions bot added the parquet label (Changes to the parquet crate) on Dec 16, 2025
alamb commented Dec 19, 2025

Sorry, I didn't see this one before. I'll try to review it shortly.

alamb commented Dec 19, 2025

run benchmark encoding metadata

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing skip_column_stats (072ecf6) to c2bd7d9 diff
BENCH_NAME=encoding
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench encoding
BENCH_FILTER=
BENCH_BRANCH_NAME=skip_column_stats
Results will be posted here when complete

@alamb alamb left a comment

Makes sense to me -- thank you @etseidl

I kicked off a few more benchmarks just to be sure this doesn't have some weird impact, but I don't expect it to.

@alamb-ghbot

🤖: Benchmark completed

group                                                                main                                   skip_column_stats
-----                                                                ----                                   -----------------
decoding: dtype=FixedLenByteArray(16), encoding=BYTE_STREAM_SPLIT    1.01    560.7±6.71µs        ? ?/sec    1.00    557.3±4.57µs        ? ?/sec
decoding: dtype=FixedLenByteArray(2), encoding=BYTE_STREAM_SPLIT     1.00    371.3±4.78µs        ? ?/sec    1.03   383.7±15.37µs        ? ?/sec
decoding: dtype=f32, encoding=BYTE_STREAM_SPLIT                      1.00     34.4±0.21µs        ? ?/sec    1.00     34.5±0.45µs        ? ?/sec
decoding: dtype=f64, encoding=BYTE_STREAM_SPLIT                      1.00    105.0±1.28µs        ? ?/sec    1.00    105.2±1.16µs        ? ?/sec
encoding: dtype=FixedLenByteArray(16), encoding=BYTE_STREAM_SPLIT    1.00    325.4±3.17µs        ? ?/sec    1.00    325.0±3.15µs        ? ?/sec
encoding: dtype=FixedLenByteArray(2), encoding=BYTE_STREAM_SPLIT     1.05     44.2±6.92µs        ? ?/sec    1.00     42.1±5.30µs        ? ?/sec
encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT                      1.00     22.9±0.16µs        ? ?/sec    1.00     22.9±0.06µs        ? ?/sec
encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT                      1.00     56.9±0.27µs        ? ?/sec    1.00     57.0±0.34µs        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing skip_column_stats (072ecf6) to c2bd7d9 diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=skip_column_stats
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

group                                            main                                   skip_column_stats
-----                                            ----                                   -----------------
decode metadata (wide) with schema               1.02     40.1±0.31ms        ? ?/sec    1.00     39.5±0.42ms        ? ?/sec
decode metadata (wide) with skip PES             1.04     40.1±0.70ms        ? ?/sec    1.00     38.7±0.25ms        ? ?/sec
decode metadata (wide) with skip all stats                                              1.00     44.2±0.44ms        ? ?/sec
decode metadata (wide) with skip column stats                                           1.00     42.6±0.50ms        ? ?/sec
decode metadata (wide) with skip size stats                                             1.00     49.0±1.01ms        ? ?/sec
decode metadata (wide) with stats mask           1.02     39.8±0.33ms        ? ?/sec    1.00     38.9±0.40ms        ? ?/sec
decode metadata with schema                      1.01      5.7±0.08µs        ? ?/sec    1.00      5.6±0.03µs        ? ?/sec
decode metadata with skip PES                    1.02      8.9±0.06µs        ? ?/sec    1.00      8.7±0.07µs        ? ?/sec
decode metadata with skip column stats                                                  1.00      8.9±0.13µs        ? ?/sec
decode metadata with stats mask                  1.02      8.9±0.14µs        ? ?/sec    1.00      8.8±0.10µs        ? ?/sec
decode parquet metadata                          1.00      9.4±0.11µs        ? ?/sec    1.00      9.4±0.08µs        ? ?/sec
decode parquet metadata (wide)                   1.02     43.4±0.39ms        ? ?/sec    1.00     42.4±0.43ms        ? ?/sec
decode parquet metadata w/ size stats (wide)                                            1.00     55.8±0.65ms        ? ?/sec
open(default)                                    1.02     10.0±0.20µs        ? ?/sec    1.00      9.8±0.11µs        ? ?/sec
open(page index)                                 1.00    166.1±1.37µs        ? ?/sec    1.00    165.9±0.71µs        ? ?/sec
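
For reading the tables above: the first number in each column appears to be that branch's mean time divided by the fastest mean in the same row, rounded to two decimals. That interpretation is an assumption, but it matches the posted figures. A small sketch, with `relative_to_fastest` as a hypothetical helper name, reproduces the ratios for the "decode metadata (wide) with skip PES" row:

```rust
/// Ratio of each measurement to the fastest one in the row,
/// rounded to two decimal places.
fn relative_to_fastest(times: &[f64]) -> Vec<f64> {
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    times
        .iter()
        .map(|t| (t / fastest * 100.0).round() / 100.0)
        .collect()
}

fn main() {
    // main = 40.1 ms, skip_column_stats = 38.7 ms (taken from the table)
    let ratios = relative_to_fastest(&[40.1, 38.7]);
    println!("{:?}", ratios); // prints [1.04, 1.0]
}
```

Rows that have a value only in the skip_column_stats column are benchmarks with no counterpart on main, so their ratio is 1.00 by construction and says nothing about relative speed.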

@scovich scovich left a comment

LGTM
