Conversation

@DarkWanderer

Which issue does this PR close?

Rationale for this change

Some databases (Grafana Tempo being one example) use column dictionaries as makeshift column indexes to speed up ad-hoc filtering. Checking whether a low-cardinality value is present in the dictionary makes it possible to pre-filter data effectively by skipping whole row groups. This PR adds that capability.

What changes are included in this PR?

Adds a public `get_row_group_column_dictionary` method to `ParquetRecordBatchStreamBuilder`.
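A sketch of the intended call pattern, assuming the new method mirrors the existing `get_row_group_column_bloom_filter` and returns the decoded dictionary as an Arrow `ArrayRef` (neither the exact signature nor the return type is pinned down in this thread):

```rust
use arrow_array::cast::AsArray;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use parquet::errors::Result;
use tokio::fs::File;

/// Hypothetical pre-filter pass: keep only the row groups whose dictionary
/// contains `needle`. Assumes `get_row_group_column_dictionary` returns
/// `Result<Option<ArrayRef>>`, with `None` when no dictionary is available.
async fn row_groups_containing(
    builder: &mut ParquetRecordBatchStreamBuilder<File>,
    column_idx: usize,
    needle: &str,
) -> Result<Vec<usize>> {
    let mut keep = Vec::new();
    for rg in 0..builder.metadata().num_row_groups() {
        match builder.get_row_group_column_dictionary(rg, column_idx).await? {
            // The downcast assumes a string column for this sketch
            Some(dict) if dict.as_string::<i32>().iter().flatten().any(|v| v == needle) => {
                keep.push(rg)
            }
            // No dictionary or no match: skip. A conservative caller would
            // instead keep chunks that lack a (complete) dictionary.
            _ => {}
        }
    }
    Ok(keep)
}
```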

Are these changes tested?

Tests have been added

Are there any user-facing changes?

Public API extension for ParquetRecordBatchStreamBuilder

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 17, 2025
@DarkWanderer DarkWanderer marked this pull request as draft December 17, 2025 17:17
@tustvold
Contributor

tustvold commented Dec 17, 2025

Unfortunately dictionary encoding is best effort, and writers will fall back to different encodings if the dictionary gets too large. The result is that you need to know whether all the pages are dictionary encoded in order to make this optimisation - iirc this information is not encoded anywhere but the page header itself...

Putting this aside, there are likely some challenges around typing with this approach.

IMO bloom filters are the recommended way to handle this sort of thing; dictionaries are more of an encoding optimisation.
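For context, that route already exists on the async builder; a minimal sketch using the current `get_row_group_column_bloom_filter` API (the `ByteArray` probe assumes a byte-array/string column):

```rust
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
use parquet::data_type::ByteArray;
use parquet::errors::Result;
use tokio::fs::File;

/// Probe the per-row-group bloom filters for `needle`. A miss is definitive;
/// a hit only means "maybe present", so the result is a candidate list.
async fn bloom_filter_candidates(
    builder: &mut ParquetRecordBatchStreamBuilder<File>,
    column_idx: usize,
    needle: &str,
) -> Result<Vec<usize>> {
    let needle = ByteArray::from(needle);
    let mut candidates = Vec::new();
    for rg in 0..builder.metadata().num_row_groups() {
        match builder.get_row_group_column_bloom_filter(rg, column_idx).await? {
            // Filter present and the value is definitely absent: skip
            Some(sbbf) if !sbbf.check(&needle) => {}
            // No filter written, or a possible match: must read the row group
            _ => candidates.push(rg),
        }
    }
    Ok(candidates)
}
```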

Edit: I could see a world where users could opt in to having ArrowFilter passed the dictionary in a pre-pass, allowing the reader to skip decoding dictionary-encoded pages when there are no matches - but unless I'm remembering incorrectly, this wouldn't allow skipping the IO...

@DarkWanderer
Author

> Edit: I could see a world where users could opt in to having ArrowFilter passed the dictionary in a pre-pass, allowing the reader to skip decoding dictionary-encoded pages when there are no matches - but unless I'm remembering incorrectly, this wouldn't allow skipping the IO...

That is exactly what I am hoping for: to perform a multi-range fetch of a few MB from object_store to narrow the row groups down to only the ones I need, which saves me multiple gigabytes of actual S3 reads.
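A sketch of that fetch pattern, assuming a recent object_store release where `get_ranges` takes `Range<u64>`; the dictionary page, when present, sits immediately before the first data page, so its byte range can be computed from the column chunk metadata alone:

```rust
use std::ops::Range;

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};
use parquet::file::metadata::ParquetMetaData;

/// Fetch only the dictionary pages of one column across all row groups,
/// using a single coalesced multi-range GET instead of whole column chunks.
async fn fetch_dictionary_pages(
    store: &dyn ObjectStore,
    location: &Path,
    metadata: &ParquetMetaData,
    column_idx: usize,
) -> object_store::Result<Vec<Bytes>> {
    let ranges: Vec<Range<u64>> = metadata
        .row_groups()
        .iter()
        .filter_map(|rg| {
            let col = rg.column(column_idx);
            // Chunks that never wrote a dictionary page are skipped;
            // `dictionary_page_offset..data_page_offset` spans the page
            col.dictionary_page_offset()
                .map(|off| off as u64..col.data_page_offset() as u64)
        })
        .collect();
    store.get_ranges(location, &ranges).await
}
```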

@etseidl
Contributor

etseidl commented Dec 18, 2025

> Unfortunately dictionary encoding is best effort, and writers will fall back to different encodings if the dictionary gets too large. The result is that you need to know whether all the pages are dictionary encoded in order to make this optimisation - iirc this information is not encoded anywhere but the page header itself...

This is why the page encoding stats exist in the column metadata. They will tell you whether all pages in a given chunk are dictionary encoded. (See:

```rust
/// Returns the page encoding statistics reduced to a bitmask, or `None` if statistics are
/// not available (or they were left in their original form).
///
/// The [`PageEncodingStats`] struct was added to the Parquet specification specifically to
/// enable fast determination of whether all pages in a column chunk are dictionary encoded
/// (see <https://github.com/apache/parquet-format/pull/16>).
/// Decoding the full page encoding statistics, however, can be very costly, and is not
/// necessary to support the aforementioned use case. As an alternative, this crate can
/// instead distill the list of `PageEncodingStats` down to a bitmask of just the encodings
/// used for data pages
/// (see [`ParquetMetaDataOptions::set_encoding_stats_as_mask`]).
/// To test for an all-dictionary-encoded chunk one could use this bitmask in the following way:
```

)
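The excerpt ends just before its example; a sketch of what such a test could look like, assuming the mask sets bit `1 << (encoding as i32)` for every encoding used by a data page:

```rust
use parquet::basic::Encoding;

/// True when a data-page encoding bitmask describes an all-dictionary
/// column chunk: at least one page, and no non-dictionary encoding used.
/// (How the mask is obtained is not shown in the excerpt above.)
fn all_data_pages_dictionary(mask: u64) -> bool {
    let dict_bits = (1u64 << Encoding::PLAIN_DICTIONARY as i32)
        | (1u64 << Encoding::RLE_DICTIONARY as i32);
    mask != 0 && mask & !dict_bits == 0
}
```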

That said, I'm not nuts about duplicating the logic to decode the dictionary page, and only for the async reader at that. If we're going down this path, then I think all readers should have access. Perhaps this could live in ArrowReaderBuilder or even ParquetMetaData 🤷

@DarkWanderer
Author

DarkWanderer commented Dec 19, 2025

> This is why the page encoding stats exist in the column metadata. They will tell you whether all pages in a given chunk are dictionary encoded.

Thanks for confirming - that was my intuition as well.

> That said, I'm not nuts about duplicating the logic to decode the dictionary page, and only for the async reader at that. If we're going down this path, then I think all readers should have access. Perhaps this could live in ArrowReaderBuilder or even ParquetMetaData

I would note that there is already a disparity in the available APIs. Specifically, `SerializedFileReader` gives access to `RowGroupReader`, which in turn gives raw access to `PageReader`; this capability is missing from `ParquetRecordBatchStreamBuilder`. Also, the method `get_row_group_column_bloom_filter` is already exposed on `ParquetRecordBatchStreamBuilder`, which was my motivation for placing `get_row_group_column_dictionary` next to it.
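For comparison, the raw page access available through the sync reader looks roughly like this (a sketch against the existing blocking API, with minimal error handling):

```rust
use std::fs::File;

use parquet::basic::PageType;
use parquet::column::page::{Page, PageReader};
use parquet::errors::Result;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Pull the raw dictionary page of one column chunk via the sync reader.
/// The dictionary page, if present, is the first page in the chunk.
fn read_dictionary_page(path: &str, row_group: usize, column: usize) -> Result<Option<Page>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let rg = reader.get_row_group(row_group)?;
    let mut pages = rg.get_column_page_reader(column)?;
    match pages.get_next_page()? {
        Some(page) if page.page_type() == PageType::DICTIONARY_PAGE => Ok(Some(page)),
        _ => Ok(None),
    }
}
```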

I hear your argument about API consistency and code duplication, though, and will think of a better place for it.


Development

Successfully merging this pull request may close these issues.

Enable access to column dictionaries in async reader
