Added capability to fetch dictionary values #9011
Unfortunately dictionary encoding is best effort, and writers will fall back to different encodings if the dictionary gets too large. The result is that you need to know whether all the pages are dictionary encoded in order to make this optimisation; iirc this information is not encoded anywhere but the page header itself... Putting this aside, there are likely some challenges around typing with this approach. IMO bloom filters are the recommended way to handle this sort of thing; dictionaries are more of an encoding optimisation.
Edit: I could see a world where users could opt in to have ArrowFilter passed the dictionary in a pre-pass, and for this to then allow the reader to skip decoding dictionary encoded pages if there are no matches, but unless I'm remembering incorrectly this wouldn't allow skipping the IO...
That is exactly what I am hoping for - to perform a multiple-range fetch of a few MB from
This is why the page encoding stats exist in the column metadata. This will tell you if all pages in a given chunk are dictionary encoded. (See arrow-rs/parquet/src/file/metadata/mod.rs, lines 1072 to 1083 in 116ae12.)
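To illustrate the check described above, here is a minimal sketch of how the page encoding stats could be used to decide whether every data page in a column chunk is dictionary encoded. The types below are local stand-ins written for this example and are not the parquet crate's own definitions; the function name `all_data_pages_dictionary_encoded` is hypothetical. The sketch assumes that missing stats must be treated as "unknown", so it answers `false` in that case.

```rust
// Local stand-ins for the Parquet metadata types (assumption: simplified
// for illustration, not the parquet crate's actual definitions).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PageType {
    DataPage,
    DataPageV2,
    DictionaryPage,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
}

#[derive(Clone, Copy, Debug)]
struct PageEncodingStats {
    page_type: PageType,
    encoding: Encoding,
    count: i32,
}

/// Returns true only when the stats prove that every data page in the
/// chunk is dictionary encoded. Missing stats count as "unknown" -> false.
fn all_data_pages_dictionary_encoded(stats: Option<&[PageEncodingStats]>) -> bool {
    let stats = match stats {
        Some(s) => s,
        None => return false,
    };
    let mut saw_data_page = false;
    for s in stats.iter().filter(|s| s.count > 0) {
        match s.page_type {
            PageType::DataPage | PageType::DataPageV2 => {
                saw_data_page = true;
                // A single non-dictionary data page means the writer fell back.
                if !matches!(s.encoding, Encoding::PlainDictionary | Encoding::RleDictionary) {
                    return false;
                }
            }
            // The dictionary page itself is typically PLAIN encoded; ignore it.
            PageType::DictionaryPage => {}
        }
    }
    saw_data_page
}
```

A chunk whose writer fell back to plain encoding partway through reports at least one data-page entry with a non-dictionary encoding, so the function returns `false` and the dictionary cannot be trusted as a complete index for that chunk.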
That said, I'm not nuts about duplicating the logic to decode the dictionary page, and just for the async reader. If we're going down this path, then I think all readers should have access. Perhaps this could be in
Thanks for confirming, that was my intuition as well
I would note that there is already a disparity in the available APIs. I hear your argument about a consistent API and code duplication, however, and will think of a better place to insert it.
Which issue does this PR close?
Rationale for this change
Some databases, Grafana Tempo being one example, use column dictionaries as makeshift, ad hoc column indexes to improve filtering speed. Checking whether a low-cardinality value is present in the dictionary makes it possible to pre-filter data by skipping whole row groups. This PR adds this capability.
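As a rough illustration of the pre-filtering idea, the sketch below decides which row groups still need to be read for an equality filter on one column. It is not the code from this PR: `ChunkDictionary` and `row_groups_to_read` are hypothetical names, and the dictionaries are assumed to have been fetched and decoded into strings already. A row group is skipped only when its chunk is known to be fully dictionary encoded and the dictionary lacks the value; in every other case it is kept.

```rust
use std::collections::HashSet;

/// What a reader might know about one column chunk in one row group.
/// (Hypothetical type for this example.)
struct ChunkDictionary {
    /// Decoded dictionary values, if the chunk has a dictionary page.
    values: Option<HashSet<String>>,
    /// True only if every data page in the chunk is dictionary encoded.
    fully_dictionary_encoded: bool,
}

/// Returns the indices of row groups that may contain `needle` and
/// therefore still have to be read.
fn row_groups_to_read(chunks: &[ChunkDictionary], needle: &str) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, chunk)| match (&chunk.values, chunk.fully_dictionary_encoded) {
            // The dictionary is a complete list of the chunk's values:
            // keep the row group only if the value is present.
            (Some(dict), true) => dict.contains(needle),
            // No dictionary, or the writer fell back to another encoding:
            // the value could still be present, so the row group is kept.
            _ => true,
        })
        .map(|(index, _)| index)
        .collect()
}
```

The conservative default matters here: because dictionary encoding is best effort, a missing value in a partial dictionary proves nothing, so pruning is only safe for fully dictionary-encoded chunks.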
What changes are included in this PR?
Add a public get_row_group_column_dictionary function to ParquetRecordBatchStreamBuilder.
Are these changes tested?
Tests have been added
Are there any user-facing changes?
Public API extension for ParquetRecordBatchStreamBuilder.