-
Notifications
You must be signed in to change notification settings - Fork 1k
[thrift-remodel] Refactor Parquet Thrift code into new thrift module
#8599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /// Create a [`crate::file::statistics::Statistics`] from a thrift [`Statistics`] object. | ||
| pub(crate) fn convert_stats( | ||
| physical_type: Type, | ||
| fn convert_stats( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a branch where this becomes read_stats and avoids the intermediate Statistics<'a> struct, but that didn't move the needle much on performance. I'll revisit this later.
|
🤖 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thanks @etseidl
I also kicked off some benchmarks to see if we can see an improvement due to the more efficient row group / column decoding
| list_ident.size | ||
| )); | ||
| } | ||
| let mut cols = Vec::with_capacity(list_ident.size as usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice -- this gets rid of another allocation in the parsing code
|
🤖: Benchmark completed Details
|
|
Thanks for the fast review @alamb. I think this is it for The next speedup will be skipping decoding of column chunk stats and encoding statistics. The latter can also become a bitmap to satisfy the only use I've heard of for the stats (enabling use of the dictionary for pruning). But enabling these optimizations is going to involve passing arguments down somehow (likely on the metadata readers) and may involve further breaking changes. |
Which issue does this PR close?
Rationale for this change
Earlier work had introduced some code duplication dealing with decoding of the
ColumnMetaDataThrift struct. This PR addresses that, and also addresses earlier review comments (#8587 (comment)).What changes are included in this PR?
This PR changes how some metadata structures are parsed, utilizing a flag for required fields rather than relying on
Option::is_some. This allows for passing around partially initializedColumnChunkMetaDatastructs which in turn allows for sharing of theColumnMetaDataparsing code between the encrypted and unencrypted code paths.This PR also moves the
file/metadata/{encryption,thrift_gen}.rsfiles to a newfile::metadata::thriftmodule.Are these changes tested?
Covered by existing tests.
Are there any user-facing changes?
No, only makes changes to private APIs.