Complete phase 1 of the Thrift remodel for the parquet crate #8530
Draft: etseidl wants to merge 40 commits into `main` from `gh5854_thrift_remodel` (+7,454 −3,846)
Conversation
…8048)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.
- Related to #6129

# Rationale for this change

This is the first step toward a rework of Parquet metadata handling. This PR attempts to remove as many references as possible to the `parquet::format` module from the public API. This is done by creating new enums and structs that mirror their `format` counterparts and using them in publicly exposed structures like `FileMetaData`.

# What changes are included in this PR?

# Are these changes tested?

Current tests should suffice for now. More thorough tests will be added as needed.

# Are there any user-facing changes?

Yes, public-facing interfaces should no longer expose `format`.
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.
- Related to #6129

# Rationale for this change

Next step.

# What changes are included in this PR?

Attempts to make use of macros originally developed by @jhorstmann to implement some thrift enums and unions in `basic.rs`. Also adds yet another spin on a thrift decoder, based heavily on `TCompactSliceInputProtocol`. The current approach is to use this new decoder with `TryFrom` impls on the data structures in question.

This PR does not yet complete the process far enough to parse a parquet footer, but I'm putting it out here to get early feedback on the design. Some structures already deviate enough from their thrift counterparts that the macro-based parsing is not practical. Likewise, the macro approach doesn't work for structs that need lifetime annotations (necessary for reading binary fields as slices).

# Are these changes tested?

Not yet.

# Are there any user-facing changes?

Yes.
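The `TryFrom`-on-a-decoder pattern described above can be sketched in miniature. This is an illustrative guess at the shape of the design, not the crate's actual code: `SliceDecoder` and `Example` are invented names standing in for the real slice-based protocol and a decoded metadata struct.

```rust
// Hypothetical sketch of a slice-backed thrift compact decoder used via
// `TryFrom` impls. Names are illustrative, not the parquet crate's API.
struct SliceDecoder<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> SliceDecoder<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    // Thrift compact protocol encodes unsigned integers as ULEB128 varints.
    fn read_varint(&mut self) -> Result<u64, &'static str> {
        let mut out = 0u64;
        let mut shift = 0;
        loop {
            let b = *self.buf.get(self.pos).ok_or("unexpected end of input")?;
            self.pos += 1;
            out |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Ok(out);
            }
            shift += 7;
        }
    }
}

// A struct decoded via `TryFrom<&mut Decoder>`, mirroring the PR's pattern.
struct Example {
    value: u64,
}

impl<'a, 'b> TryFrom<&'b mut SliceDecoder<'a>> for Example {
    type Error = &'static str;
    fn try_from(d: &'b mut SliceDecoder<'a>) -> Result<Self, Self::Error> {
        Ok(Example { value: d.read_varint()? })
    }
}

fn main() {
    // 300 encodes as the two varint bytes 0xAC 0x02.
    let mut d = SliceDecoder::new(&[0xAC, 0x02]);
    let e = Example::try_from(&mut d).unwrap();
    assert_eq!(e.value, 300);
}
```

Borrowing the decoder from a slice is what makes zero-copy reads of binary fields possible, which is also why lifetime annotations become an issue for the macro approach mentioned above.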
…aData` (#8111)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

# What changes are included in this PR?

This PR completes reading of the `FileMetaData` and `RowGroupMetaData` pieces of the `ParquetMetaData`. Column indexes and encryption will be follow-on work.

This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function `ParquetMetaDataReader::decode_file_metadata`, which should be a drop-in replacement for `ParquetMetaDataReader::decode_metadata`.

Still todo:
1. Add some tests that verify this produces the same output as `ParquetMetaDataReader::decode_metadata`
2. Read column indexes with new decoder
3. Read page headers with new decoder
4. Integrate with @alamb's push decoder work #8080

# Are these changes tested?

Not yet

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Speed

# What changes are included in this PR?

Still a work in progress, but begins the process of converting page index parsing to the new thrift decoder.

# Are these changes tested?

This PR actually uses the new decoder when parsing the page indexes using the existing machinery. As such, all tests involving the page indexes should apply to this code.

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Parsing the column index is _very_ slow. The largest part of the cost is taking the thrift structure (which is a struct of arrays) and converting it to an array of structs. This results in a large number of allocations when dealing with binary columns. This is an experiment in creating a new structure to hold the column index info that is a little friendlier to parse. It may also be easier to consume on the DataFusion side.

# What changes are included in this PR?

A new `ColumnIndexMetaData` enum is added, along with a type-parameterized `NativeColumnIndex` struct.

# Are these changes tested?

No, this is an experiment only. If this work can be honed into an acceptable `Index` replacement, then tests will be added at that time.

# Are there any user-facing changes?

Yes, this would be a radical change to the column indexes in `ParquetMetaData`.
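A minimal sketch of the struct-of-arrays layout this experiment describes. The type names `ColumnIndexMetaData` and `NativeColumnIndex` echo the PR, but the fields and variants shown here are guesses for illustration only:

```rust
// Illustrative struct-of-arrays column index: one Vec per statistic rather
// than one allocated struct per page. Fields are assumptions, not the
// parquet crate's actual definition.
#[allow(dead_code)]
struct NativeColumnIndex<T> {
    null_pages: Vec<bool>,
    min_values: Vec<T>,
    max_values: Vec<T>,
}

// One variant per physical type keeps min/max in native representation,
// avoiding a per-page allocation for each statistics value.
#[allow(dead_code)]
enum ColumnIndexMetaData {
    Int32(NativeColumnIndex<i32>),
    Int64(NativeColumnIndex<i64>),
    // ... other physical types would follow
}

fn main() {
    let idx = ColumnIndexMetaData::Int64(NativeColumnIndex {
        null_pages: vec![false, false],
        min_values: vec![1, 10],
        max_values: vec![9, 20],
    });
    if let ColumnIndexMetaData::Int64(i) = idx {
        // Parallel arrays stay in lockstep: one entry per page.
        assert_eq!(i.min_values.len(), i.max_values.len());
        assert_eq!(i.null_pages.len(), 2);
    }
}
```

Keeping the thrift struct-of-arrays shape end to end is what avoids the allocation-heavy conversion to an array of structs described in the rationale.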
…ng of page indexes (#8190)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Add a custom parser for `PageLocation`, as the decoding of this struct is one of several hot spots.

# What changes are included in this PR?

This adds a faster means of obtaining the struct field ids to `ThriftCompactInputProtocol`. For a small struct (3 fields) with all of them required, we can save a good bit of time bypassing `ThriftCompactInputProtocol::read_field_begin`, which is very general and can handle out-of-order fields, among other things. By adding a new function `read_field_header`, we can avoid the costly branching that occurs when calculating the new field id (as well as the special handling needed for boolean fields). Field validation is then handled on the consuming side while decoding the `PageLocation` struct.

Note that to obtain the speedup seen, we need to assume the fields will always be in order and the field ids will all be encoded as field deltas. This is probably a fairly safe assumption, but there does exist the possibility of custom thrift writers that use absolute field ids. If we encounter such a writer in the wild, this change will need to be reverted.

# Are these changes tested?

These changes should be covered by existing tests.

# Are there any user-facing changes?

None beyond the changes in this branch.
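For context on the fast path above: the thrift compact protocol's short-form field header is a single byte, with the field-id delta in the high nibble and the field type in the low nibble; a delta of zero signals the long form, where the absolute field id follows as a separate zigzag varint. A sketch of the delta-only fast path (this is not the crate's `read_field_header`, just an illustration of the byte layout):

```rust
// Decode a thrift compact-protocol short-form field header.
// Returns (new_field_id, field_type), or None when the byte uses the
// long form (delta == 0, absolute id follows), which this fast path
// deliberately does not handle.
fn read_field_header(byte: u8, prev_field_id: i16) -> Option<(i16, u8)> {
    let delta = (byte >> 4) as i16; // high nibble: delta from previous id
    let field_type = byte & 0x0f; // low nibble: compact type id
    if delta == 0 {
        return None; // long form: fall back to the general reader
    }
    Some((prev_field_id + delta, field_type))
}

fn main() {
    // delta = 1, compact type 5 (I32): first field after id 0 is id 1
    assert_eq!(read_field_header(0x15, 0), Some((1, 5)));
    // delta = 2 from field 1, compact type 6 (I64): field id 3
    assert_eq!(read_field_header(0x26, 1), Some((3, 6)));
    // delta = 0 means an absolute field id follows; fast path bails out
    assert_eq!(read_field_header(0x05, 3), None);
}
```

This is why the PR notes the optimization is safe only while writers emit in-order, delta-encoded field ids: a writer using absolute ids would always hit the `None` branch.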
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Begins adding custom thrift write support.

# What changes are included in this PR?

Adds traits to aid in writing of thrift, and modifies the thrift macros to generate writing code.

# Are these changes tested?

Yes, adds some roundtrip tests to validate that encoded data can be decoded to the same state.

# Are there any user-facing changes?

No
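The kind of primitive this write support builds on can be sketched with the compact protocol's integer encoding: signed values are zigzag-encoded, then written as ULEB128 varints. This is a generic illustration of that scheme, not code from the PR:

```rust
// Zigzag maps small-magnitude signed values to small unsigned values
// (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...), so varints stay short.
fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

// ULEB128: 7 data bits per byte, high bit set on all but the last byte.
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v & 0x7f) as u8 | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

fn main() {
    let mut buf = Vec::new();
    // zigzag(-1) = 1, which fits in a single varint byte
    write_varint(zigzag_encode(-1), &mut buf);
    assert_eq!(buf, vec![0x01]);

    buf.clear();
    // zigzag(150) = 300 = 0xAC 0x02 as a varint
    write_varint(zigzag_encode(150), &mut buf);
    assert_eq!(buf, vec![0xAC, 0x02]);
}
```

A roundtrip test of the kind the PR mentions would encode a struct with these primitives and assert the decoder reproduces the original value.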
…ter decryption code (#8313)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Continuing the remodel

# What changes are included in this PR?

This begins the process of replacing the current footer parsing code with the new version. As part of this, much of the decryption machinery also needed to be changed.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

As I started on decoding thrift page headers, I found that the approach I had been taking was no longer going to work. This PR begins the process of abstracting the thrift reader to allow for other implementations.

# What changes are included in this PR?

In addition to reworking the reader itself, this PR moves away from the previous `TryFrom` approach and instead adds a `ReadThrift` trait.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
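The shape of the change can be sketched as follows: instead of `TryFrom` impls tied to one concrete decoder, a `ReadThrift` trait is generic over an input-protocol abstraction, so a slice-backed reader and a `Read`-backed reader can share the same decode code. `ThriftInput`, `SliceInput`, and `Flag` below are invented stand-ins, not the crate's actual types:

```rust
// Invented input-protocol abstraction; the real crate's trait surely has
// many more methods (varints, binary, field headers, ...).
trait ThriftInput {
    fn read_byte(&mut self) -> Result<u8, &'static str>;
}

// One possible implementation: reading from an in-memory slice.
struct SliceInput<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ThriftInput for SliceInput<'a> {
    fn read_byte(&mut self) -> Result<u8, &'static str> {
        let b = *self.buf.get(self.pos).ok_or("unexpected end of input")?;
        self.pos += 1;
        Ok(b)
    }
}

// Decode logic written once, generic over any input implementation.
trait ReadThrift: Sized {
    fn read_thrift<R: ThriftInput>(input: &mut R) -> Result<Self, &'static str>;
}

struct Flag(bool);

impl ReadThrift for Flag {
    fn read_thrift<R: ThriftInput>(input: &mut R) -> Result<Self, &'static str> {
        Ok(Flag(input.read_byte()? != 0))
    }
}

fn main() {
    let mut input = SliceInput { buf: &[1], pos: 0 };
    let f = Flag::read_thrift(&mut input).unwrap();
    assert!(f.0);
}
```

The benefit over `TryFrom<&mut ConcreteDecoder>` is that a stream-based reader (needed later for page headers, whose encoded size is unknown up front) can reuse the same `ReadThrift` impls.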
…ers (#8376)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This continues the remodel, moving on to `PageHeader` support.

# What changes are included in this PR?

Swaps out the old `format` page header structs for new ones. This also adds a `Read`-based implementation of the thrift compact protocol reader (the sizes of the Thrift-encoded page headers are not knowable in advance, so we need a way to read them from the thrift input stream used by the page decoder).

This PR also makes decoding of the `Statistics` in the page header optional (defaults to `false`). We do not use them, and the decode takes a good chunk of time.

# Are these changes tested?

These changes should be covered by existing tests

# Are there any user-facing changes?

Yes, page-level stats are no longer decoded by default
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Closes #5854.

# Rationale for this change

Continues the remodel by implementing writing of the page index structures.

# What changes are included in this PR?

This PR removes the old `parquet::file::page_index::Index` enum and replaces it with the new `ColumnIndexMetaData` struct.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

Yes.
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This PR closes the loop: Parquet metadata is now completely handled by the new code.

# What changes are included in this PR?

Changes the metadata builders to use the new structs rather than those from `format`. As a consequence, the `close` methods no longer return a `format::FileMetaData` but instead return a `ParquetMetaData`.

# Are these changes tested?

Covered by existing tests, but many tests were modified to deal with the switch to `ParquetMetaData` mentioned above.

# Are there any user-facing changes?

Yes
…8476)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This should complete the major work of the remodel. Follow-on tasks include performance tweaks, documentation, and adding the ability to skip unneeded structures when decoding. None of the latter should involve breaking changes.

# What changes are included in this PR?

The major change is removing all of the code to convert to `parquet::format` structures. The only places those are still used are in the benchmark and `bin` code, which are not strictly in the parquet crate. Once those are cleaned up we can deprecate the `format` module.

This also adds a markdown file documenting the use of the new Thrift macros.

# Are these changes tested?

Should be covered by existing tests.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
[Benchmark tables omitted: most recent benchmark on a Core i7 MacBook, and a benchmark on an Intel i7-12700K.]
…8528)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This brings over changes to the handling of geospatial statistics introduced by @paleolimbot in #8520.

# What changes are included in this PR?

Primarily adds documentation and tests for changes already made. The only significant change is adding a `Default` implementation for `EdgeInterpolationAlgorithm`.

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
FYI I am starting to test this branch with DataFusion, but it will take me a while as the arrow upgrade is quite substantial due to how DataTypes are represented.
I got far enough to start testing DataFusion with this actual branch. I am not done yet but making good progress: #8513 (comment)
Labels

- `api-change`: Changes to the arrow API
- `next-major-release`: the PR has API changes and is waiting on the next major version
- `parquet`: Changes to the parquet crate
# Which issue does this PR close?

# Rationale for this change

See issue, but this change is needed to allow greater control over Parquet metadata handling. These changes speed up Thrift decoding about 2x, and enable further optimizations down the line, such as allowing selective decoding of only the metadata that is needed.

# What changes are included in this PR?

In short, the `format` module is no longer necessary. This crate will now decode from Thrift-encoded bytes straight to the metadata objects defined by this crate (`ParquetMetaData` and children).

# Are these changes tested?

Yes, but more should be added by follow-on PRs

# Are there any user-facing changes?

Yes, many.