-
Notifications
You must be signed in to change notification settings - Fork 1k
[thrift-remodel] Refactor Thrift encryption and store encodings as bitmask #8587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR shows the value of reducing vector allocations. Two major offenders during decoding are the |
|
Note to reviewers: this PR is actually two things, a refactor and a performance enhancement. I've left them as two commits to make it easier to see what changed in each part. |
| #[derive(Clone, Debug, Eq, PartialEq)] | ||
| struct ColumnMetaData<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this diff is code moved here from thrift_gen.rs. From this point on is code added to support the change to encodings.
parquet/src/file/metadata/mod.rs
Outdated
| /// All encodings used for this column. | ||
| pub fn encodings(&self) -> &Vec<Encoding> { | ||
| &self.encodings | ||
| pub fn encodings(&self) -> Vec<Encoding> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we also mark this function as #[deprecated] to help people know they should use the more efficient encoding_mask if they are using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend changing the signature to avoid forcing an allocation -- for example how about returning an iterator like this?
| pub fn encodings(&self) -> Vec<Encoding> { | |
| pub fn encodings(&self) -> impl Iterator<Item = &Encoding> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in c94c263
I have a good feeling that this thrift decoder is going to be the gift that keeps on giving performance wise 🎁 |
There is more to come 😁 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @etseidl -- this looks great!
I have some suggestions for the interface design of the Encodings mask, though I think we could do it as a follow on PR as well
parquet/src/file/metadata/mod.rs
Outdated
| /// All encodings used for this column. | ||
| pub fn encodings(&self) -> &Vec<Encoding> { | ||
| &self.encodings | ||
| pub fn encodings(&self) -> Vec<Encoding> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we also mark this function as #[deprecated] to help people know they should use the more efficient encoding_mask if they are using it?
parquet/src/file/metadata/mod.rs
Outdated
| Self(ColumnChunkMetaData { | ||
| column_descr, | ||
| encodings: Vec::new(), | ||
| encodings: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet/src/file/metadata/mod.rs
Outdated
| } | ||
|
|
||
| /// All encodings used for this column, returned as a bitmask. | ||
| pub fn encodings_mask(&self) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be very hard to use as a user given the lack of documentation / export of how to interpret the values.
I left a suggestion above about how to improve it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Speed first, then usability 😆
parquet/src/file/metadata/mod.rs
Outdated
| /// All encodings used for this column. | ||
| pub fn encodings(&self) -> &Vec<Encoding> { | ||
| &self.encodings | ||
| pub fn encodings(&self) -> Vec<Encoding> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend changing the signature to avoid forcing an allocation -- for example how about returning an iterator like this?
| pub fn encodings(&self) -> Vec<Encoding> { | |
| pub fn encodings(&self) -> impl Iterator<Item = &Encoding> { |
parquet/src/basic.rs
Outdated
| } | ||
| } | ||
|
|
||
| const MAX_ENCODING: i32 = Encoding::BYTE_STREAM_SPLIT as i32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably my OOP / old age, but I would find this API easier to deal with if it were wrapped in a type that had a documented API
Something like
struct EncodingMask(i32);
impl EncodingMask {
fn new(val: i32) -> Self {
...
}
fn as_i32(&self) -> i32 {
self.0
}
}
// then you can implement From impls to/from i32 as well as a no allocation iteratorI am happy to help hack this out if you like the idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I started down that path before but got lazy 😅. I'll add it to the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall I file a ticket? (I personally get an endorphin hit for each ticket that I close, so want to offer you the same incentive 😆 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is thrift parsing related, perhaps it would be better organized as a submodule of thrift_get?
In other words, how about renaming it into parquet/src/file/metadata/thrift_gen/encryption.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. And rename thrift_gen to just thrift?
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| // a collection of generated structs used to parse thrift metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically speaking they aren't generated anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Also this is just cut-and-paste, so I should update it to match what's actually in the module.
Also, we (I) need to rename thrift_gen at some point, since that's not strictly true either.
|
I also queued up a benchmark run to confirm the performance improvement (I am sure it is there!) |
|
🤖 |
|
🤖: Benchmark completed Details
|
suggested in review.
|
Bencmark results look consistent with yours (though I didn't run with just default features) |
|
The wide results are surprising (and disappointing 😢). I'll test on a better machine. |
|
Benchmark on my workstation makes me feel a little better |
| /// | ||
| /// [`ColumnMetaData`]: https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L875 | ||
| #[derive(Debug, Clone, Default, PartialEq, Eq)] | ||
| pub struct EncodingMask(i32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
| } | ||
|
|
||
| /// Sets list of encodings for this column chunk. | ||
| pub fn set_encodings(mut self, encodings: Vec<Encoding>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow on PR it may be worth deprecating the old APIs
parquet/src/basic.rs
Outdated
| let list_ident = prot.read_list_begin()?; | ||
| for _ in 0..list_ident.size { | ||
| let val = i32::read_thrift(prot)?; | ||
| if (0..=Self::MAX_ENCODING).contains(&val) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this representation some more since I proposed it. The main usecase is to detect whether any used encoding is not yet supported. I think this means we also need to handle bits outside the current MAX_ENCODING, and also iterate over all set bits in the encodings method. It's very unlikely there will ever be over 32 encodings, but maybe we should also clamp the shift value to 31 to avoid any overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jhorstmann, that's an excellent point. I think for each enum/union we should figure out what we want to do when we encounter an unknown field. For now they all (with the exception of LogicalType, EdgeInerpolationAlgorithm, and ColumnOrder) return an error.
Encoding likely should return an error since the column will not be decodable. I think I'll change the line above to Encoding::read_thrift which will detect the error so we don't have to check for it here. The test on the range can then be removed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and I was wondering if i32 was too large, but we're already up to 10 bits with the current encodings, and with @alamb starting the discussion about adding more, I think we're likely to exhaust an i16 in the not-to-distant future.
|
Merging this to move on to some refactoring. |
#8599) # Which issue does this PR close? - Part of #5853. # Rationale for this change Earlier work had introduced some code duplication dealing with decoding of the `ColumnMetaData` Thrift struct. This PR addresses that, and also addresses earlier review comments (#8587 (comment)). # What changes are included in this PR? This PR changes how some metadata structures are parsed, utilizing a flag for required fields rather than relying on `Option::is_some`. This allows for passing around partially initialized `ColumnChunkMetaData` structs which in turn allows for sharing of the `ColumnMetaData` parsing code between the encrypted and unencrypted code paths. This PR also moves the `file/metadata/{encryption,thrift_gen}.rs` files to a new `file::metadata::thrift` module. # Are these changes tested? Covered by existing tests. # Are there any user-facing changes? No, only makes changes to private APIs.



Which issue does this PR close?
Rationale for this change
Reduce code clutter (too many
[cfg(feature = "encryption")]s).Also reduces the number of vector allocations using a strategy proposed by @jhorstmann (#8518 (comment)).
What changes are included in this PR?
Refactors (most) of the Thrift decoding involving encryption into a new module.
Replaces the
encodingsvector inColumnChunkMetaDatawith a bitmask.Are these changes tested?
Covered by existing and new tests.
Are there any user-facing changes?
Yes. This changes the return type of
ColumnChunkMetaData::encodingsfrom&Vec<Encoding>toVec<Encoding>(the vector is now created on demand rather than stored).