[thrift-remodel] Remove conversion functions for row group and column metadata #8574
Conversation
Benchmarks relative to current main.
impl<T: HeapSize> HeapSize for Box<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().heap_size()
    }
}
Not sure if this is correct. Should this also include the size of T?
Per the description of arrow-rs/parquet/src/file/metadata/memory.rs, lines 35 to 39 (at fe9a92f):

/// Return the size of any bytes allocated on the heap by this object,
/// including heap memory in those structures
///
/// Note that the size of the type itself is not included in the result --
/// instead, that size is added by the caller (e.g. container).
So in that case I do think the size of T should be included. The size of self (the pointer/Box) should not be included, but the memory it points to should be
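A minimal sketch of what that fix could look like, following the contract quoted above (this assumes the crate's `HeapSize` trait as shown; it is an illustration, not the PR's final code):

impl<T: HeapSize> HeapSize for Box<T> {
    fn heap_size(&self) -> usize {
        // T itself lives on the heap (the Box allocated it), so count its
        // size, plus whatever T allocates in turn. The Box pointer itself
        // is excluded, per the trait's contract.
        std::mem::size_of::<T>() + self.as_ref().heap_size()
    }
}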
Thanks, I'll fix.
// TODO(ets): move a lot of the encryption stuff to its own module
There is a follow-on PR ready to do just this.
thrift_struct!(
    pub(crate) struct Statistics<'a> {
this just moved from below
return Err(general_err!("Column order length mismatch"));

// using ThriftSliceInputProtocol rather than the ThriftCompactInputProtocol trait because
// these are all internal and operate on slices.
fn read_column_chunk<'a>(
I realize this is quite ugly, but necessary for both performance and for stats skipping and other optimizations. I do have a simplified version of this function in the queue, so it will get better with time.
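For readers skimming the diff, a rough sketch of the shape such a hand-rolled decoder takes is below. Note that `read_field_header` and `skip` are hypothetical stand-ins for illustration, not the crate's actual protocol API:

fn read_column_chunk_sketch(prot: &mut ThriftSliceInputProtocol<'_>) -> Result<()> {
    loop {
        // `read_field_header` is a hypothetical helper: returns None at the
        // thrift STOP marker, otherwise the next field's id and type.
        let Some(field) = prot.read_field_header()? else { break };
        match field.id {
            2 => { /* file_offset: i64::read_thrift(&mut *prot)? */ }
            3 => { /* ColumnMetaData: decoded inline, see discussion below */ }
            _ => prot.skip(field.field_type)?, // hypothetical: ignore unknown fields
        }
    }
    Ok(())
}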
alamb left a comment
Thanks @etseidl -- I started going through this PR but didn't make it all the way through. I will finish up tomorrow.
+ self.unencoded_byte_array_data_bytes.heap_size()
+ self.repetition_level_histogram.heap_size()
+ self.definition_level_histogram.heap_size()
+ self.geo_statistics.heap_size()
💯
impl HeapSize for FileMetaData {
    fn heap_size(&self) -> usize {
        #[cfg(feature = "encryption")]
- Does this change also fix #8472 (ParquetMetaData memory size is not reported accurately when `encryption` is enabled)?
No, we still need to implement HeapSize for FileDecryptor.
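For the record, a fix for #8472 would presumably look something like the sketch below. The field names are hypothetical, since FileDecryptor's layout isn't shown in this PR:

impl HeapSize for FileDecryptor {
    fn heap_size(&self) -> usize {
        // hypothetical fields, for illustration only; see #8472
        self.footer_key.heap_size() + self.column_keys.heap_size()
    }
}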
#[test]
fn test_parquet_1481() {
    let err = read_file("PARQUET-1481.parquet").unwrap_err();
    #[cfg(feature = "encryption")]
Why did this test change?
Because I've unified the file metadata decoding for both paths, so now we get the same error regardless of whether encryption is enabled or not.
🤖: Benchmark completed

🚀 that is quite nice @etseidl -- love it.
alamb left a comment
Thank you @etseidl -- this is a great example of what the new thrift decoding infrastructure allows ❤️
/// Create a [`crate::file::statistics::Statistics`] from a thrift [`Statistics`] object.
pub(crate) fn convert_stats(
This is probably a good target to consider reworking eventually as well, if it shows up in traces.
Added to the list.
let compression = codec;

// NOTE: I tried using the builder for this, but it added 20% to the execution time
that is interesting and unexpected. Sounds like a good idea to keep it like this
Spoiler: I eventually take another page from @jhorstmann's book and use the builder to create a default initialized ColumnChunkMetaData whose fields I then fill in directly. It reduces the code bloat and allows me to again split out the ColumnMetaData parsing without any drop in performance.
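In other words, a sketch of the pattern described (the builder call is the crate's, but the direct field assignments below are illustrative and assume code living inside the module where the fields are visible):

// Build a default-initialized ColumnChunkMetaData up front ...
let mut col = ColumnChunkMetaData::builder(column_descr.clone()).build()?;
// ... then fill its fields in directly while decoding, so no temporary
// ColumnMetaData is created and moved over afterwards.
col.num_values = num_values;     // hypothetical direct field access
col.compression = compression;   // hypothetical direct field access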
Ok(ParquetMetaData::new(fmd, row_groups))

/// Create [`ParquetMetaData`] from thrift input. Note that this only decodes the file metadata in
this is really neat. With this code, I can easily see a bunch more potential optimizations (e.g. use a single Vec of ColumnChunkMetaData and have each RowGroup store its starting offset into that Vec, rather than having Vecs of Vecs).
(not for this PR obviously, but for a future one)
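A sketch of that flattened layout, with hypothetical names:

// One shared arena of column chunks, with each row group holding a range
// into it instead of owning its own Vec: a single allocation instead of
// one Vec per row group.
struct FlatMetadata {
    column_chunks: Vec<ColumnChunkMetaData>, // all row groups, back to back
    row_groups: Vec<RowGroupRange>,
}

struct RowGroupRange {
    first_column_chunk: usize, // starting offset into `column_chunks`
    num_columns: usize,
}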
file_offset = Some(i64::read_thrift(&mut *prot)?);
}
3 => {
    // `ColumnMetaData`. Read inline for performance sake.
what do you mean "for performance sake" -- does that mean the rust compiler isn't inlining this when using the macro? Maybe we should sprinkle some #[inline] 🤔
It was hard to tell from profiling exactly what was adding to the overhead, but splitting this into a separate call that decodes and returns a ColumnMetaData object which is then moved into the ColumnChunkMetaData added about 5% to execution time IIRC. As mentioned above, my latest code passes the ColumnChunkMetaData in, so no temporary object is created. Then I get some code reuse and this becomes easier to read 😄
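The shape described, roughly (the function name and signature are illustrative, not the final code):

fn read_column_metadata(
    prot: &mut ThriftSliceInputProtocol<'_>,
    col: &mut ColumnChunkMetaData, // filled in place, no temporary returned
) -> Result<()> {
    // ... decode ColumnMetaData fields directly into `col` ...
    Ok(())
}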
Another benefit of all this thrift remodel work is that we can do this level of optimization (which is basically impractical with generated code)
especially when it comes to things like recasting the encodings vector as a bitmask. That gets rid of a ton of allocations 😄
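For example, a sketch of the bitmask idea (the exact representation here is an assumption; the PR's actual layout may differ):

// Store the set of encodings seen for a column as a bitmask instead of a
// Vec<Encoding>: no allocation, and set/test are single bit operations.
#[derive(Clone, Copy, Default)]
struct EncodingMask(u32);

impl EncodingMask {
    fn insert(&mut self, enc: Encoding) {
        self.0 |= 1 << (enc as u32); // assumes all discriminants are < 32
    }
    fn contains(&self, enc: Encoding) -> bool {
        self.0 & (1 << (enc as u32)) != 0
    }
}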
let mut cols = Vec::with_capacity(list_ident.size as usize);
for i in 0..list_ident.size as usize {
    let col = read_column_chunk(prot, &schema_descr.columns()[i])?;
    cols.push(col);
}
I am just curious -- did you try a more functional form (I never know quite what the rust compiler knows how to optimize).
Sometimes using this type of loop lets the compiler avoid the bounds checks, which may or may not help in this case
Suggested change:

- let mut cols = Vec::with_capacity(list_ident.size as usize);
- for i in 0..list_ident.size as usize {
-     let col = read_column_chunk(prot, &schema_descr.columns()[i])?;
-     cols.push(col);
- }
+ let cols = schema_descr
+     .columns()
+     .iter()
+     .map(|c| read_column_chunk(prot, c))
+     .collect::<Result<Vec<_>>>()?;
(I don't think we should do this as part of this PR, but it might be interesting to try in the future)
I tried that last night 😅. It was a bit slower. That pattern has been a bit of a mixed bag; I have used it where I can:
https://github.com/etseidl/arrow-rs/blob/12a8c2f34dd9410d449b8562a72665670a6b65d7/parquet/src/file/metadata/thrift_gen.rs#L656-L659
I keep trying it here and there, and keep it when it doesn't hurt performance.
fascinating!
Going to merge this and move on to the encryption refactor + part of #8518 (the latter has a small breaking change I want to get in while I can 😁).
Which issue does this PR close?

Closes #8517 (`convert_row_groups`).

Rationale for this change

A good bit (around 15%) of Parquet metadata parsing involves first decoding to thrift structs (`FileMetaData`, `RowGroup`, etc.), and then converting to the metadata structs used by this crate (`ParquetMetaData`, `RowGroupMetaData`, etc.). This PR removes some of the intermediate structures and parses straight to the crate structs.

What changes are included in this PR?

Some thrift generated structures are removed, and the code necessary to decode them has been hand written. This will enable future optimizations such as selectively decoding parts of the metadata.

In addition to the above, this PR cleans up some of the memory size computation, and also boxes some of the objects used for decryption.
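Conceptually, the decode path goes from two passes to one. A simplified sketch, where the reader function names are illustrative rather than the crate's actual API:

// Before (simplified): decode to generated thrift structs, then convert
// them in a second pass (`convert_row_groups` is removed by this PR).
let thrift_rgs: Vec<RowGroup> = read_thrift_row_groups(&mut prot)?; // hypothetical
let row_groups = convert_row_groups(thrift_rgs, &schema_descr)?;

// After (simplified): parse the thrift bytes straight into the crate structs.
let row_groups = read_row_groups(&mut prot, &schema_descr)?; // hypothetical direct reader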
Are these changes tested?
Should be covered by existing tests.
Are there any user-facing changes?
No, only private APIs are changed.