Description
Which part is this question about
Related to
On the parquet list there is a discussion about potentially adding a new flatbuffers based format to replace/augment the existing thrift encoding from @alkis: https://lists.apache.org/thread/1v2ww0w5956j6p64wgp6sdbo5sw7lcp6
> Andrew, do you have a more precise estimate for the speedup we could expect
> in C++? It's also important to note that Thrift's format does not allow for
> random access, meaning we will always have to parse the entire footer,
> regardless of which columns are requested.
This is a commonly raised concern about the current thrift format: the footer must always be scanned in full, because thrift's encoding does not allow random access.
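To make the random access concern concrete, here is a minimal sketch of the variable-length integer (varint) encoding that Thrift's compact protocol relies on. The function names are hypothetical and this is not code from the parquet crate. Because each value occupies between 1 and 10 bytes depending on its magnitude, the offset of the Nth value is only known after decoding the N-1 values before it.

```rust
/// Encode `value` as an unsigned LEB128 varint (7 payload bits per byte,
/// high bit set on every byte except the last).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Decode one varint starting at `pos`.
/// Returns the value and the position of the byte after it.
fn read_varint(buf: &[u8], mut pos: usize) -> Option<(u64, usize)> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(pos)?;
        pos += 1;
        if shift >= 64 {
            return None; // malformed: more than 10 bytes
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((value, pos));
        }
        shift += 7;
    }
}

/// Return the `index`-th varint in `buf`. There is no way to jump straight
/// to it: every earlier value must be decoded to learn where it ends.
fn nth_varint(buf: &[u8], index: usize) -> Option<u64> {
    let mut pos = 0;
    for _ in 0..index {
        let (_, next) = read_varint(buf, pos)?;
        pos = next;
    }
    read_varint(buf, pos).map(|(value, _)| value)
}
```

Encoding the values 1, 300, 70_000 and 5 this way takes 1, 2, 3 and 1 bytes respectively, so `nth_varint(&buf, 2)` has to walk over the first three bytes before it can read 70_000.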
The same thread estimates that using a flatbuffers footer achieves a 20x (10x * 2) improvement at the 99.9th percentile:
> Even with this conversion, we're observing a
> greater than 10x improvement in footer decoding time for footers that
> perform poorly with Thrift (at the p999 percentile). Removing the
> FileMetadata translation should easily provide another 2x speedup.
Describe your question
What is the best-case performance improvement that our custom thrift parser can achieve over the existing thrift parser? Specifically, how close can it get to 20x?
Here is what I think is the best case:
- A file with many columns (10,000 to 100,000)
- Written with statistics (both in the ColumnChunks and Page/Column index)
The comparison is:
- Load `ParquetMetadata` using the existing `ParquetMetadataReader`
- Load `ParquetMetadata` with the new reader, and only parse the necessary fields (e.g. skip over statistics, etc)
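As an illustration of what "only parse the necessary fields" could look like, below is a rough sketch of a reader that pulls one `i64` field out of a Thrift compact-protocol struct and skips everything else without allocating. It is not the arrow-rs parser: `Cursor`, `read_i64_field` and `sample_footer` are hypothetical names, and the sample bytes are only loosely modeled on a footer (a version, a list standing in for the schema, and a row count). Skipping still has to walk every byte, but it avoids building values that will be thrown away.

```rust
// Thrift compact protocol type ids used in field headers.
const BOOL_TRUE: u8 = 1;
const BOOL_FALSE: u8 = 2;
const BYTE: u8 = 3;
const I16: u8 = 4;
const I32: u8 = 5;
const I64: u8 = 6;
const DOUBLE: u8 = 7;
const BINARY: u8 = 8;
const LIST: u8 = 9;
const SET: u8 = 10;
const MAP: u8 = 11;
const STRUCT: u8 = 12;

struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn byte(&mut self) -> Option<u8> {
        let b = *self.buf.get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    fn varint(&mut self) -> Option<u64> {
        let (mut value, mut shift) = (0u64, 0u32);
        loop {
            let b = self.byte()?;
            if shift >= 64 {
                return None;
            }
            value |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Some(value);
            }
            shift += 7;
        }
    }

    fn advance(&mut self, n: usize) -> Option<()> {
        let end = self.pos.checked_add(n)?;
        if end > self.buf.len() {
            return None;
        }
        self.pos = end;
        Some(())
    }

    /// Skip one value of type `ty` without materializing it.
    /// Note: recursion depth is not bounded in this sketch.
    fn skip(&mut self, ty: u8, in_collection: bool) -> Option<()> {
        match ty {
            // Struct fields carry booleans in the header; collections use one byte each.
            BOOL_TRUE | BOOL_FALSE => if in_collection { self.advance(1) } else { Some(()) },
            BYTE => self.advance(1),
            I16 | I32 | I64 => self.varint().map(|_| ()),
            DOUBLE => self.advance(8),
            BINARY => {
                let len = usize::try_from(self.varint()?).ok()?;
                self.advance(len)
            }
            LIST | SET => {
                let header = self.byte()?;
                let len = if header >> 4 == 15 { self.varint()? } else { u64::from(header >> 4) };
                for _ in 0..len {
                    self.skip(header & 0x0f, true)?;
                }
                Some(())
            }
            MAP => {
                let len = self.varint()?;
                if len == 0 {
                    return Some(());
                }
                let types = self.byte()?;
                for _ in 0..len {
                    self.skip(types >> 4, true)?;
                    self.skip(types & 0x0f, true)?;
                }
                Some(())
            }
            STRUCT => loop {
                let header = self.byte()?;
                if header == 0 {
                    return Some(()); // STOP marker ends the struct
                }
                if header >> 4 == 0 {
                    self.varint()?; // long form: explicit field id follows
                }
                self.skip(header & 0x0f, false)?;
            },
            _ => None,
        }
    }
}

fn zigzag(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Return the `i64` stored in field `wanted` of the top-level struct, skipping all other fields.
fn read_i64_field(buf: &[u8], wanted: i16) -> Option<i64> {
    let mut c = Cursor { buf, pos: 0 };
    let mut last_id = 0i16;
    loop {
        let header = c.byte()?;
        if header == 0 {
            return None; // reached STOP without finding the field
        }
        let (delta, ty) = (header >> 4, header & 0x0f);
        let id = if delta == 0 { zigzag(c.varint()?) as i16 } else { last_id.checked_add(i16::from(delta))? };
        last_id = id;
        if id == wanted && ty == I64 {
            return Some(zigzag(c.varint()?));
        }
        c.skip(ty, false)?;
    }
}

/// Hand-built bytes: field 1 = i32 1, field 2 = list of ["a", "b"], field 3 = i64 1000, STOP.
fn sample_footer() -> Vec<u8> {
    vec![0x15, 0x02, 0x19, 0x28, 0x01, b'a', 0x01, b'b', 0x16, 0xD0, 0x0F, 0x00]
}
```

With these bytes, `read_i64_field(&sample_footer(), 3)` steps over the version and the list and decodes only the row count. A real benchmark would compare this style of selective decode against a full decode on a footer with many columns and statistics, as described above.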
Additional context
The quoted question asks about C++ performance. While that is important, it is not something I have time to pursue; I am only interested in the Rust implementation for now.