Measure best case performance for the new thrift-remodel / custom thrift parser

# Which part is this question about

Related to
- #5854 

On the parquet list there is a discussion, started by @alkis, about potentially adding a new flatbuffers-based format to replace/augment the existing thrift encoding: https://lists.apache.org/thread/1v2ww0w5956j6p64wgp6sdbo5sw7lcp6

> Andrew, do you have a more precise estimate for the speedup we could expect
> in C++? It's also important to note that Thrift's format does not allow for
> random access, meaning we will always have to parse the entire footer,
> regardless of which columns are requested.

This is a commonly raised concern about the current thrift format: the entire footer must always be scanned, because thrift's encoding does not allow random access.
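To make the random access concern concrete, here is a minimal toy sketch. It is not the Parquet footer layout or the real Thrift wire format: the "footer" is a hypothetical sequence of records, each a varint length followed by a payload, and `encode_records` / `find_record` are names made up for this illustration. The only thing borrowed from Thrift's compact protocol is the ULEB128 varint encoding. Because every record has a variable size, the offset of record `i` is only known after decoding the length prefixes of all records before it.

```rust
/// Encode `value` as an unsigned LEB128 varint (the variable-length integer
/// encoding used by Thrift's compact protocol).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
}

/// Decode a varint starting at `pos`. Returns the value and the position of
/// the first byte after it, or `None` on truncated or overlong input.
fn read_varint(buf: &[u8], mut pos: usize) -> Option<(u64, usize)> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        if shift >= 64 {
            return None;
        }
        let byte = *buf.get(pos)?;
        pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((value, pos));
        }
        shift += 7;
    }
}

/// Toy "footer" (hypothetical layout): each record is `varint(len) + payload`.
fn encode_records(payloads: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for payload in payloads {
        write_varint(payload.len() as u64, &mut out);
        out.extend_from_slice(payload);
    }
    out
}

/// Find record `index`. Returns the payload and the number of length
/// prefixes that had to be decoded to reach it (always `index + 1`).
fn find_record(buf: &[u8], index: usize) -> Option<(&[u8], usize)> {
    let mut pos = 0;
    let mut decoded = 0;
    loop {
        let (len, start) = read_varint(buf, pos)?;
        decoded += 1;
        let end = start.checked_add(len as usize)?;
        let payload = buf.get(start..end)?;
        if decoded - 1 == index {
            return Some((payload, decoded));
        }
        pos = end; // no offset table: the only way forward is to keep scanning
    }
}
```

In this toy a record can at least be skipped in one step once its length is known. Real Thrift structs are terminated by a stop field rather than length-prefixed, so skipping a nested struct such as statistics still means walking each of its fields, even if nothing is materialized.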

The same thread estimates that a flatbuffers footer achieves a 20x (10x * 2) improvement at the 99.9th percentile:

> Even with this conversion, we're observing a
> greater than 10x improvement in footer decoding time for footers that
> perform poorly with Thrift (at the p999 percentile). Removing the
> `FileMetadata` translation should easily provide another 2x speedup.

# Describe your question

What is the **best case** performance improvement that our custom thrift parser can achieve over the existing thrift parser? Specifically, how close is it to 20x?

Here is what I think is the best case:
1. A file with many columns (10,000 to 100,000)
2. Written with statistics (both in the ColumnChunks and Page/Column index)

The comparison is:
1. Load `ParquetMetadata` using the existing `ParquetMetadataReader`
2. Load `ParquetMetadata` with the new reader, parsing only the necessary fields (e.g. skipping over statistics)
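One way to structure the comparison above is a small timing harness like the sketch below. It uses only the standard library and deliberately leaves out the actual footer decoding, since the exact reader calls are not shown in this issue: `best_of`, `speedup`, `fraction_of_target`, and `TARGET_SPEEDUP` are hypothetical names for illustration, and the two closures passed to `best_of` are assumed to decode the same in-memory footer bytes with the existing reader and the new reader respectively.

```rust
use std::time::{Duration, Instant};

/// The 20x figure from the mailing list thread: 10x measured, times an
/// estimated further 2x from removing the `FileMetadata` translation.
const TARGET_SPEEDUP: f64 = 10.0 * 2.0;

/// Run `f` `iters` times and return the fastest observed wall-clock time.
/// Taking the minimum reduces noise from the OS and allocator, which suits
/// a "best case" measurement.
fn best_of<T>(iters: usize, mut f: impl FnMut() -> T) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..iters {
        let start = Instant::now();
        let out = f();
        let elapsed = start.elapsed();
        // Keep the result alive so the optimizer cannot drop the work.
        std::hint::black_box(out);
        best = best.min(elapsed);
    }
    best
}

/// Speedup of `candidate` relative to `baseline` (above 1.0 means faster).
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}

/// Fraction of the target speedup that was reached (1.0 means the full 20x).
fn fraction_of_target(measured: f64, target: f64) -> f64 {
    measured / target
}
```

With this in place, the comparison would be `speedup(best_of(n, existing), best_of(n, new))`, then `fraction_of_target(.., TARGET_SPEEDUP)` to see how close the result is to 20x. Note that the thread's numbers are for the p999 percentile across many footers, while this sketch reports the best time for a single footer, so the two are not directly equivalent.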




# Additional context
The quoted question asks about C++ performance, which, while important, is not something I have time to pursue. I am only interested in the Rust implementation for now.


Measure best case performance for the new thrift-remodel / custom thrift parser #8441
