Measure best case performance for the new thrift-remodel / custom thrift parser #8441

@alamb

Which part is this question about

Related to

On the parquet list there is a discussion about potentially adding a new flatbuffers-based format to replace or augment the existing thrift encoding, from @alkis: https://lists.apache.org/thread/1v2ww0w5956j6p64wgp6sdbo5sw7lcp6

Andrew, do you have a more precise estimate for the speedup we could expect
in C++? It's also important to note that Thrift's format does not allow for
random access, meaning we will always have to parse the entire footer,
regardless of which columns are requested.

This is a commonly raised concern about the current thrift format: the entire footer must always be parsed, regardless of which columns are requested, because thrift does not allow random access.
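To make the random access concern concrete, here is a minimal sketch, assuming a simplified buffer that contains nothing but back-to-back ULEB128 varints (the variable-length integer encoding used by the Thrift compact protocol). It is not the parquet footer layout and not the parser from either implementation; `read_varint` and `offset_of` are hypothetical helper names. The point is only that the offset of the N-th value is unknown until every earlier value has been decoded.

```rust
/// Decode one ULEB128 varint starting at `pos`.
/// Returns the value and the position of the next byte, or `None` if the
/// input is truncated or has too many continuation bytes.
fn read_varint(buf: &[u8], mut pos: usize) -> Option<(u64, usize)> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(pos)?;
        pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((value, pos));
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed: more than 10 bytes for one value
        }
    }
}

/// Find the byte offset of the `index`-th varint (0-based).
/// There is no way to jump there directly: each earlier value has a
/// data-dependent width, so all of them must be walked first.
fn offset_of(buf: &[u8], index: usize) -> Option<usize> {
    let mut pos = 0;
    for _ in 0..index {
        let (_, next) = read_varint(buf, pos)?;
        pos = next;
    }
    if pos < buf.len() { Some(pos) } else { None }
}

fn main() {
    // Three values: 1 (one byte), 300 (two bytes), 5 (one byte).
    let buf = [0x01, 0xac, 0x02, 0x05];
    let pos = offset_of(&buf, 2).unwrap();
    let (value, _) = read_varint(&buf, pos).unwrap();
    println!("third value {value} starts at byte {pos}");
}
```

A format with fixed-width offsets can jump straight to the entry it wants, which is the property the flatbuffers proposal relies on.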

The same thread estimates that using a flatbuffers footer achieves a 20x (10x * 2) improvement at the 99.9th percentile:

Even with this conversion, we're observing a
greater than 10x improvement in footer decoding time for footers that
perform poorly with Thrift (at the p999 percentile). Removing the
FileMetadata translation should easily provide another 2x speedup.
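As a quick check of where the 20x figure comes from: the two improvements in the quote apply one after the other, so their factors multiply. The sketch below uses a hypothetical 100 ms baseline purely for illustration; it is not a measurement from the thread, and the function names are mine.

```rust
/// Combined speedup when independent improvements are applied in sequence.
fn combined_speedup(factors: &[f64]) -> f64 {
    factors.iter().product()
}

/// Time remaining after applying a speedup to a baseline duration.
fn time_after(baseline_ms: f64, speedup: f64) -> f64 {
    baseline_ms / speedup
}

fn main() {
    // 10x from the flatbuffers footer, 2x from removing the FileMetadata translation.
    let total = combined_speedup(&[10.0, 2.0]);
    // Hypothetical baseline of 100 ms, chosen only to make the arithmetic visible.
    let remaining = time_after(100.0, total);
    println!("combined speedup: {total}x, remaining time: {remaining} ms");
}
```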

Describe your question

What is the best-case performance improvement that our custom thrift parser can achieve over the existing thrift parser? Specifically, how close is it to 20x?

Here is what I think is the best case:

  1. A file with many columns (10,000 to 100,000)
  2. Written with statistics (both in the ColumnChunks and Page/Column index)

The comparison is:

  1. Load ParquetMetaData using the existing ParquetMetaDataReader
  2. Load ParquetMetaData with the new reader, parsing only the necessary fields (e.g. skipping over statistics, etc.)
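The comparison above could be timed with a harness along these lines. This is a sketch under stated assumptions, not the benchmark used by the project: `decode_everything` and `decode_selected` are hypothetical stand-ins that only touch bytes, and a real measurement would replace them with the existing metadata reader and the new reader applied to the same footer bytes. `best_of` and `speedup` are also my own helper names.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Run `f` `iterations` times and return the fastest observed wall-clock time.
/// Returns `Duration::ZERO` when `iterations` is 0.
fn best_of<T>(iterations: usize, mut f: impl FnMut() -> T) -> Duration {
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            black_box(f()); // keep the result alive so the call is not optimized away
            start.elapsed()
        })
        .min()
        .unwrap_or(Duration::ZERO)
}

/// Speedup of `candidate` relative to `baseline` (infinite if `candidate` is zero).
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}

/// Hypothetical stand-in for "decode the whole footer": touches every byte.
fn decode_everything(footer: &[u8]) -> u64 {
    footer.iter().map(|&b| u64::from(b)).sum()
}

/// Hypothetical stand-in for "decode only the necessary fields": touches 1 byte in 16.
fn decode_selected(footer: &[u8]) -> u64 {
    footer.iter().step_by(16).map(|&b| u64::from(b)).sum()
}

fn main() {
    // Placeholder input; a real run would load the footer of a wide parquet file.
    let footer = vec![7u8; 1 << 20];

    let full = best_of(20, || decode_everything(black_box(&footer)));
    let partial = best_of(20, || decode_selected(black_box(&footer)));

    println!("full: {full:?}, partial: {partial:?}");
    println!("speedup: {:.1}x (target from the thread: 20x)", speedup(full, partial));
}
```

Taking the minimum over several runs is one common way to approximate best-case timing; a dedicated benchmark framework would give more robust statistics.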

Additional context

The question on the list asks about C++ performance. While that is important, it is not something I have time to pursue; I am only interested in the Rust implementation for now.

Labels

question: Further information is requested
