[Parquet] Split ParquetMetadataReader into IO/decoder state machine and thrift parsing #8439

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current ParquetMetadataReader intermixes three things:

  1. The state machine for decoding Parquet metadata (footer, then metadata, then optional indexes)
  2. Orchestrating IO (i.e., calling read, etc.)
  3. Decoding Thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.

Describe the solution you'd like

Now that we have a "push" style API for metadata decoding that avoids IO, I would like to separate these three parts so that features like partial decoding become practical to add.
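
As a rough illustration of what such a split could look like, here is a minimal sketch. All type and function names below (`MetadataStateMachine`, `Step`, `read_metadata`) are hypothetical and are not the existing parquet crate API. It handles only the footer and the metadata block and skips the optional indexes. The only format detail it relies on is that a Parquet file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic.

```rust
use std::ops::Range;

const FOOTER_LEN: u64 = 8;
const MAGIC: &[u8] = b"PAR1";

/// Result of pushing bytes into the state machine (hypothetical type).
#[derive(Debug, PartialEq)]
enum Step {
    /// The caller should fetch this byte range of the file and push it next.
    NeedsData(Range<u64>),
    /// The raw Thrift-encoded metadata bytes, still unparsed.
    Done(Vec<u8>),
}

enum State {
    Footer,
    Metadata { range: Range<u64> },
    Finished,
}

/// Part 1: the state machine. It performs no IO and no Thrift parsing.
struct MetadataStateMachine {
    file_len: u64,
    state: State,
}

impl MetadataStateMachine {
    fn new(file_len: u64) -> Self {
        Self { file_len, state: State::Footer }
    }

    /// The first range to fetch: the last 8 bytes of the file.
    fn initial_range(&self) -> Range<u64> {
        self.file_len.saturating_sub(FOOTER_LEN)..self.file_len
    }

    /// Push the bytes for the most recently requested range.
    fn push(&mut self, data: &[u8]) -> Result<Step, String> {
        match std::mem::replace(&mut self.state, State::Finished) {
            State::Footer => {
                if data.len() as u64 != FOOTER_LEN || &data[4..] != MAGIC {
                    return Err("invalid footer".into());
                }
                let len = u32::from_le_bytes([data[0], data[1], data[2], data[3]]) as u64;
                let end = self.file_len.checked_sub(FOOTER_LEN).ok_or("file too short")?;
                let start = end.checked_sub(len).ok_or("metadata length exceeds file")?;
                self.state = State::Metadata { range: start..end };
                Ok(Step::NeedsData(start..end))
            }
            State::Metadata { range } => {
                if data.len() as u64 != range.end - range.start {
                    return Err("wrong number of metadata bytes".into());
                }
                Ok(Step::Done(data.to_vec()))
            }
            State::Finished => Err("decoder already finished".into()),
        }
    }
}

/// Part 2: IO orchestration. "IO" here is just slicing an in-memory buffer.
/// Part 3: `parse` stands in for Thrift decoding and sees only raw bytes.
fn read_metadata<T>(file: &[u8], parse: impl FnOnce(&[u8]) -> T) -> Result<T, String> {
    let mut machine = MetadataStateMachine::new(file.len() as u64);
    let mut range = machine.initial_range();
    loop {
        let bytes = &file[range.start as usize..range.end as usize];
        match machine.push(bytes)? {
            Step::NeedsData(next) => range = next,
            Step::Done(raw) => return Ok(parse(&raw)),
        }
    }
}
```

Because the parser in this sketch is just a function over bytes, a caller could in principle swap in one that decodes only part of the structure without touching the state machine or the IO loop.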

Describe alternatives you've considered

Additional context

Labels: enhancement
