Pass options to Parquet metadata readers #8643

@etseidl

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

One of the goals of the Thrift remodel project (#5854) was to enable such things as selective decoding of parts of the Parquet metadata. The parsers are now in place to enable this, but what is lacking now is a way to communicate which bits of the metadata are required.

Describe the solution you'd like
Some mechanism to communicate to the metadata parsers what is needed. Options can include such things as:

  • Skip some statistics fields in ColumnMetaData (Statistics, PageEncodingStatistics, SizeStatistics, etc.).
  • Parse page encoding statistics into some other form (boolean, bitmask) to support dictionary-based pushdown.
  • Column projections (i.e. skip decoding metadata/page indexes for columns that will not be read).
  • Row group selection (only parse metadata for requested set of row groups).
  • Only decode chunk statistics/column indexes for columns used in predicates.
  • Only return schema.
  • Skip schema and use a provided schema (perhaps from an earlier decode).
  • Add an optional "skip index" that will enable random access to the metadata.
  • Perhaps move encryption parameters here as well.
  • Others I haven't yet thought of.
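
As a rough illustration of the kind of mechanism the list above describes, a few of these options could be gathered into one builder-style struct. Everything in this sketch (`MetadataOptions`, `SkipStatistics`, and all of their fields and methods) is hypothetical and is not an existing API in the parquet crate; it only shows one possible shape, covering statistics skipping, column projection, row group selection, and schema-only decoding.

```rust
use std::collections::HashSet;

/// Hypothetical: which statistics structures in ColumnMetaData to skip.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct SkipStatistics {
    pub statistics: bool,
    pub page_encoding_statistics: bool,
    pub size_statistics: bool,
}

/// Hypothetical options bag for metadata parsing (not part of the parquet crate).
#[derive(Debug, Clone, Default)]
pub struct MetadataOptions {
    skip: SkipStatistics,
    /// `None` means "decode metadata for every leaf column".
    columns: Option<HashSet<usize>>,
    /// `None` means "decode metadata for every row group".
    row_groups: Option<HashSet<usize>>,
    /// When true, only the schema is wanted.
    schema_only: bool,
}

impl MetadataOptions {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_skip_statistics(mut self, skip: SkipStatistics) -> Self {
        self.skip = skip;
        self
    }

    /// Column projection: restrict decoding to the given leaf column indices.
    pub fn with_columns(mut self, cols: impl IntoIterator<Item = usize>) -> Self {
        self.columns = Some(cols.into_iter().collect());
        self
    }

    /// Row group selection: restrict decoding to the given row group indices.
    pub fn with_row_groups(mut self, groups: impl IntoIterator<Item = usize>) -> Self {
        self.row_groups = Some(groups.into_iter().collect());
        self
    }

    pub fn with_schema_only(mut self, schema_only: bool) -> Self {
        self.schema_only = schema_only;
        self
    }

    pub fn skip_statistics(&self) -> SkipStatistics {
        self.skip
    }

    /// A decoder could consult this before decoding a column chunk's metadata.
    pub fn should_decode_column(&self, idx: usize) -> bool {
        !self.schema_only && self.columns.as_ref().map_or(true, |c| c.contains(&idx))
    }

    /// A decoder could consult this before decoding a row group's metadata.
    pub fn should_decode_row_group(&self, idx: usize) -> bool {
        !self.schema_only && self.row_groups.as_ref().map_or(true, |g| g.contains(&idx))
    }
}
```

With a shape like this, a caller might write `MetadataOptions::new().with_columns([0, 2]).with_row_groups([1])` and hand the result to whichever decoder is in use. Using `None` for "no restriction" keeps the default behavior identical to decoding everything.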

Describe alternatives you've considered
These options could be added to the current properties objects, but there doesn't seem to be a single place for all of them. For instance, SerializedFileReader takes a ReadOptions, which contains a ReaderProperties that is subsequently used by the SerializedRowGroupReader and its children. On the arrow side we instead use an ArrowReaderOptions. The ParquetMetaDataReader and ParquetMetaDataPushDecoder manage their own sets of options. It would be nice to have a single place to set metadata parsing options and then pass that to the respective decoders.
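
To make the "single place" idea concrete, here is a minimal sketch in which one shared options value is referenced from two different option types. All names (`MetadataParseOptions`, `LowLevelReadOptions`, `ArrowSideOptions`, `share_options`) are hypothetical, simplified stand-ins and do not reflect how ReadOptions or ArrowReaderOptions are actually defined.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a shared set of metadata parsing options.
#[derive(Debug, Default, PartialEq)]
pub struct MetadataParseOptions {
    pub skip_statistics: bool,
    pub schema_only: bool,
}

/// Hypothetical, simplified analogue of a low-level reader options type.
#[derive(Debug, Default, Clone)]
pub struct LowLevelReadOptions {
    pub metadata: Arc<MetadataParseOptions>,
}

/// Hypothetical, simplified analogue of an Arrow-side reader options type.
#[derive(Debug, Default, Clone)]
pub struct ArrowSideOptions {
    pub metadata: Arc<MetadataParseOptions>,
}

/// Build both option sets from a single metadata options value, so the
/// settings are configured once and shared rather than duplicated.
pub fn share_options(opts: MetadataParseOptions) -> (LowLevelReadOptions, ArrowSideOptions) {
    let shared = Arc::new(opts);
    (
        LowLevelReadOptions { metadata: Arc::clone(&shared) },
        ArrowSideOptions { metadata: shared },
    )
}
```

Holding the options behind an `Arc` is just one design choice; it lets each reader path keep a cheap handle to the same settings without copying them.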

Additional context

Labels

enhancement: Any new improvement worthy of an entry in the changelog
