[Parquet] Implement a "push style" API for decoding Parquet Metadata #8164

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current ParquetMetaDataReader is a wonder of software engineering thanks to @etseidl. However, it is somewhat complicated to use: it has both async and sync methods, and it keeps state internally in a non-obvious way. For example, do you call try_parse or parse_and_finish? And how is load_via_suffix_and_finish related?

Compared to what came before it, ParquetMetaDataReader is an amazing improvement, but I think we could do better.

I ran into this when I discovered that Metadata is needed when implementing a push decoder for Parquet:

Basically, I want a way to parse the metadata without ALSO doing the IO at the same time.

Describe the solution you'd like
If we want to truly separate IO and CPU, we also need a way to decode the metadata without explicit IO. Hence this PR, which provides a way to decode metadata "push style": the decoder tells you which bytes it needs. It follows the same API as the Parquet push decoder.
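To make the "push style" idea concrete, here is a minimal, self-contained sketch. It is not the arrow-rs API: `FooterPushDecoder`, `DecodeResult`, `push_range`, `try_decode`, and `decode_from_slice` are hypothetical names chosen for illustration. The only format fact it relies on is that an unencrypted Parquet file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic. The sketch stops at returning the raw metadata bytes and does not decode the Thrift payload.

```rust
use std::ops::Range;

/// Result of one decode attempt (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum DecodeResult<T> {
    /// The decoder needs these byte ranges of the file before it can progress.
    NeedsData(Vec<Range<u64>>),
    /// Decoding finished and produced a value.
    Data(T),
}

/// Toy push-style decoder: it never performs IO itself.
pub struct FooterPushDecoder {
    file_len: u64,
    /// Buffers handed over by the caller, tagged with their range in the file.
    buffers: Vec<(Range<u64>, Vec<u8>)>,
}

impl FooterPushDecoder {
    pub fn new(file_len: u64) -> Self {
        Self { file_len, buffers: Vec::new() }
    }

    /// The caller pushes bytes it fetched however it likes (sync, async, cache).
    pub fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        assert_eq!((range.end - range.start) as usize, data.len());
        self.buffers.push((range, data));
    }

    /// Returns the bytes for `range` if a single pushed buffer fully covers it.
    fn get(&self, range: &Range<u64>) -> Option<&[u8]> {
        self.buffers.iter().find_map(|(r, data)| {
            if r.start <= range.start && range.end <= r.end {
                let start = (range.start - r.start) as usize;
                let end = (range.end - r.start) as usize;
                Some(&data[start..end])
            } else {
                None
            }
        })
    }

    /// Pure CPU work: either return the raw metadata bytes or say what is missing.
    pub fn try_decode(&self) -> Result<DecodeResult<Vec<u8>>, String> {
        if self.file_len < 8 {
            return Err("file too small to contain a Parquet footer".to_string());
        }
        // Step 1: the last 8 bytes hold the metadata length and the magic.
        let footer = (self.file_len - 8)..self.file_len;
        let Some(tail) = self.get(&footer) else {
            return Ok(DecodeResult::NeedsData(vec![footer]));
        };
        // This sketch only handles the unencrypted footer magic.
        if tail[4..8] != b"PAR1"[..] {
            return Err("missing PAR1 magic".to_string());
        }
        let meta_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
        if meta_len > footer.start {
            return Err("metadata length exceeds file size".to_string());
        }
        // Step 2: the metadata sits immediately before the 8-byte footer.
        let meta = (footer.start - meta_len)..footer.start;
        match self.get(&meta) {
            Some(bytes) => Ok(DecodeResult::Data(bytes.to_vec())),
            None => Ok(DecodeResult::NeedsData(vec![meta])),
        }
    }
}

/// Drives the decoder against an in-memory "file"; a real caller would do IO here.
pub fn decode_from_slice(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = FooterPushDecoder::new(file.len() as u64);
    loop {
        match decoder.try_decode()? {
            DecodeResult::Data(meta) => return Ok(meta),
            DecodeResult::NeedsData(ranges) => {
                for r in ranges {
                    let bytes = file[r.start as usize..r.end as usize].to_vec();
                    decoder.push_range(r, bytes);
                }
            }
        }
    }
}
```

The point of the design is that all IO lives in the driver loop (`decode_from_slice` here), so the same decoder could be driven by blocking reads, async object-store requests, or a cache, without separate sync and async methods on the decoder itself.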

Describe alternatives you've considered

Additional context
