Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
`ArrowReaderMetadata` is used to read Parquet files, and one major use case is supplying pre-parsed metadata (to avoid a second object store request on read) by passing the `ParquetMetaData` to `ArrowReaderMetadata::try_new`.
However, the way the API is currently set up, it is easy to supply the `ParquetMetaData` and have the reader STILL make 2 object store requests.
This happens if the `ArrowReaderOptions` has `with_page_index` specified but the provided metadata doesn't (yet) have the page index. In that case the reader issues another fetch to load the page index.
This is a common source of confusion / bugs: when someone supplies the `ParquetMetaData` to the `ArrowReaderMetadata`, they are very often trying to avoid a second object store request, but the second fetch often happens anyway to read the page index (thus defeating the attempted optimization).
This is (in a roundabout way) what is happening to @progval in apache/datafusion#12593. It also took me a while to debug the same behavior while working on `advanced_parquet_index.rs` in DataFusion.
Describe the solution you'd like
I would like the API to be harder to misuse.
Describe alternatives you've considered
For example, maybe we could make `ArrowReaderMetadata` return an error if it was supplied with `ParquetMetaData` that did not have the page indexes.
Concretely, we could add an `ArrowReaderOptions::error_if_need_metadata` option (or something similar) that would change the automatic fetch/load behavior into an error if the reader needs the page index, and the file has a page index, but it isn't yet loaded into the `ParquetMetaData`.
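To make the condition above concrete, here is a rough, self-contained sketch of the decision I have in mind. It does not use the real `parquet` crate: `ToyMetadata`, `ToyOptions`, `Plan`, `MissingPageIndex`, and `plan` are all hypothetical stand-ins invented for illustration, and `error_if_need_metadata` is only the placeholder name suggested above, not an existing option.

```rust
/// Hypothetical stand-in for `ParquetMetaData`. It only tracks whether the
/// file contains a page index and whether that index is already loaded.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ToyMetadata {
    pub file_has_page_index: bool,
    pub page_index_loaded: bool,
}

/// Hypothetical stand-in for `ArrowReaderOptions`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ToyOptions {
    /// Models `with_page_index(true)`.
    pub page_index: bool,
    /// Models the proposed (not yet existing) `error_if_need_metadata` flag.
    pub error_if_need_metadata: bool,
}

/// What the reader would do with the supplied metadata.
#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    /// The supplied metadata is sufficient: no extra object store request.
    UseSupplied,
    /// The page index must be fetched: one extra object store request.
    FetchPageIndex,
}

/// Error returned instead of silently fetching the page index.
#[derive(Debug, PartialEq, Eq)]
pub struct MissingPageIndex;

/// Decide what happens when metadata is supplied up front.
pub fn plan(metadata: ToyMetadata, options: ToyOptions) -> Result<Plan, MissingPageIndex> {
    // A fetch is only needed when the reader wants the page index, the file
    // actually has one, and the supplied metadata has not loaded it yet.
    let needs_fetch = options.page_index
        && metadata.file_has_page_index
        && !metadata.page_index_loaded;

    if !needs_fetch {
        return Ok(Plan::UseSupplied);
    }
    if options.error_if_need_metadata {
        // Proposed behavior: surface the surprise to the caller.
        Err(MissingPageIndex)
    } else {
        // Behavior described in this issue: silently issue a second request.
        Ok(Plan::FetchPageIndex)
    }
}
```

With the flag off, metadata that lacks the page index yields `Plan::FetchPageIndex` (the surprising second request). With the flag on, the same input yields `Err(MissingPageIndex)`, so the caller finds out immediately that the supplied metadata is incomplete.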
Additional context