ArrowReaderMetadata API makes it too easy to (accidentally) make an additional object store request

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

[`ArrowReaderMetadata`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderMetadata.html) is used when reading Parquet files, and one major use case is to supply pre-parsed metadata (to avoid a second object store request on read) by providing the `ParquetMetaData` to [`ArrowReaderMetadata::try_new`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderMetadata.html#method.try_new).

However, the way the API is currently set up, it is easy to supply the `ParquetMetaData` and have the reader *STILL* make 2 object store requests.

This happens if the `ArrowReaderOptions` has [`with_page_index`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index) specified but the provided metadata doesn't (yet) have the page index. In that case the reader makes another request to load the page index.

This is a common source of confusion / bugs: when someone supplies the `ParquetMetaData` to the `ArrowReaderMetadata` they are very often trying to avoid a second object store request, but the second fetch often happens anyway to read the page index (thus defeating the attempt at optimization).
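
To make the failure mode concrete, here is a toy model of the behavior described above. It does not use the parquet crate: `ToyMetadata`, `ToyStore`, and `prepare_reader` are hypothetical stand-ins invented for illustration, and the real code path may differ in detail.

```rust
// Toy model only: these types are hypothetical stand-ins, not the parquet crate's API.

/// Stand-in for pre-parsed footer metadata; `page_index` is `None` until loaded.
#[derive(Clone, Debug, Default)]
pub struct ToyMetadata {
    pub page_index: Option<Vec<u64>>,
}

/// Stand-in for an object store that counts how many requests it serves.
#[derive(Debug, Default)]
pub struct ToyStore {
    pub requests: usize,
}

impl ToyStore {
    /// Simulates one round trip that fetches the page index.
    pub fn fetch_page_index(&mut self) -> Vec<u64> {
        self.requests += 1;
        vec![0, 4096] // made-up page offsets
    }
}

/// Mirrors the behavior described above: if the page index is wanted but
/// missing from the supplied metadata, it is fetched without any warning.
pub fn prepare_reader(
    store: &mut ToyStore,
    mut metadata: ToyMetadata,
    want_page_index: bool,
) -> ToyMetadata {
    if want_page_index && metadata.page_index.is_none() {
        metadata.page_index = Some(store.fetch_page_index());
    }
    metadata
}
```

In this model, passing a `ToyMetadata` without a page index while asking for one bumps `requests` by one even though the caller already supplied metadata, which is the surprise this issue is about.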

This is (in a roundabout way) what is happening to @progval in https://github.com/apache/datafusion/pull/12593 and it took me a while to debug what was happening while working on the [advanced_parquet_index.rs](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs) in DataFusion


**Describe the solution you'd like**
I would like the API to be harder to misuse. 


**Describe alternatives you've considered**
For example, maybe we could make `ArrowReaderMetadata` error if it was supplied with `ParquetMetaData` that did not have the page indexes.

For instance, we could add an `ArrowReaderOptions::error_if_need_metadata` (or something similar) that would change the automatic fetch/load behavior into an error if the reader needs the page index, and the file has a page index, but it isn't yet loaded into the `ParquetMetaData`.
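
A rough sketch of how such an option might behave, written as a standalone model rather than against the parquet crate. Everything here is hypothetical: `SketchOptions`, `SketchMetadata`, `Resolution`, and `resolve_page_index` do not exist in the crate, and `error_if_need_metadata` is only the placeholder name suggested above.

```rust
// Hypothetical sketch of the proposal; none of these names exist in the parquet crate.

/// Stand-in for reader options, including the proposed strict flag.
#[derive(Clone, Copy, Debug, Default)]
pub struct SketchOptions {
    pub page_index: bool,
    pub error_if_need_metadata: bool,
}

/// Stand-in for supplied metadata and what is known about the file.
#[derive(Clone, Debug, Default)]
pub struct SketchMetadata {
    pub file_has_page_index: bool,
    pub page_index_loaded: bool,
}

/// What the reader would do next.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// The supplied metadata is sufficient; no extra request.
    UseAsIs,
    /// Current behavior: make another request for the page index.
    FetchPageIndex,
}

/// Decide whether to reuse the metadata, fetch the page index, or fail.
pub fn resolve_page_index(
    md: &SketchMetadata,
    opts: SketchOptions,
) -> Result<Resolution, String> {
    // All three conditions from the proposal must hold for a fetch to be needed.
    let needs_fetch = opts.page_index && md.file_has_page_index && !md.page_index_loaded;
    if !needs_fetch {
        return Ok(Resolution::UseAsIs);
    }
    if opts.error_if_need_metadata {
        // Strict mode: surface the hidden request as an error instead.
        return Err("page index requested but not present in supplied metadata".to_string());
    }
    Ok(Resolution::FetchPageIndex)
}
```

With the strict flag off, the model keeps today's behavior and reports `FetchPageIndex`. With it on, the same inputs produce an `Err`, so the caller finds out at setup time that the supplied metadata was incomplete.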

**Additional context**
