[Parquet Metadata] API to determine page-index presence separately from page-index load

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

`apache/datafusion` ran into this while working on page pruning in [apache/datafusion#21556](https://github.com/apache/datafusion/pull/21556).

We are trying to reduce unnecessary work in low-latency scan paths and simplify page-pruning control flow.

Today, [`ParquetMetaData::column_index`](https://github.com/apache/arrow-rs/blob/68851ef953fd771cc310203c446e54145d4407e1/parquet/src/file/metadata/mod.rs#L262-L264) and [`ParquetMetaData::offset_index`](https://github.com/apache/arrow-rs/blob/68851ef953fd771cc310203c446e54145d4407e1/parquet/src/file/metadata/mod.rs#L272-L274) return `None` both when the file has no page index and when the page index has not been fetched yet. That behavior is tied to how [`ParquetMetaDataReader::load_page_index`](https://github.com/apache/arrow-rs/blob/68851ef953fd771cc310203c446e54145d4407e1/parquet/src/file/metadata/reader.rs#L496-L497) works today.

That makes it hard for downstream consumers to optimize page-pruning flow. In DataFusion, for example, we want to:
- avoid loading page-index metadata unless there is a usable page-pruning predicate
- avoid building page-pruning predicates when the file has no page index

The first part is possible today. The second is not, because when indexes are not already loaded, `None` is ambiguous.
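A minimal sketch of the ambiguity, using a hypothetical stand-in type rather than the real `ParquetMetaData`. The `MetadataModel` struct, its fields, and `caller_view` are assumptions made for illustration only. The point is that two different files look identical through an `Option`-returning accessor:

```rust
/// Hypothetical stand-in for parsed file metadata; this is NOT the arrow-rs type.
#[derive(Debug, Default)]
pub struct MetadataModel {
    /// Ground truth about the file on disk. Callers cannot observe this today.
    pub file_has_page_index: bool,
    /// Loaded column-index payload, modeled as opaque bytes per row group.
    pub column_index: Option<Vec<Vec<u8>>>,
}

impl MetadataModel {
    /// Mirrors the shape of an accessor that returns `None` when nothing is loaded.
    pub fn column_index(&self) -> Option<&[Vec<u8>]> {
        self.column_index.as_deref()
    }
}

/// Everything a caller can learn from the accessor alone.
pub fn caller_view(m: &MetadataModel) -> bool {
    m.column_index().is_some()
}
```

In this model, a file with no page index and a file whose page index has not been fetched both yield `caller_view(..) == false`, so the caller cannot decide whether building a page-pruning predicate is worthwhile.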

Relevant DataFusion code:
- [`has_page_index`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L941-L945)
- [`build_page_pruning_predicate`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L948-L954)
- [`FiltersPreparedParquetOpen::load_page_index`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L957-L984)
- helper that eventually calls Arrow’s metadata loader: [`load_page_index`](https://github.com/apache/datafusion/blob/3017761a65a5337887612b818985d22767529804/datafusion/datasource-parquet/src/opener.rs#L1717-L1743)

**Describe the solution you'd like**

An API that exposes page-index availability separately from whether the actual index payload has been loaded.

Examples:
- `page_index_state() -> Unknown | Absent | PresentNotLoaded | PresentLoaded`
- or a smaller API such as `has_page_index() -> Option<bool>` with documented semantics

The important part is allowing callers to distinguish the following cases without an actual page-index load:

- page index absent
- page index not yet loaded


**Describe alternatives you've considered**
Do nothing. Downstream consumers can attempt an optional page-index load and infer absence from the result, but that forces extra I/O and complicates the pruning path.

**Additional context**


