[Parquet Metadata] API to determine page-index presence separately from page-index load #9693

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

apache/datafusion ran into this while working on page pruning in apache/datafusion#21556.

We are trying to reduce unnecessary work in low-latency scan paths and simplify page-pruning control flow.

Today, ParquetMetaData::column_index and ParquetMetaData::offset_index return None in two different situations: when the file has no page index, and when the file has a page index that has not been fetched yet. That behavior follows from how ParquetMetaDataReader::load_page_index works today.

That makes it hard for downstream consumers to optimize page-pruning flow. In DataFusion, for example, we want to:

  • avoid loading page-index metadata unless there is a usable page-pruning predicate
  • avoid building page-pruning predicates when the file has no page index

The first part is possible today. The second is not: until the indexes have been loaded, None does not tell the caller whether the file has a page index at all.
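To make the ambiguity concrete, here is a minimal toy model. `ToyMetadata` and its methods are hypothetical stand-ins written for this issue, not the real `ParquetMetaData` or `ParquetMetaDataReader` types; the only thing it tries to mirror is that the accessor returns `None` until a load has succeeded.

```rust
/// Toy stand-in for the relevant part of `ParquetMetaData` (hypothetical, not the real type).
#[derive(Debug)]
pub struct ToyMetadata {
    /// Ground truth about the file. Callers cannot observe this directly.
    file_has_page_index: bool,
    /// Populated only after a successful page-index load.
    loaded_column_index: Option<Vec<String>>,
}

impl ToyMetadata {
    /// Metadata as it looks right after the footer is decoded: nothing loaded yet.
    pub fn new(file_has_page_index: bool) -> Self {
        Self {
            file_has_page_index,
            loaded_column_index: None,
        }
    }

    /// Mirrors the accessor behavior described above: `None` until loaded.
    pub fn column_index(&self) -> Option<&[String]> {
        self.loaded_column_index.as_deref()
    }

    /// Toy "load": only produces an index if the file actually has one.
    pub fn load_page_index(&mut self) {
        if self.file_has_page_index {
            self.loaded_column_index = Some(vec!["page stats".to_string()]);
        }
    }
}
```

Before any load, `ToyMetadata::new(false).column_index()` and `ToyMetadata::new(true).column_index()` are both `None`, so a caller cannot tell the two files apart without paying for the load.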

Relevant DataFusion code:

Describe the solution you'd like

An API that exposes page-index availability separately from whether the actual index payload has been loaded.

Examples:

  • page_index_state() -> Unknown | Absent | PresentNotLoaded | PresentLoaded
  • or a smaller API such as has_page_index() -> Option<bool> with documented semantics
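A rough sketch of how the two shapes could relate, assuming the four states listed above. The variant names come from this issue; the derive list, the doc comments, and the mapping to `Option<bool>` are only one possible reading of "documented semantics", not an existing parquet API.

```rust
/// Sketch of the proposed state, using the variant names from this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexState {
    /// Presence cannot be determined from the metadata at hand.
    Unknown,
    /// The file has no page index.
    Absent,
    /// The file has a page index, but its payload has not been loaded.
    PresentNotLoaded,
    /// The file has a page index and its payload is loaded.
    PresentLoaded,
}

impl PageIndexState {
    /// The smaller API expressed in terms of the larger one (assumed semantics):
    /// `None` = unknown, `Some(false)` = absent, `Some(true)` = present.
    pub fn has_page_index(self) -> Option<bool> {
        match self {
            PageIndexState::Unknown => None,
            PageIndexState::Absent => Some(false),
            PageIndexState::PresentNotLoaded | PageIndexState::PresentLoaded => Some(true),
        }
    }
}
```

The smaller API drops the loaded/not-loaded distinction, which callers can already recover from whether `column_index` returns `Some`.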

The important part is allowing callers to distinguish the following cases without an actual page-index load:

  • page index absent
  • page index not yet loaded
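With that distinction available, a consumer's control flow could look roughly like the sketch below. `PruningPlan` and `plan_page_pruning` are hypothetical names for illustration, not DataFusion code, and the `Option<bool>` argument assumes the `has_page_index()` semantics suggested above (`Some(false)` meaning the index is known to be absent).

```rust
/// What the scan should do about page pruning (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum PruningPlan<P> {
    /// Neither build a predicate nor load the page index.
    Skip,
    /// Load the page index if needed, then prune with this predicate.
    LoadAndPrune(P),
}

/// Decide whether page pruning is worth attempting.
///
/// `build_predicate` is called lazily, so no predicate is built for files
/// that are known to have no page index.
pub fn plan_page_pruning<P>(
    has_page_index: Option<bool>,
    build_predicate: impl FnOnce() -> Option<P>,
) -> PruningPlan<P> {
    // Known absent: skip before doing any predicate work or I/O.
    if has_page_index == Some(false) {
        return PruningPlan::Skip;
    }
    // Present or unknown: only load the index if a usable predicate exists.
    match build_predicate() {
        Some(predicate) => PruningPlan::LoadAndPrune(predicate),
        None => PruningPlan::Skip,
    }
}
```

In the unknown case this falls back to what is possible today: the load is attempted only when there is a usable predicate.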

Describe alternatives you've considered

Do nothing. Downstream consumers can attempt an optional page-index load and infer absence from the result, but that forces extra I/O and complicates the pruning path.

Additional context
