Is your feature request related to a problem or challenge? Please describe what you are trying to do.
apache/datafusion ran into this while working on page pruning in apache/datafusion#21556.
We are trying to reduce unnecessary work in low-latency scan paths and simplify page-pruning control flow.
Today, ParquetMetaData::column_index and ParquetMetaData::offset_index return None both when the file has no page index and when the page index has not been fetched yet. That behavior is tied to how ParquetMetaDataReader::load_page_index works today.
That makes it hard for downstream consumers to optimize page-pruning flow. In DataFusion, for example, we want to:
- avoid loading page-index metadata unless there is a usable page-pruning predicate
- avoid building page-pruning predicates when the file has no page index
The first part is possible today. The second is not, because when indexes are not already loaded, None is ambiguous.
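Relevant DataFusion code:
- has_page_index
- build_page_pruning_predicate
- FiltersPreparedParquetOpen::load_page_index
- load_page_index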
Describe the solution you'd like
An API that exposes page-index availability separately from whether the actual index payload has been loaded.
Examples:
- page_index_state() -> Unknown | Absent | PresentNotLoaded | PresentLoaded
- or a smaller API such as has_page_index() -> Option<bool> with documented semantics
The important part is allowing callers to distinguish the following cases without an actual page-index load:
- page index absent
- page index not yet loaded
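To make the request concrete, here is a rough sketch of how the two shapes above could fit together and how a caller might branch on them. Everything in it is hypothetical: PageIndexState, PruningAction, and plan_page_pruning are illustrative names, not existing parquet or DataFusion APIs. It also assumes that Unknown means "availability has not been determined" and that callers fall back to today's behavior (attempt the load) in that case.

```rust
/// Hypothetical availability states, mirroring the names proposed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexState {
    /// Availability has not been determined (assumed meaning).
    Unknown,
    /// The file has no page index.
    Absent,
    /// The file has a page index, but its payload has not been loaded.
    PresentNotLoaded,
    /// The file has a page index and its payload is loaded.
    PresentLoaded,
}

impl PageIndexState {
    /// The smaller `has_page_index() -> Option<bool>` shape, derived from the state.
    /// `None` means "not known yet", never "absent".
    pub fn has_page_index(self) -> Option<bool> {
        match self {
            PageIndexState::Unknown => None,
            PageIndexState::Absent => Some(false),
            PageIndexState::PresentNotLoaded | PageIndexState::PresentLoaded => Some(true),
        }
    }
}

/// Hypothetical caller-side decision for the page-pruning path.
#[derive(Debug, PartialEq, Eq)]
pub enum PruningAction {
    /// Do not build a page-pruning predicate and do not load the index.
    SkipPagePruning,
    /// Load the index payload first, then prune.
    LoadIndexThenPrune,
    /// The index payload is already available; prune directly.
    PruneWithLoadedIndex,
}

/// Decide what to do without performing any page-index I/O.
pub fn plan_page_pruning(state: PageIndexState, has_usable_predicate: bool) -> PruningAction {
    // No usable predicate: never touch the page index.
    if !has_usable_predicate {
        return PruningAction::SkipPagePruning;
    }
    match state {
        // Known absent: skip predicate construction entirely.
        PageIndexState::Absent => PruningAction::SkipPagePruning,
        PageIndexState::PresentLoaded => PruningAction::PruneWithLoadedIndex,
        // Unknown falls back to today's behavior: attempt the load.
        PageIndexState::Unknown | PageIndexState::PresentNotLoaded => {
            PruningAction::LoadIndexThenPrune
        }
    }
}
```

With something like this, the Absent case lets a consumer skip building the page-pruning predicate without issuing a page-index read, which is the part that is not possible today.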
Describe alternatives you've considered
Do nothing: downstream consumers can attempt an optional page-index load and infer absence from the result. However, that forces extra I/O and complicates the pruning path.
Additional context