Advanced example for building an external index for Row Groups *within* parquet files #10580

@alamb

Description

Is your feature request related to a problem or challenge?

It is common in databases and other analytic systems to have additional external "indexes" (perhaps stored in the "metadata catalog", perhaps stored alongside the data files, perhaps embedded in the files, perhaps elsewhere).

These indexes are used to speed up queries by "pruning": evaluating a predicate on the index and then reading only the portions of files that could pass the filters in the query. In #10546 we showed how to create an index for entire files.

I would also like to create an example of how to create such an index for row groups within a file (showing how to read the file without re-reading the metadata each time).
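To make the idea concrete, here is a minimal sketch of row-group pruning using only the Rust standard library. It does not use the DataFusion or parquet APIs: `RowGroupStats`, `RowGroupIndex`, and `prune` are hypothetical names invented for illustration, and the index is assumed to hold min/max values for a single `i64` column, built once and kept in memory so the file metadata need not be re-read per query.

```rust
use std::collections::HashMap;

/// Min/max statistics for one column in one row group.
/// (Hypothetical type for illustration, not a DataFusion or parquet type.)
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// External index: file name -> statistics for each row group in that file.
/// Assumed to be built once, outside the query path.
#[derive(Debug, Default)]
pub struct RowGroupIndex {
    files: HashMap<String, Vec<RowGroupStats>>,
}

impl RowGroupIndex {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record the per-row-group statistics for `file`.
    pub fn add_file(&mut self, file: &str, stats: Vec<RowGroupStats>) {
        self.files.insert(file.to_string(), stats);
    }

    /// Evaluate the predicate `lo <= value <= hi` against the index.
    ///
    /// Returns `None` if `file` is not indexed (the caller must then scan
    /// every row group), otherwise the indexes of the row groups whose
    /// [min, max] range overlaps [lo, hi] and so might contain matches.
    pub fn prune(&self, file: &str, lo: i64, hi: i64) -> Option<Vec<usize>> {
        let groups = self.files.get(file)?;
        Some(
            groups
                .iter()
                .enumerate()
                .filter(|(_, s)| s.max >= lo && s.min <= hi)
                .map(|(i, _)| i)
                .collect(),
        )
    }
}
```

With three row groups covering 0..=99, 100..=199, and 200..=299, calling `prune` for the range 150..=250 keeps only row groups 1 and 2. A real example would presumably populate the statistics from the parquet metadata and hand the surviving row groups to the reader, which is what the two APIs listed below are needed for.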

To complete this example, I think we need:

  1. The API from @NGA-TRAN in "[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs" #10453
  2. The API described in "API in ParquetExec to pass in RowSelections to ParquetExec (enable custom indexes, finer grained pushdown)" #9929

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

This is a follow-on to #10546.
