
Conversation

@mbutrovich (Collaborator) commented on Feb 2, 2026

Which issue does this PR close?

While running Spark/Iceberg with DataFusion Comet on a workload that generates roughly 80,000 FileScanTask objects passed to the ArrowReader, we see the majority of CPU time spent in get_metadata calls via ArrowReader::create_parquet_record_batch_stream_builder.

This is a screenshot from the CPU time flame graph from one of the executors in this Spark job:

I suspect the ArrowReader is processing FileScanTasks that point at the same Parquet data files and fetching the same metadata repeatedly, burning CPU cycles on parsing and issuing extra object store calls.

What changes are included in this PR?

  • ParquetMetadataCache modeled after delete_filter.rs's behavior. The key is a composite of the file location and whether the page index was requested, since serving a caller that needs the page index from an entry that was cached without it would yield incorrect results.
  • ArrowReader has a metadata cache.
  • BasicDeleteFileLoader has a metadata cache.

Are these changes tested?

@mbutrovich (Collaborator, Author) commented:
I'm also considering caching FileMetadata alongside ParquetMetaData, since I think this code:

let (file_metadata, parquet_reader) =
    try_join!(parquet_file.metadata(), parquet_file.reader())?;

results in redundant HEAD requests when tasks refer to the same file. @Xuanwo, does OpenDAL do any caching there, or should I implement it here alongside the ParquetMetaData cache (perhaps it's best not to assume anything about the underlying Storage implementations, given @CTTY's recent work)?

@mbutrovich (Collaborator, Author) commented:

So caching didn't help our test pipeline. It turns out the table I ran this test on has a huge number of Parquet files, with nearly a 1:1 ratio of FileScanTasks to Parquet data files. I'll leave this open for discussion; there may be scenarios where it helps, but I think the Iceberg scan's target split size would have to be set very small to create the problematic case.

@mbutrovich mbutrovich closed this Feb 3, 2026
