
Conversation

@mbutrovich (Collaborator) commented on Feb 2, 2026

Which issue does this PR close?

While running Spark/Iceberg with DataFusion Comet on a workload that generates roughly 80,000 FileScanTask objects passed to the ArrowReader, we see the majority of CPU time spent in get_metadata calls via ArrowReader::create_parquet_record_batch_stream_builder.

This is a screenshot from the CPU time flame graph from one of the executors in this Spark job:

I suspect the ArrowReader is processing FileScanTasks that point at the same Parquet data files and fetching the same metadata repeatedly, burning CPU cycles on parsing and issuing extra object store calls.

What changes are included in this PR?

  • ParquetMetadataCache modeled after delete_filter.rs's behavior. The key is a composite of the file location and whether the page index was requested, since serving a caller that needs the page index from an entry that was cached without it would yield incorrect results.
  • ArrowReader has a metadata cache.
  • BasicDeleteFileLoader has a metadata cache.

Are these changes tested?

@mbutrovich (Collaborator, Author) commented:
I'm also considering caching FileMetadata alongside ParquetMetaData, since I think this code:

let (file_metadata, parquet_reader) =
    try_join!(parquet_file.metadata(), parquet_file.reader())?;

results in redundant HEAD requests when tasks refer to the same file. @Xuanwo, does OpenDAL do any caching there, or should I implement it here alongside the ParquetMetaData cache (perhaps it's best not to assume anything about the underlying Storage implementations, given @CTTY's recent work)?

@mbutrovich (Collaborator, Author) commented:

So caching didn't help our test pipeline. It turns out the table I ran this test on has a huge number of Parquet files, with nearly a 1:1 ratio of FileScanTasks to Parquet data files. I'll leave this open for discussion; there may be scenarios where it helps, but I think the Iceberg scan's target split size would have to be set very small to create the problematic case.

@mbutrovich mbutrovich closed this Feb 3, 2026
