Description
Describe the bug
Summary
When a Parquet physical plan is serialized and deserialized through protobuf, the decoded ParquetSource has no ParquetFileReaderFactory attached. That much is expected, but we are missing a step that re-creates the ParquetFileReaderFactory and re-attaches it with a reference to the local runtime_env.cache_manager.get_file_metadata_cache().
Issue / Impact
If the factory is not re-attached, repeated Parquet scans during distributed execution show a constant extra bytes_scanned overhead per DataSourceExec (metadata reads are repeated on every scan instead of being served from the cache). With a local patch (datafusion-contrib/datafusion-distributed#359) that attaches a cached reader factory during decode, this overhead disappears and scan bytes match single-node DataFusion values. However, the fix seems to belong upstream, where the factory can be attached to enable caching for all consumers.
To Reproduce
To reproduce:
1. Build a Parquet physical plan with a DataSourceExec that has a Parquet reader factory attached
2. Round-trip it via PhysicalPlanNode::try_from_physical_plan + try_into_physical_plan
3. Check the decoded ParquetSource: parquet_file_reader_factory().is_some() is currently false, i.e. the factory is gone
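The steps above can be sketched roughly as follows. This is a hedged sketch, not a ready-to-run test: module paths and the downcast chain vary across DataFusion versions, and `plan`, `registry`, `runtime_env`, and `codec` are assumed to come from the surrounding test harness.

```rust
use std::sync::Arc;

use datafusion::datasource::physical_plan::{FileScanConfig, ParquetSource};
use datafusion::datasource::source::DataSourceExec;
use datafusion::error::Result;
use datafusion::execution::runtime_env::RuntimeEnv;
use datafusion::execution::FunctionRegistry;
use datafusion::physical_plan::ExecutionPlan;
use datafusion_proto::physical_plan::{PhysicalExtensionCodec, PhysicalPlanNode};

fn check_roundtrip(
    plan: Arc<dyn ExecutionPlan>,
    registry: &dyn FunctionRegistry,
    runtime_env: &RuntimeEnv,
    codec: &dyn PhysicalExtensionCodec,
) -> Result<()> {
    // Encode the plan to protobuf and decode it back.
    let node = PhysicalPlanNode::try_from_physical_plan(plan, codec)?;
    let decoded = node.try_into_physical_plan(registry, runtime_env, codec)?;

    // Dig the ParquetSource back out of the decoded DataSourceExec.
    let exec = decoded.as_any().downcast_ref::<DataSourceExec>().unwrap();
    let config = exec
        .data_source()
        .as_any()
        .downcast_ref::<FileScanConfig>()
        .unwrap();
    let source = config
        .file_source()
        .as_any()
        .downcast_ref::<ParquetSource>()
        .unwrap();

    // Currently fails: the factory is dropped during serialization
    // and nothing re-attaches it on decode.
    assert!(source.parquet_file_reader_factory().is_some());
    Ok(())
}
```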
Expected behavior
After decoding, a ParquetFileReaderFactory should be re-created and attached to the ParquetSource, wired to the local runtime environment's file metadata cache, so that parquet_file_reader_factory().is_some() holds and repeated scans can reuse cached metadata.
Additional context
You can look at PR description here for an example: datafusion-contrib/datafusion-distributed#359
Fix
PR: #20574