
ParquetScan decode path does not re-attach CachedParquetFileReaderFactory #20575

@nathanb9


Describe the bug

Summary

When a Parquet physical plan is serialized and deserialized through protobuf, the decoded `ParquetSource` does not have a `ParquetFileReaderFactory` attached. That much is expected, but we are missing a step to re-create the factory and re-attach it with a reference to the local `runtime_env.cache_manager.get_file_metadata_cache()`.

Issue / Impact

If the factory is not re-attached, repeated Parquet scans during distributed execution show a constant extra `bytes_scanned` overhead per `DataSourceExec`, because metadata-related reads are repeated. Testing with a local patch (datafusion-contrib/datafusion-distributed#359) that attaches the cached reader factory during decode, this overhead disappears and scan bytes match single-node DataFusion values. The fix seems to belong here upstream, though, where the decode path can attach the factory to enable caching.

To Reproduce

To reproduce:
1. Build a Parquet physical plan containing a `DataSourceExec` whose `ParquetSource` has a Parquet file reader factory attached.
2. Round-trip it via `PhysicalPlanNode::try_from_physical_plan` + `try_into_physical_plan`.
3. Check `parquet_file_reader_factory().is_some()` on the decoded `ParquetSource` — it currently returns `false`.
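The steps above can be sketched roughly as follows. This is a hedged sketch, not a runnable test: it assumes the `datafusion` and `datafusion-proto` crates, and the table name `my_parquet_table` plus the downcast chain in the comments are hypothetical; exact module paths and signatures vary by DataFusion version.

```rust
use datafusion::prelude::SessionContext;
use datafusion_proto::physical_plan::{AsExecutionPlan, DefaultPhysicalExtensionCodec};
use datafusion_proto::protobuf::PhysicalPlanNode;

async fn roundtrip_loses_factory(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // 1. Build a plan containing a DataSourceExec whose ParquetSource has a
    //    ParquetFileReaderFactory attached (hypothetical registered table).
    let plan = ctx
        .sql("SELECT * FROM my_parquet_table")
        .await?
        .create_physical_plan()
        .await?;

    // 2. Round-trip the plan through the protobuf representation.
    let codec = DefaultPhysicalExtensionCodec {};
    let proto = PhysicalPlanNode::try_from_physical_plan(plan, &codec)?;
    let decoded = proto.try_into_physical_plan(ctx, ctx.runtime_env().as_ref(), &codec)?;

    // 3. Inspect the decoded ParquetSource (downcast chain depends on the
    //    DataFusion version, shown here as pseudocode):
    // let source = decoded_data_source_exec.source().as_any()
    //     .downcast_ref::<ParquetSource>().unwrap();
    // assert!(source.parquet_file_reader_factory().is_none()); // <- the bug
    let _ = decoded;
    Ok(())
}
```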

Expected behavior

The decode path should re-create a cached `ParquetFileReaderFactory` and attach it to the decoded `ParquetSource`, so that decoded plans benefit from file-metadata caching just like locally built plans.
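A minimal sketch of the proposed re-attachment, assuming a `CachedParquetFileReaderFactory` constructor that takes an object store and a metadata cache (the actual constructor and where `object_store` comes from should be checked against PR #20574):

```rust
// Inside the ParquetSource decode path (sketch only):
let metadata_cache = runtime_env.cache_manager.get_file_metadata_cache();
let factory = Arc::new(CachedParquetFileReaderFactory::new(
    object_store,   // hypothetical: store used for the scanned files
    metadata_cache, // local file-metadata cache from the runtime env
));
let source = parquet_source.with_parquet_file_reader_factory(factory);
```

With the factory attached, repeated scans reuse cached Parquet footers/metadata instead of re-reading them on every `DataSourceExec`.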

Additional context

See the PR description for an example: datafusion-contrib/datafusion-distributed#359

Fix

PR: #20574
