Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the file scan operators such as ParquetExec, CsvExec, etc. are created with a FileScanConfig, which internally contains a list of PartitionedFile. These PartitionedFiles are grouped together into "file groups". For each of these groups, the operator exposes a DataFusion partition that scans the group's files sequentially.
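For illustration, here is a heavily simplified, self-contained model of that shape (the struct and field names only approximate the real FileScanConfig / PartitionedFile types): the key point is that each inner Vec, i.e. each file group, becomes one DataFusion partition whose files are read back to back.

```rust
// Simplified stand-ins for the real types; field names are approximate.
struct PartitionedFile {
    path: String,
}

struct FileScanConfig {
    // One inner Vec per "file group"; each group maps to one DataFusion
    // partition of the resulting ParquetExec/CsvExec.
    file_groups: Vec<Vec<PartitionedFile>>,
}

impl FileScanConfig {
    // Number of DataFusion partitions exposed by the operator.
    fn output_partitioning(&self) -> usize {
        self.file_groups.len()
    }

    // Partition `i` scans its files one after another, in order.
    fn execute(&self, partition: usize) {
        for file in &self.file_groups[partition] {
            println!("scanning {}", file.path);
        }
    }
}

fn main() {
    let config = FileScanConfig {
        file_groups: vec![
            vec![
                PartitionedFile { path: "data/part-0.parquet".into() },
                PartitionedFile { path: "data/part-1.parquet".into() },
            ],
            vec![PartitionedFile { path: "data/part-2.parquet".into() }],
        ],
    };
    for p in 0..config.output_partitioning() {
        config.execute(p);
    }
}
```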
Whilst this works, I'm getting a little concerned that we are ending up with quite a lot of complexity within each of the individual, file-format-specific operators:
- The individual files within a file group may have differing schemas - Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
- If using Hive partitioning, the additional columns derived from the partition key need to be projected - Implement partitioned read in listing table provider #1139
- Edge cases where the file isn't even needed - Error occurs when only using partition columns in query #1999
- Intra-file parallelism - Make it possible to only scan part of a parquet file in a partition #1990
This in turn comes with some downsides:
- Code duplication between operators for different formats with potential for feature and functionality divergence
- The operators are getting very large and quite hard to reason about
- Execution details are hidden from the physical plan, potentially limiting parallelism, optimisation, introspection, etc...
- Catalog details, such as the partitioning scheme, leak into the physical operators themselves
Describe the solution you'd like
This isn't a fully formed proposal, but I wonder if, instead of continuing to extend the individual file-format operators, we might compose simpler physical operators together within the query plan. Specifically, we might make the ParquetExec, CsvExec, etc. operators each handle a single file, and have the planning stage within TableProvider::scan construct a more complex physical plan containing ProjectionExec, SchemaAdapter, etc. operators as necessary.
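As a rough sketch of what that composition might look like (everything below is hypothetical and self-contained, not the current DataFusion API): a TableProvider::scan-like planning step would build one single-file scan node per file, wrap each in whatever projection/schema-adaptation node it needs, and union the results, so the per-file operators stay trivial and the composition is visible in the physical plan.

```rust
// A deliberately tiny, hypothetical model of a physical plan, just to show
// the shape of the composition; none of these types are DataFusion's.
enum PhysicalPlan {
    // One file per scan node, rather than one node owning many file groups.
    ParquetScan { path: String },
    // Adapts/projects a child's schema, e.g. adding Hive partition columns
    // or reordering/padding columns to match the table schema.
    SchemaAdapter { target_columns: Vec<String>, input: Box<PhysicalPlan> },
    // Concatenates the partitions of its children.
    Union { inputs: Vec<PhysicalPlan> },
}

// What the planning step might produce: the catalog-level knowledge (file
// list, partition values, table schema) becomes plan structure instead of
// being hidden inside one monolithic ParquetExec.
fn plan_scan(files: &[&str], table_columns: &[&str]) -> PhysicalPlan {
    let per_file = files
        .iter()
        .map(|path| PhysicalPlan::SchemaAdapter {
            target_columns: table_columns.iter().map(|c| c.to_string()).collect(),
            input: Box::new(PhysicalPlan::ParquetScan { path: path.to_string() }),
        })
        .collect();
    PhysicalPlan::Union { inputs: per_file }
}

fn main() {
    let plan = plan_scan(
        &["date=2022-01-01/a.parquet", "date=2022-01-02/b.parquet"],
        &["date", "value"],
    );

    // A real implementation would hand `plan` to the optimizer/executor;
    // here we just print the tree to show the structure is inspectable.
    fn describe(plan: &PhysicalPlan, indent: usize) {
        let pad = " ".repeat(indent);
        match plan {
            PhysicalPlan::ParquetScan { path } => println!("{pad}ParquetScan({path})"),
            PhysicalPlan::SchemaAdapter { target_columns, input } => {
                println!("{pad}SchemaAdapter({})", target_columns.join(", "));
                describe(input, indent + 2);
            }
            PhysicalPlan::Union { inputs } => {
                println!("{pad}Union");
                for child in inputs {
                    describe(child, indent + 2);
                }
            }
        }
    }
    describe(&plan, 0);
}
```

Because the per-file schema adaptation and partition-column projection live in their own nodes, they can be optimised, parallelised, and inspected like any other operator, rather than being baked into each file format's scan implementation.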
For what it is worth, IOx uses a similar approach (https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider.rs#L282) and it works quite well.
Describe alternatives you've considered
The current approach could remain
Additional context
I'm trying to take a more holistic view of what the upstream parquet interface should look like, which is somewhat related to apache/arrow-rs#1473