Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the file scan operators such as ParquetExec, CsvExec, etc. are created with a FileScanConfig, which internally contains a list of PartitionedFile. These PartitionedFiles are grouped together into "file groups". For each of these groups, the operator exposes a DataFusion partition that scans the group's files sequentially.
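For illustration, here is a heavily simplified, self-contained model of that shape (the struct and field names only approximate the real FileScanConfig / PartitionedFile types): the key point is that each inner Vec, i.e. each file group, becomes one DataFusion partition whose files are read back to back.

```rust
// Simplified stand-ins for the real types; field names are approximate.
struct PartitionedFile {
    path: String,
}

struct FileScanConfig {
    // One inner Vec per "file group"; each group maps to one DataFusion
    // partition of the resulting ParquetExec/CsvExec.
    file_groups: Vec<Vec<PartitionedFile>>,
}

impl FileScanConfig {
    // Number of DataFusion partitions exposed by the operator.
    fn output_partitioning(&self) -> usize {
        self.file_groups.len()
    }

    // Partition `i` scans its files one after another, in order.
    fn execute(&self, partition: usize) {
        for file in &self.file_groups[partition] {
            println!("scanning {}", file.path);
        }
    }
}

fn main() {
    let config = FileScanConfig {
        file_groups: vec![
            vec![
                PartitionedFile { path: "data/part-0.parquet".into() },
                PartitionedFile { path: "data/part-1.parquet".into() },
            ],
            vec![PartitionedFile { path: "data/part-2.parquet".into() }],
        ],
    };
    for p in 0..config.output_partitioning() {
        config.execute(p);
    }
}
```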
Whilst this works, I'm getting a little concerned that we are ending up with quite a lot of complexity within each of the individual, file-format-specific operators:
- The individual files within a file group may have differing schemas - Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
- If using Hive partitioning, the additional columns derived from the partition key need to be projected - Implement partitioned read in listing table provider #1139
- Edge cases where the file isn't even needed - Error occurs when only using partition columns in query #1999
- Intra-file parallelism - Make it possible to only scan part of a parquet file in a partition #1990
This in turn comes with some downsides:
- Code duplication between operators for different formats with potential for feature and functionality divergence
- The operators are getting very large and quite hard to reason about
- Execution details are hidden from the physical plan, potentially limiting parallelism, optimisation, introspection, etc...
- Catalog details, such as the partitioning scheme, leak into the physical operators themselves
Describe the solution you'd like
This isn't a fully formed proposal, but I wonder if, instead of continuing to extend the individual file-format operators, we might compose simpler physical operators together within the query plan. Specifically, we might make the ParquetExec, CsvExec, etc. operators each handle a single file, and have the planning stage within TableProvider::scan construct a more complex physical plan containing ProjectionExec, SchemaAdapter, etc. operators as necessary.
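As a rough sketch of what that composition might look like (everything below is hypothetical and self-contained, not the current DataFusion API): a TableProvider::scan-like planning step would build one single-file scan node per file, wrap each in whatever projection/schema-adaptation node it needs, and union the results, so the per-file operators stay trivial and the composition is visible in the physical plan.

```rust
// A deliberately tiny, hypothetical model of a physical plan, just to show
// the shape of the composition; none of these types are DataFusion's.
enum PhysicalPlan {
    // One file per scan node, rather than one node owning many file groups.
    ParquetScan { path: String },
    // Adapts/projects a child's schema, e.g. adding Hive partition columns
    // or reordering/padding columns to match the table schema.
    SchemaAdapter { target_columns: Vec<String>, input: Box<PhysicalPlan> },
    // Concatenates the partitions of its children.
    Union { inputs: Vec<PhysicalPlan> },
}

// What the planning step might produce: the catalog-level knowledge (file
// list, partition values, table schema) becomes plan structure instead of
// being hidden inside one monolithic ParquetExec.
fn plan_scan(files: &[&str], table_columns: &[&str]) -> PhysicalPlan {
    let per_file = files
        .iter()
        .map(|path| PhysicalPlan::SchemaAdapter {
            target_columns: table_columns.iter().map(|c| c.to_string()).collect(),
            input: Box::new(PhysicalPlan::ParquetScan { path: path.to_string() }),
        })
        .collect();
    PhysicalPlan::Union { inputs: per_file }
}

fn main() {
    let plan = plan_scan(
        &["date=2022-01-01/a.parquet", "date=2022-01-02/b.parquet"],
        &["date", "value"],
    );

    // A real implementation would hand `plan` to the optimizer/executor;
    // here we just print the tree to show the structure is inspectable.
    fn describe(plan: &PhysicalPlan, indent: usize) {
        let pad = " ".repeat(indent);
        match plan {
            PhysicalPlan::ParquetScan { path } => println!("{pad}ParquetScan({path})"),
            PhysicalPlan::SchemaAdapter { target_columns, input } => {
                println!("{pad}SchemaAdapter({})", target_columns.join(", "));
                describe(input, indent + 2);
            }
            PhysicalPlan::Union { inputs } => {
                println!("{pad}Union");
                for child in inputs {
                    describe(child, indent + 2);
                }
            }
        }
    }
    describe(&plan, 0);
}
```

Because the per-file schema adaptation and partition-column projection live in their own nodes, they can be optimised, parallelised, and inspected like any other operator, rather than being baked into each file format's scan implementation.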
For what it is worth, IOx uses a similar approach (https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider.rs#L282) and it works quite well.
Describe alternatives you've considered
The current approach could remain
Additional context
I'm trying to take a more holistic view of what the upstream parquet interface should look like, which is somewhat related to apache/arrow-rs#1473