Skip to content

RFC: More Granular File Operators #2079

@tustvold

Description

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently the file scan operators such as ParquetExec, CsvExec, etc... are created with a FileScanConfig, which internally contains a list of PartitionedFile. These PartitionedFile are provided grouped together in "file groups". For each of these groups, the operators expose a DataFusion partition which will scan these files sequentially.

Whilst this works, I'm getting a little bit concerned we are ending up with quite a lot of complexity within each of the individual, file-format specific operators:

This in turn comes with some downsides:

  • Code duplication between operators for different formats with potential for feature and functionality divergence
  • The operators are getting very large and quite hard to reason about
  • Execution details are hidden from the physical plan, potentially limiting parallelism, optimisation, introspection, etc...
  • Catalog details, such as the partitioning scheme, leak into the physical operators themselves

Describe the solution you'd like

It isn't a fully formed proposal, but I wonder if instead of continuing to extend the individual file format operators we might instead compose together simpler physical operators within the query plan. Specifically I wonder if we might make it so that the ParquetExec, CsvExec operators handle a single file, and the plan stage within TableProvider::scan instead constructs a more complex physical plan containing the necessary ProjectionExec, SchemaAdapter operators as necessary.

For what it is worth, IOx uses a similar approach https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider.rs#L282 and it works quite well.

Describe alternatives you've considered

The current approach could remain

Additional context

I'm trying to take a more holistic view on what the parquet interface upstream should look like, which is somewhat related to this apache/arrow-rs#1473

FYI @rdettai @yjshen @alamb

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions