Description
Is your feature request related to a problem or challenge?
Currently we accommodate streaming workloads within DataFusion by overloading the file IO abstractions.
This is not always a very good fit and results in a number of workarounds:
- Providing PartitionedFile information to methods that write new files via FileSinkConfig - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileSinkConfig.html#structfield.file_groups
- Passing unbounded_input to parallel write logic that really shouldn't care - https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79
- Special logic to handle reading files that aren't regular files - Update ObjectStore 0.7.0 and Arrow 46.0.0 #7282 (comment)
- The somewhat confusing notion of an "infinite" file - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html#structfield.infinite_source
- An ObjectStore::append method that doesn't naturally fit with the design goals of the project - https://github.com/apache/arrow-rs/pull/4978/files#diff-fbf0cdd67af2b24703a831d28c932bfe0535234e3b745d3a5437b9c0bc881ddeR89
- Schema inference within ListingTable currently assumes files are finite
- ObjectMeta is passed around in plans with the implicit assumption that files are not changing underneath
- And more...
As DataFusion grows more sophisticated about handling catalogs, reading/writing partitioned data, and caching data, this overloading is becoming increasingly arcane and hard to reason about, and I think it is overdue that we address it.
Describe the solution you'd like
I would like to separate the notions of FileSink and FileScan from StreamSink and StreamSource; this would allow abstractions that better fit their respective use cases (a rough sketch of what this could look like follows the list below).
In particular:
- FileSink and FileScan can focus on reading/writing partitioned immutable files following standard big data practices
- StreamSink and StreamSource can focus on reading/writing CSV / JSON (/ Avro) data from streaming sources
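To make this concrete, here is a minimal sketch of what dedicated stream abstractions might look like. The trait names and method signatures below (StreamSource, StreamSink, open, write_all) are hypothetical illustrations for this proposal, not existing DataFusion APIs:

```rust
use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::error::Result;
use datafusion::physical_plan::SendableRecordBatchStream;

/// Hypothetical source of an unbounded stream of record batches
/// (e.g. a FIFO, a socket, or a Kafka topic). Unlike a file scan,
/// there is no notion of object size, byte ranges, or immutability.
#[async_trait]
pub trait StreamSource: Send + Sync {
    /// Schema of the batches this source produces.
    fn schema(&self) -> SchemaRef;

    /// Open the stream and start yielding batches as they arrive.
    async fn open(&self) -> Result<SendableRecordBatchStream>;
}

/// Hypothetical sink that appends batches to a stream, with no
/// concept of writing, finalizing, or renaming output files.
#[async_trait]
pub trait StreamSink: Send + Sync {
    /// Write all batches from `input` to the underlying stream,
    /// returning the number of rows written.
    async fn write_all(&self, input: SendableRecordBatchStream) -> Result<u64>;
}
```

The point of the separation is that neither trait needs PartitionedFile, ObjectMeta, or an "infinite file" flag; the file-oriented abstractions can keep those assumptions, and streaming implementations simply never see them.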
Not only would this simplify the current code, it would also expand streaming support in DataFusion:
- Allows for more efficient non-blocking IO, as Linux FIFOs support poll(2), unlike regular files (see the small example after this list)
- Potential future integrations with data streaming systems such as Kafka, etc...
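As an illustration of the poll(2) point (assuming a Linux host, tokio with the "net" feature, and a made-up FIFO path), a FIFO can be read with readiness-based async IO instead of the blocking reads required for regular files:

```rust
use tokio::io::AsyncReadExt;
use tokio::net::unix::pipe;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Open an existing FIFO (created e.g. with `mkfifo /tmp/source.fifo`)
    // in non-blocking mode; regular files cannot be polled this way.
    let mut rx = pipe::OpenOptions::new().open_receiver("/tmp/source.fifo")?;

    let mut buf = vec![0u8; 8192];
    loop {
        // `read` yields to the runtime until the FIFO has data,
        // rather than blocking an OS thread.
        let n = rx.read(&mut buf).await?;
        if n == 0 {
            break; // all writers closed the FIFO
        }
        println!("read {n} bytes");
    }
    Ok(())
}
```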
Describe alternatives you've considered
No response
Additional context
No response