Decouple Streaming Use-Case from File IO Abstractions #7994

@tustvold

Is your feature request related to a problem or challenge?

Currently we accommodate streaming workloads within DataFusion by overloading the file IO abstractions.

This is not always a very good fit and results in a number of workarounds:

As DataFusion gets more sophisticated about handling catalogs, reading/writing partitioned data, and caching data, this overloading is becoming increasingly arcane and hard to reason about, and I think it is overdue that we address it.

Describe the solution you'd like

I would like to separate the notions of FileSink and FileScan from StreamSink and StreamSource. This would allow abstractions that better fit their respective use-cases.

In particular:

  • FileSink and FileScan can focus on reading/writing partitioned immutable files following standard big data practices
  • StreamSink and StreamSource can focus on reading/writing CSV / JSON (/ Avro) data from streaming sources
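
As a rough illustration of what the stream-side split could look like, here is a minimal sketch using only the Rust standard library. The trait shapes, the `LineSource` / `LineSink` types, and the `pipe` and `demo` helpers are all hypothetical and are not existing DataFusion APIs; a real design would presumably be async and exchange Arrow record batches rather than lines of text. The point is only that a stream abstraction needs no notion of paths, file sizes, or seeking.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read, Write};

/// Hypothetical trait: an unbounded source that yields batches of records.
/// Note the absence of file concepts such as paths, sizes, or byte ranges.
trait StreamSource {
    /// Returns up to `max_rows` records, or `None` once the stream has ended.
    fn next_batch(&mut self, max_rows: usize) -> io::Result<Option<Vec<String>>>;
}

/// Hypothetical trait: a sink that accepts batches as they arrive.
trait StreamSink {
    fn write_batch(&mut self, batch: &[String]) -> io::Result<()>;
}

/// Newline-delimited source (e.g. CSV or JSON lines) over any `Read`.
struct LineSource<R: Read> {
    reader: BufReader<R>,
}

impl<R: Read> StreamSource for LineSource<R> {
    fn next_batch(&mut self, max_rows: usize) -> io::Result<Option<Vec<String>>> {
        let mut batch = Vec::new();
        while batch.len() < max_rows {
            let mut line = String::new();
            // `read_line` returns 0 bytes at end of stream.
            if self.reader.read_line(&mut line)? == 0 {
                break;
            }
            batch.push(line.trim_end_matches(|c| c == '\n' || c == '\r').to_string());
        }
        Ok(if batch.is_empty() { None } else { Some(batch) })
    }
}

/// Newline-delimited sink over any `Write`.
struct LineSink<W: Write> {
    writer: W,
}

impl<W: Write> StreamSink for LineSink<W> {
    fn write_batch(&mut self, batch: &[String]) -> io::Result<()> {
        for row in batch {
            writeln!(self.writer, "{}", row)?;
        }
        self.writer.flush()
    }
}

/// Drains `source` into `sink` and returns the number of rows moved.
fn pipe(
    source: &mut dyn StreamSource,
    sink: &mut dyn StreamSink,
    max_rows: usize,
) -> io::Result<usize> {
    let mut total = 0;
    while let Some(batch) = source.next_batch(max_rows)? {
        total += batch.len();
        sink.write_batch(&batch)?;
    }
    Ok(total)
}

/// Usage: copy three in-memory CSV rows through the traits, two rows per batch.
fn demo() -> io::Result<(usize, String)> {
    let input = Cursor::new("a,1\nb,2\nc,3\n");
    let mut source = LineSource { reader: BufReader::new(input) };
    let mut sink = LineSink { writer: Vec::new() };
    let rows = pipe(&mut source, &mut sink, 2)?;
    Ok((rows, String::from_utf8_lossy(&sink.writer).into_owned()))
}
```

Because `LineSource` is generic over `Read`, the same code would work over a FIFO, a socket, or stdin, none of which fit a partitioned, immutable file model.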

Not only would this simplify the current code, but it would also expand the streaming support in DataFusion:

  • Allows for more efficient non-blocking IO, because Linux FIFOs support poll(2), unlike regular files
  • Potential future integrations with data streaming systems such as Kafka

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement
