Description
Is your feature request related to a problem or challenge?
Currently we accommodate streaming workloads within DataFusion by overloading the file IO abstractions.
This is not always a very good fit and results in a number of workarounds:
- Providing PartitionedFile information to methods that write new files via FileSinkConfig - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileSinkConfig.html#structfield.file_groups
- Passing unbounded_input to parallel write logic that really shouldn't care - https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79
- Special logic to handle reading files that aren't regular files - Update ObjectStore 0.7.0 and Arrow 46.0.0 #7282 (comment)
- The somewhat confusing notion of an "infinite" file - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html#structfield.infinite_source
- An ObjectStore::append method that doesn't naturally fit with the design goals of the project - https://github.com/apache/arrow-rs/pull/4978/files#diff-fbf0cdd67af2b24703a831d28c932bfe0535234e3b745d3a5437b9c0bc881ddeR89
- Schema inference within ListingTable currently assumes files are finite
- ObjectMeta is passed around in plans with the implicit assumption that files are not changing underneath
- And more...
As DataFusion grows more sophisticated about handling catalogs, reading/writing partitioned data, and caching data, this overloading is becoming increasingly arcane and hard to reason about, and I think it is overdue that we address it.
Describe the solution you'd like
I would like to separate the notions of FileSink and FileScan from StreamSink and StreamSource; this would allow abstractions that better fit their respective use cases (a rough sketch of what this could look like follows the list below).
In particular:
- FileSink and FileScan can focus on reading/writing partitioned immutable files following standard big data practices
- StreamSink and StreamSource can focus on reading/writing CSV / JSON (/ Avro) data from streaming sources
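To make this concrete, here is a minimal sketch of what dedicated stream abstractions might look like. The trait names and method signatures below (StreamSource, StreamSink, open, write_all) are hypothetical illustrations for this proposal, not existing DataFusion APIs:

```rust
use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::error::Result;
use datafusion::physical_plan::SendableRecordBatchStream;

/// Hypothetical source of an unbounded stream of record batches
/// (e.g. a FIFO, a socket, or a Kafka topic). Unlike a file scan,
/// there is no notion of object size, byte ranges, or immutability.
#[async_trait]
pub trait StreamSource: Send + Sync {
    /// Schema of the batches this source produces.
    fn schema(&self) -> SchemaRef;

    /// Open the stream and start yielding batches as they arrive.
    async fn open(&self) -> Result<SendableRecordBatchStream>;
}

/// Hypothetical sink that appends batches to a stream, with no
/// concept of writing, finalizing, or renaming output files.
#[async_trait]
pub trait StreamSink: Send + Sync {
    /// Write all batches from `input` to the underlying stream,
    /// returning the number of rows written.
    async fn write_all(&self, input: SendableRecordBatchStream) -> Result<u64>;
}
```

The point of the separation is that neither trait needs PartitionedFile, ObjectMeta, or an "infinite file" flag; the file-oriented abstractions can keep those assumptions, and streaming implementations simply never see them.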
Not only would this simplify the current code, it would also expand streaming support in DataFusion:
- Allows for more efficient non-blocking IO, as Linux FIFOs support poll(2), unlike regular files (see the small example after this list)
- Potential future integrations with data streaming systems such as Kafka, etc...
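As an illustration of the poll(2) point (assuming a Linux host, tokio with the "net" feature, and a made-up FIFO path), a FIFO can be read with readiness-based async IO instead of the blocking reads required for regular files:

```rust
use tokio::io::AsyncReadExt;
use tokio::net::unix::pipe;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Open an existing FIFO (created e.g. with `mkfifo /tmp/source.fifo`)
    // in non-blocking mode; regular files cannot be polled this way.
    let mut rx = pipe::OpenOptions::new().open_receiver("/tmp/source.fifo")?;

    let mut buf = vec![0u8; 8192];
    loop {
        // `read` yields to the runtime until the FIFO has data,
        // rather than blocking an OS thread.
        let n = rx.read(&mut buf).await?;
        if n == 0 {
            break; // all writers closed the FIFO
        }
        println!("read {n} bytes");
    }
    Ok(())
}
```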
Describe alternatives you've considered
No response
Additional context
No response