Description
Is your feature request related to a problem or challenge?
Historically DataFusion was one (very) large crate, datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) and code that is easier to navigate (as the crates force a cleaner dependency separation).
As described by @waynexia, the build time of DataFusion has been growing.
Some of this is due to the fact that there is more code and more features to test. However, a non-trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core
While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into their own crates.
We have successfully done this in the past with other projects such as:
- [EPIC] Extract remaining physical optimizer out of core #11502
- [Epic] Extract catalog functionality from the core to make it more modular #10782
Describe the solution you'd like
I would like to split the datasource module (https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource) out of DataFusion core.
Describe alternatives you've considered
I think we will end up with several new crates:
- datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
- datafusion-datasource-parquet: ParquetExec and file format
- datafusion-datasource-avro: AvroExec and file format
- datafusion-datasource-arrow
- datafusion-datasource-json
- datafusion-datasource-csv
I think we could start by creating datafusion-catalog-listing and pulling some of the listing table implementation into it, and then move one of the simpler datasources out (datafusion-datasource-arrow perhaps).
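To make the first step concrete, here is a minimal sketch of one way such an extraction could keep existing import paths working. It is not the actual DataFusion code: modules stand in for separate crates, the `PartitionedFile` struct and its fields are simplified placeholders, and re-exporting from core is an assumption about how the move might be done rather than something decided in this issue.

```rust
// Hypothetical sketch: each top-level module stands in for a separate crate.

/// Stand-in for the proposed `datafusion-catalog-listing` crate.
mod datafusion_catalog_listing {
    /// Simplified placeholder for a type such as `PartitionedFile`;
    /// the fields are illustrative only.
    #[derive(Debug, Clone, PartialEq)]
    pub struct PartitionedFile {
        pub path: String,
        pub size: u64,
    }

    impl PartitionedFile {
        pub fn new(path: impl Into<String>, size: u64) -> Self {
            Self { path: path.into(), size }
        }
    }
}

/// Stand-in for the existing core crate after the move.
mod datafusion_core {
    pub mod datasource {
        pub mod listing {
            // Assumed approach: re-export the moved type so the old path
            // still resolves for downstream users.
            pub use crate::datafusion_catalog_listing::PartitionedFile;
        }
    }
}

/// Returns true when the old (core) path and the new (extracted) path
/// name the same type and value.
fn old_and_new_paths_agree() -> bool {
    let via_core =
        datafusion_core::datasource::listing::PartitionedFile::new("data/part-0.parquet", 1024);
    // This assignment only type-checks because both paths refer to one type.
    let via_new_crate: datafusion_catalog_listing::PartitionedFile = via_core.clone();
    via_core == via_new_crate
}

fn main() {
    assert!(old_and_new_paths_agree());
    println!("old and new paths refer to the same type");
}
```

With a layout like this, the listing code would compile in its own crate (and in parallel with others), while downstream code importing from core would not need to change right away.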
Additional context
No response