Description
TLDR
- Keep datafusion-cli in the apache/datafusion repo
- Make a new repo with a new CLI called dfdb (or datafusion-cli++ or dfcli) which is purposely designed for running queries against a wide variety of pre-integrated sources
Problem Statement
As of today, datafusion-cli (docs) serves two roles:
- A debugging / testing tool for the DataFusion query engine developers
- A CLI tool for actually doing useful processing of files (locally and remotely using object store), similar to the duckdb CLI tool
It is really sweet to have a CLI that lets you query a directory of parquet files:
DataFusion CLI v41.0.0
>
> select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| WatchID | EventDate | URL |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| 6904841588848398438 | 15895 | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437 |
...
| 7551542980199423249 | 15895 | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437 |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.059 seconds.
However, similar to the discussion we have had about datafusion-python, this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experience.
Examples of Friction
I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example:
- Hugging Face: Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792 from @xinlifoobar
- FlightSQL: Generic FlightTableFactory with a default FlightSqlDriver #11938 from @ccciudatu
I realize I have been partly responsible for this mess and for that I apologize.
Proposal
I propose resolving this conflict by creating a new repository for the "CLI tool people actually use"
We would keep datafusion-cli as it is: a relatively small, thin wrapper around the core engine. I don't think we should remove features, but we also wouldn't add them (other than what was added to the engine by default).
We would add many new features / capabilities to this dfdb tool.
Examples of new features
There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):
- apache iceberg: https://github.com/apache/iceberg-rust
- delta-rs: https://github.com/delta-io/delta-rs
- hudi: https://github.com/apache/hudi-rs
- The providers in https://github.com/datafusion-contrib/datafusion-table-providers from @phillipleblanc et al
@philippemnoel actually refers to the lack of built-in Apache Iceberg support in his blog about switching to using duckdb. This is sad given all the code to use datafusion and delta exists; there just isn't a pre-integrated binary that shows how to hook it up and makes it easy to get up and running.
Other cool features
There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:
- Local catalog support (imagine if you could store your CREATE EXTERNAL TABLE definitions in a file somewhere, .open <filename> style)
- Local parquet metadata cache (imagine being able to cache the parquet metadata for 100s of files in object store in some sort of persistence format so future queries were fast)
- SQL auto completion
- etc.
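To make the local catalog idea a bit more concrete, here is a rough sketch in Rust using only the standard library. Everything in it is an assumption for illustration and not an existing datafusion-cli feature: the `.open <filename>` handling, the helper names `parse_open_command` and `parse_catalog`, and the file format (plain SQL statements terminated by `;`, with full-line `--` comments). A real tool would use a proper SQL parser and hand each statement to the DataFusion engine instead of just collecting strings.

```rust
/// Hypothetical helper: recognize a `.open <filename>` line typed at the
/// prompt and return the filename, or `None` for any other input.
fn parse_open_command(line: &str) -> Option<&str> {
    let rest = line.trim().strip_prefix(".open")?;
    // Require whitespace after `.open` so that e.g. `.openx` is rejected.
    if !rest.starts_with(char::is_whitespace) {
        return None;
    }
    let name = rest.trim();
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}

/// Hypothetical helper: split the contents of a catalog file into individual
/// SQL statements. Assumes every statement ends with `;` and that no string
/// literal contains a semicolon (a simplification a real parser would avoid).
fn parse_catalog(contents: &str) -> Vec<String> {
    // Drop full-line `--` comments before splitting on `;`.
    let without_comments = contents
        .lines()
        .filter(|line| !line.trim_start().starts_with("--"))
        .collect::<Vec<_>>()
        .join("\n");

    without_comments
        .split(';')
        .map(|stmt| stmt.trim().to_string())
        .filter(|stmt| !stmt.is_empty())
        .collect()
}
```

With something like this, typing `.open my_catalog.sql` would read the file (for example with `std::fs::read_to_string`), split it with `parse_catalog`, and run each `CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION '...'` statement so the tables are available without retyping their definitions in every session.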
Open questions
Should the new tool be in the datafusion-contrib organization or the apache organization?
The tradeoffs are that datafusion-contrib could move faster / has less governance overhead, but would also lose the apache community.
I personally suggest we start with this tool in the datafusion-contrib organization and if there is interest we can discuss bringing it back to the apache organization.