Proposal: Create dfdb, a new CLI different than datafusion-cli with pre-built integrations #11979

@alamb

TLDR

  • Keep datafusion-cli in the apache/datafusion repo
  • Make a new repo with a new CLI called dfdb (or datafusion-cli++ or dfcli) which is purposely designed for running queries against a wide variety of pre-integrated sources

Problem Statement

As of today, datafusion-cli (docs) serves two roles:

  1. A debugging / testing tool for the DataFusion query engine developers
  2. A CLI tool for actually doing useful processing of files (locally and remotely using object store), similar to the duckdb CLI tool

It is really sweet to have a CLI that lets you query a directory of parquet files:

DataFusion CLI v41.0.0
>
> select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| WatchID             | EventDate | URL                                                                                                  |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| 6904841588848398438 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
...
| 7551542980199423249 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.059 seconds.

However, similar to the discussion we have had with datafusion-python, this dual role leads to a tension between keeping the core lean and easy to embed (e.g. fewer dependencies) and making a better CLI experience.

Examples of Friction

I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example:

  1. Hugging Face: Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792 from @xinlifoobar
  2. FlightSQL: Generic FlightTableFactory with a default FlightSqlDriver #11938 from @ccciudatu

I realize I have been partly responsible for this mess and for that I apologize.

Proposal

I propose resolving this conflict by creating a new repository for the "CLI tool people actually use"

We would keep datafusion-cli as it is: a relatively small, thin wrapper around the core engine. I don't think we should remove features, but we also wouldn't add them (other than what is added to the engine by default).

We would add many new features / capabilities to this dfdb tool.

Examples of new features

There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):

@philippemnoel actually refers to the lack of built-in Apache Iceberg support in his blog about switching to duckdb. This is sad given that all the code to use datafusion and delta exists; there just isn't a pre-integrated binary that shows how to hook it up and makes it easy to get up and running.

Other cool features

There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:

  1. Local catalog support (imagine if you could store your CREATE EXTERNAL TABLE definitions in a file somewhere, .open <filename> style)
  2. Local parquet metadata cache (imagine being able to cache the parquet metadata for 100s of files in object store in some sort of persistent format so future queries are fast)
  3. SQL auto completion
  4. etc.
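
To make the local catalog idea a bit more concrete, here is a rough sketch of what loading such a file could look like. Everything in it is an assumption for illustration: the `load_catalog` helper, the `catalog.sql` file name, and the file format (plain SQL statements separated by `;` with `--` comments) are hypothetical, not an existing datafusion-cli feature. The sketch only reads the statements back; a real tool would hand each one to the query engine at startup.

```python
import tempfile
from pathlib import Path


def load_catalog(path):
    """Return the SQL statements stored in a catalog file (hypothetical format).

    Naive on purpose: '--' comments are stripped, statements are split on ';',
    and runs of whitespace are collapsed. This would mangle a ';' or '--'
    that appears inside a quoted string, so a real tool should use a SQL parser.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Remove full-line and trailing '--' comments.
    lines = [line.split("--", 1)[0] for line in text.splitlines()]
    statements = []
    for chunk in "\n".join(lines).split(";"):
        stmt = " ".join(chunk.split())  # collapse newlines and extra spaces
        if stmt:
            statements.append(stmt)
    return statements


# Example catalog contents (table names and locations are made up).
CATALOG_TEXT = """\
-- tables registered at startup
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned/';

CREATE EXTERNAL TABLE events
STORED AS CSV
LOCATION 'events.csv';  -- trailing comment
"""

with tempfile.TemporaryDirectory() as tmp:
    catalog_path = Path(tmp) / "catalog.sql"
    catalog_path.write_text(CATALOG_TEXT, encoding="utf-8")
    statements = load_catalog(catalog_path)

for stmt in statements:
    # A real CLI would execute each statement against its session here
    # instead of printing it.
    print(stmt)
```

Keeping the catalog as plain SQL means the same file could also be pasted into the existing CLI by hand, which is one reason to prefer it over a custom format, though that is a design guess rather than part of the proposal.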

Open questions

Should the new tool be in the datafusion-contrib organization or the apache organization?

The tradeoff is that datafusion-contrib could move faster and has less governance overhead, but the tool would also lose the Apache community.

I personally suggest we start with this tool in the datafusion-contrib organization and if there is interest we can discuss bringing it back to the apache organization.
