Proposal: Create `dfdb`, a new CLI different than `datafusion-cli` with pre-built integrations

# TLDR
* Keep `datafusion-cli` in the apache/datafusion repo
* Make a new repo with a new CLI called `dfdb` (or `datafusion-cli++` or `dfcli`) that is purposely designed for running queries against a wide variety of pre-integrated sources

# Problem Statement

As of today, `datafusion-cli` ([docs](https://datafusion.apache.org/user-guide/cli/index.html)) serves two roles:
1. A debugging / testing tool for the DataFusion query engine developers
2. A CLI tool for actually doing useful processing of files (locally and remotely using object store), similar to the `duckdb` CLI tool

It is really sweet to have a CLI that lets you query a directory of parquet files:

```sql
DataFusion CLI v41.0.0
>
> select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| WatchID             | EventDate | URL                                                                                                  |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| 6904841588848398438 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
...
| 7551542980199423249 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.059 seconds.
```

However, similarly to the [discussion we have had with `datafusion-python`](https://github.com/apache/datafusion-python/issues/440), this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experience.

## Examples of Friction

I have recently seen some PRs that are basically integrations that would make `datafusion-cli` a better end-user tool, but bring more dependencies and complexity to DataFusion. For example:
1. Hugging Face: https://github.com/apache/datafusion/pull/10792 from @xinlifoobar
2. FlightSQL: https://github.com/apache/datafusion/pull/11938 from @ccciudatu

I realize I have been partly responsible for this mess and for that I apologize. 

# Proposal
I propose resolving this conflict by creating a new repository for the "CLI tool people actually use".

We would keep `datafusion-cli` as it is: a relatively small, thin wrapper around the core engine. I don't think we should remove features, but we also wouldn't add them (other than what is added to the engine by default).

We would add many new features / capabilities to this `dfdb` tool.


## Examples of new features
There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):
* Apache Iceberg: https://github.com/apache/iceberg-rust
* delta-rs: https://github.com/delta-io/delta-rs
* Apache Hudi: https://github.com/apache/hudi-rs
* The providers in https://github.com/datafusion-contrib/datafusion-table-providers from @phillipleblanc et al


@philippemnoel actually refers to the [lack of built-in Apache Iceberg support in his blog](https://blog.paradedb.com/pages/iceberg_lakehouse) about switching to using duckdb. This is sad given that all the code to use DataFusion and Delta exists; there just isn't a pre-integrated binary that shows how to hook it up and makes it easy to get up and running.

## Other cool features 
There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:

1. Local catalog support (imagine if you could store your `CREATE EXTERNAL TABLE` definitions in a file somewhere, `.open <filename>` style)
2. Local parquet metadata cache (imagine being able to cache the parquet metadata for 100s of files in object store in some sort of persistent format so future queries are fast)
3. SQL auto-completion
4. etc. 
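
To make the local catalog idea a bit more concrete, here is a rough sketch of what such a catalog file might contain. This is only an illustration, not a design: the file name `my_catalog.sql`, the `sales_2024` table, and the `s3://example-bucket/...` location are all made up, and `.open` does not exist today. The statements themselves use the `CREATE EXTERNAL TABLE` syntax that `datafusion-cli` already accepts.

```sql
-- my_catalog.sql (hypothetical file name)
-- A plain list of table definitions the CLI could replay at startup,
-- e.g. via a not-yet-existing `.open my_catalog.sql` command.

-- Local directory of parquet files (same table as the example query above)
CREATE EXTERNAL TABLE hits_partitioned
STORED AS PARQUET
LOCATION 'hits_partitioned/';

-- Remote parquet files in object store (bucket and path are placeholders)
CREATE EXTERNAL TABLE sales_2024
STORED AS PARQUET
LOCATION 's3://example-bucket/sales/2024/';
```

With something like this, a new session would not need to re-type the definitions before running `select ... from hits_partitioned`.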


# Open questions
## Should the new tool be in the [`datafusion-contrib`](https://github.com/datafusion-contrib) organization or the `apache` organization?

The tradeoffs are that `datafusion-contrib` could move faster / has less governance overhead, but would also lose the apache community.

I personally suggest we start with this tool in the `datafusion-contrib` organization and if there is interest we can discuss bringing it back to the apache organization.

