various documentation updates #3005

Closed

wants to merge 12 commits
bit of re-arranging and cleanup, removing duplication
kmitchener committed Aug 1, 2022
commit 9cc86c5925948152ad785a18e9fcf8fa4e653521
41 changes: 5 additions & 36 deletions docs/source/cli/index.md
@@ -94,42 +94,7 @@ DataFusion CLI can also be installed via Homebrew (on MacOS). Install it as any

Type `\q` to exit the CLI.

### Registering Parquet Data Sources

Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary to provide schema information for Parquet files.

```sql
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```
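
The same registration can be done from Rust; a minimal sketch, assuming a DataFusion version where `register_parquet` takes a `ParquetReadOptions` argument (the table name and path come from the example above):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // No schema is supplied; it is read from the Parquet file itself.
    ctx.register_parquet("taxi", "/mnt/nyctaxi/tripdata.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT COUNT(*) FROM taxi").await?.show().await?;
    Ok(())
}
```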

### Registering CSV Data Sources

CSV data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. Schema information must be provided, since DataFusion does not automatically infer it when CSV files are queried via SQL.

```sql
CREATE EXTERNAL TABLE test (
c1 VARCHAR NOT NULL,
c2 INT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT NOT NULL,
c5 INT NOT NULL,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '/path/to/aggregate_test_100.csv';
```
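
The equivalent Rust registration needs the same schema, declared with Arrow types; a sketch assuming a recent DataFusion version (the column types mirror the SQL above):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Every CSV column must be declared, just as in the SQL statement.
    let schema = Schema::new(vec![
        Field::new("c1", DataType::Utf8, false),
        Field::new("c2", DataType::Int32, false),
        Field::new("c3", DataType::Int16, false),
        Field::new("c4", DataType::Int16, false),
        Field::new("c5", DataType::Int32, false),
        Field::new("c6", DataType::Int64, false),
        Field::new("c7", DataType::Int16, false),
        Field::new("c8", DataType::Int32, false),
        Field::new("c9", DataType::Int64, false),
        Field::new("c10", DataType::Utf8, false),
        Field::new("c11", DataType::Float32, false),
        Field::new("c12", DataType::Float64, false),
        Field::new("c13", DataType::Utf8, false),
    ]);

    let ctx = SessionContext::new();
    ctx.register_csv(
        "test",
        "/path/to/aggregate_test_100.csv",
        CsvReadOptions::new().schema(&schema).has_header(true),
    )
    .await?;
    ctx.sql("SELECT c1, c12 FROM test LIMIT 5").await?.show().await?;
    Ok(())
}
```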

## CLI Commands
### CLI Commands

Available commands inside DataFusion CLI are:

@@ -143,3 +108,7 @@ Available commands inside DataFusion CLI are:
| `\h` | list available commands |
| `\h function` | get help for specific command |
| `\pset [NAME [VALUE]]` | set option (e.g. `\pset format csv`) |

### Running SQL

See the [SQL Reference](../user-guide/sql/index.rst) for how to register data sources and for the supported SQL.
24 changes: 12 additions & 12 deletions docs/source/specification/roadmap.md
@@ -34,7 +34,7 @@ suggest you start a conversation using a github issue or the
dev@arrow.apache.org mailing list to make review efficient and avoid
surprises.

# DataFusion
## DataFusion

DataFusion's goal is to become the embedded query engine of choice
for new analytic applications, by leveraging the unique features of
@@ -47,7 +47,7 @@ to provide:
4. A Procedural API for programmatically creating and running execution plans
5. High performance, data race free, ergonomic extensibility points at every layer

## Additional SQL Language Features
### Additional SQL Language Features

- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/master/README.md#status)
@@ -56,32 +56,32 @@ to provide:
- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)

## Query Optimizer
### Query Optimizer

- More sophisticated cost-based optimizer for join ordering
- Implement advanced query optimization framework (Tokomak) #440
- Finer optimizations for group by and aggregate functions

## Datasources
### Datasources

- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
- Improve performance of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)

## Runtime / Infrastructure
### Runtime / Infrastructure

- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)

## Resource Management
### Resource Management

- Finer-grained control and limits on runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#64](https://github.com/apache/arrow-datafusion/issues/64)

## Python Interface
### Python Interface

TBD

## DataFusion CLI (`datafusion-cli`)
### DataFusion CLI (`datafusion-cli`)

Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).

@@ -91,7 +91,7 @@ Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
- publishing to apt, brew, and possibly the NuGet registry so that people can use it more easily
- adopt a shorter name, like dfcli?

# Ballista
## Ballista

Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
@@ -101,16 +101,16 @@ Having Ballista as part of the DataFusion codebase helps ensure that DataFusion
compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
remain language-agnostic so that executors can be built in languages other than Rust.

## Ballista Roadmap
### Ballista Roadmap

## Move query scheduler into DataFusion
### Move query scheduler into DataFusion

The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
the entire query at once but breaks it down into a directed acyclic graph (DAG) of stages and executes a
configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.

## Implement execution-time cost-based optimizations based on statistics
### Implement execution-time cost-based optimizations based on statistics

After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
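
A toy illustration of one such optimization, using exact input sizes observed at runtime to pick the build side of a hash join (hypothetical code, not DataFusion's actual API):

```rust
/// Toy illustration: after the input stages have executed, exact row
/// counts are known, so the scheduler can place the smaller input on
/// the build side of a hash join. (Hypothetical; not DataFusion's API.)
#[derive(Debug)]
enum BuildSide {
    Left,
    Right,
}

fn choose_build_side(left_rows: u64, right_rows: u64) -> BuildSide {
    if left_rows <= right_rows {
        BuildSide::Left
    } else {
        BuildSide::Right
    }
}

fn main() {
    // Row counts gathered from two completed query stages.
    println!("{:?}", choose_build_side(1_000_000, 50_000)); // Right
}
```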
4 changes: 2 additions & 2 deletions docs/source/user-guide/example-usage.md
@@ -28,7 +28,7 @@ datafusion = "10"
tokio = "1.0"
```

## Run a SQL query against data stored in a CSV:
## Run a SQL query against data stored in a CSV

```rust
use datafusion::prelude::*;
@@ -48,7 +48,7 @@ async fn main() -> datafusion::error::Result<()> {
}
```
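
The body of this example is collapsed in the diff above. A minimal end-to-end sketch of the same flow, assuming the DataFusion 10 API and a hypothetical `tests/example.csv` with columns `a` and `b`:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register the CSV file as a table, then query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())
        .await?;

    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100").await?;

    // Execute the plan and print results to stdout.
    df.show().await?;
    Ok(())
}
```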

## Use the DataFrame API to process data stored in a CSV:
## Use the DataFrame API to process data stored in a CSV

```rust
use datafusion::prelude::*;
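// The remainder of this example is collapsed in the diff. A minimal
// sketch of the DataFrame-API flow, assuming the DataFusion 10 API and
// the same hypothetical tests/example.csv as above:

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Read the CSV into a DataFrame instead of registering a table.
    let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new()).await?;

    // Roughly: SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a
    let df = df
        .filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?;

    df.show().await?;
    Ok(())
}
```
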
8 changes: 3 additions & 5 deletions docs/source/user-guide/sql/ddl.md
@@ -17,12 +17,10 @@
under the License.
-->

# DDL
## DDL

DataFusion is a _query_ engine and supports DDL only for modifying the catalog and registering external tables,
creating tables in memory, or creating views. In the DataFusion CLI, these changes are not persisted in any way, taking
place only in memory. The library is the same -- the catalog of tables is not persisted, unless the persistence is built
into the application using the library. You cannot insert, update, or delete data using DataFusion SQL.
creating tables in memory, or creating views. You cannot insert, update, or delete data using DataFusion SQL.

### CREATE DATABASE

@@ -81,7 +79,7 @@ Drops the table from DataFusion's catalog.
DROP TABLE test.schema.t1;
```

## CREATE EXTERNAL TABLE
### CREATE EXTERNAL TABLE

Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary
to provide schema information for Parquet files.
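
A sketch of issuing this DDL through the Rust API, assuming a recent DataFusion version (the statement mirrors the Parquet example earlier in this PR):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // DDL goes through the same SQL entry point; the schema is read
    // from the Parquet file itself, so none is declared here.
    ctx.sql(
        "CREATE EXTERNAL TABLE taxi \
         STORED AS PARQUET \
         LOCATION '/mnt/nyctaxi/tripdata.parquet'",
    )
    .await?;
    Ok(())
}
```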
2 changes: 1 addition & 1 deletion docs/source/user-guide/sql/index.rst
@@ -22,7 +22,7 @@ SQL Reference
:maxdepth: 2

sql_status
select
ddl
select
aggregate_functions
DataFusion Functions <datafusion-functions>