various documentation updates #3005

Closed

wants to merge 12 commits
bit of re-arranging and cleanup, removing duplication
kmitchener committed Aug 1, 2022
commit 9cc86c5925948152ad785a18e9fcf8fa4e653521
41 changes: 5 additions & 36 deletions docs/source/cli/index.md
@@ -94,42 +94,7 @@ DataFusion CLI can also be installed via Homebrew (on MacOS). Install it as any

Type `\q` to exit the CLI.

### Registering Parquet Data Sources

Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary to provide schema information for Parquet files.

```sql
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```
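
The same registration can be done from Rust; a minimal sketch, assuming a DataFusion version where `register_parquet` takes a `ParquetReadOptions` argument (the table name and path come from the example above):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // No schema is supplied; it is read from the Parquet file itself.
    ctx.register_parquet("taxi", "/mnt/nyctaxi/tripdata.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT COUNT(*) FROM taxi").await?.show().await?;
    Ok(())
}
```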

### Registering CSV Data Sources

CSV data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. Schema information must be provided, since DataFusion does not automatically infer it when CSV files are queried via SQL.

```sql
CREATE EXTERNAL TABLE test (
c1 VARCHAR NOT NULL,
c2 INT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT NOT NULL,
c5 INT NOT NULL,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '/path/to/aggregate_test_100.csv';
```
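
The equivalent Rust registration needs the same schema, declared with Arrow types; a sketch assuming a recent DataFusion version (the column types mirror the SQL above):

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Every CSV column must be declared, just as in the SQL statement.
    let schema = Schema::new(vec![
        Field::new("c1", DataType::Utf8, false),
        Field::new("c2", DataType::Int32, false),
        Field::new("c3", DataType::Int16, false),
        Field::new("c4", DataType::Int16, false),
        Field::new("c5", DataType::Int32, false),
        Field::new("c6", DataType::Int64, false),
        Field::new("c7", DataType::Int16, false),
        Field::new("c8", DataType::Int32, false),
        Field::new("c9", DataType::Int64, false),
        Field::new("c10", DataType::Utf8, false),
        Field::new("c11", DataType::Float32, false),
        Field::new("c12", DataType::Float64, false),
        Field::new("c13", DataType::Utf8, false),
    ]);

    let ctx = SessionContext::new();
    ctx.register_csv(
        "test",
        "/path/to/aggregate_test_100.csv",
        CsvReadOptions::new().schema(&schema).has_header(true),
    )
    .await?;
    ctx.sql("SELECT c1, c12 FROM test LIMIT 5").await?.show().await?;
    Ok(())
}
```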

## CLI Commands
### CLI Commands

Available commands inside DataFusion CLI are:

@@ -143,3 +108,7 @@ Available commands inside DataFusion CLI are:
| `\h` | list available commands |
| `\h function` | get help for specific command |
| `\pset [NAME [VALUE]]` | set option (e.g. `\pset format csv`) |

### Running SQL

See the [SQL Reference](../user-guide/sql/index.rst) for how to register data sources and for the supported SQL.
24 changes: 12 additions & 12 deletions docs/source/specification/roadmap.md
@@ -34,7 +34,7 @@ suggest you start a conversation using a github issue or the
dev@arrow.apache.org mailing list to make review efficient and avoid
surprises.

# DataFusion
## DataFusion

DataFusion's goal is to become the embedded query engine of choice
for new analytic applications, by leveraging the unique features of
@@ -47,7 +47,7 @@ to provide:
4. A Procedural API for programmatically creating and running execution plans
5. High performance, data race free, ergonomic extensibility points at every layer

## Additional SQL Language Features
### Additional SQL Language Features

- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/master/README.md#status)
@@ -56,32 +56,32 @@ to provide:
- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)

## Query Optimizer
### Query Optimizer

- More sophisticated cost-based optimizer for join ordering
- Implement advanced query optimization framework (Tokomak) #440
- Finer optimizations for group by and aggregate functions

## Datasources
### Datasources

- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
- Improve performance of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)

## Runtime / Infrastructure
### Runtime / Infrastructure

- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)

## Resource Management
### Resource Management

- Finer-grained control and limits on runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#64](https://github.com/apache/arrow-datafusion/issues/64)

## Python Interface
### Python Interface

TBD

## DataFusion CLI (`datafusion-cli`)
### DataFusion CLI (`datafusion-cli`)

Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).

@@ -91,7 +91,7 @@ Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
- publishing to apt, brew, and possibly the NuGet registry so that people can use it more easily
- adopt a shorter name, like dfcli?

# Ballista
## Ballista

Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
@@ -101,16 +101,16 @@ Having Ballista as part of the DataFusion codebase helps ensure that DataFusion
compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
remain language-agnostic so that executors can be built in languages other than Rust.

## Ballista Roadmap
### Ballista Roadmap

## Move query scheduler into DataFusion
### Move query scheduler into DataFusion

The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
the entire query at once but breaks it down into a directed acyclic graph (DAG) of stages and executes a
configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.

## Implement execution-time cost-based optimizations based on statistics
### Implement execution-time cost-based optimizations based on statistics

After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
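
A toy illustration of one such optimization, using exact input sizes observed at runtime to pick the build side of a hash join (hypothetical code, not DataFusion's actual API):

```rust
/// Toy illustration: after the input stages have executed, exact row
/// counts are known, so the scheduler can place the smaller input on
/// the build side of a hash join. (Hypothetical; not DataFusion's API.)
#[derive(Debug)]
enum BuildSide {
    Left,
    Right,
}

fn choose_build_side(left_rows: u64, right_rows: u64) -> BuildSide {
    if left_rows <= right_rows {
        BuildSide::Left
    } else {
        BuildSide::Right
    }
}

fn main() {
    // Row counts gathered from two completed query stages.
    println!("{:?}", choose_build_side(1_000_000, 50_000)); // Right
}
```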
4 changes: 2 additions & 2 deletions docs/source/user-guide/example-usage.md
@@ -28,7 +28,7 @@ datafusion = "10"
tokio = "1.0"
```

## Run a SQL query against data stored in a CSV:
## Run a SQL query against data stored in a CSV

```rust
use datafusion::prelude::*;
@@ -48,7 +48,7 @@ async fn main() -> datafusion::error::Result<()> {
}
```
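
The body of this example is collapsed in the diff above. A minimal end-to-end sketch of the same flow, assuming the DataFusion 10 API and a hypothetical `tests/example.csv` with columns `a` and `b`:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register the CSV file as a table, then query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())
        .await?;

    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100").await?;

    // Execute the plan and print results to stdout.
    df.show().await?;
    Ok(())
}
```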

## Use the DataFrame API to process data stored in a CSV:
## Use the DataFrame API to process data stored in a CSV

```rust
use datafusion::prelude::*;
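// The remainder of this example is collapsed in the diff. A minimal
// sketch of the DataFrame-API flow, assuming the DataFusion 10 API and
// the same hypothetical tests/example.csv as above:

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Read the CSV into a DataFrame instead of registering a table.
    let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new()).await?;

    // Roughly: SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a
    let df = df
        .filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?;

    df.show().await?;
    Ok(())
}
```
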
8 changes: 3 additions & 5 deletions docs/source/user-guide/sql/ddl.md
@@ -17,12 +17,10 @@
under the License.
-->

# DDL
## DDL

DataFusion is a _query_ engine and supports DDL only for modifying the catalog and registering external tables,
creating tables in memory, or creating views. In the DataFusion CLI, these changes are not persisted in any way, taking
place only in memory. The library is the same -- the catalog of tables is not persisted, unless the persistence is built
into the application using the library. You cannot insert, update, or delete data using DataFusion SQL.
creating tables in memory, or creating views. You cannot insert, update, or delete data using DataFusion SQL.

### CREATE DATABASE

@@ -81,7 +79,7 @@ Drops the table from DataFusion's catalog.
DROP TABLE test.schema.t1;
```

## CREATE EXTERNAL TABLE
### CREATE EXTERNAL TABLE

Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary
to provide schema information for Parquet files.
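
A sketch of issuing this DDL through the Rust API, assuming a recent DataFusion version (the statement mirrors the Parquet example earlier in this PR):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // DDL goes through the same SQL entry point; the schema is read
    // from the Parquet file itself, so none is declared here.
    ctx.sql(
        "CREATE EXTERNAL TABLE taxi \
         STORED AS PARQUET \
         LOCATION '/mnt/nyctaxi/tripdata.parquet'",
    )
    .await?;
    Ok(())
}
```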
2 changes: 1 addition & 1 deletion docs/source/user-guide/sql/index.rst
@@ -22,7 +22,7 @@ SQL Reference
:maxdepth: 2

sql_status
select
ddl
select
aggregate_functions
DataFusion Functions <datafusion-functions>