Commit 4c7833e

alamb and waynexia authored
[DOCS]: consolidate doc site content simplify navbar (#5962)
* [DOCS]: consolidate doc site content simplify navbar
* prettier
* Update docs/source/user-guide/faq.md (Co-authored-by: Ruihang Xia <waynestxia@gmail.com>)
* Update versions to latest
* remove reundant example
* update duckdb link and polars description
* update velox link
* prettier

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
1 parent 826001d commit 4c7833e

12 files changed: +170 -299 lines changed

docs/source/contributor-guide/architecture.md

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -20,7 +20,8 @@
2020
# Architecture
2121

2222
DataFusion's code structure and organization is described in the
23-
[Crate Documentation], to keep it as close to the source as
24-
possible.
23+
[crates.io documentation], to keep it as close to the source as
24+
possible. You can find the most up-to-date version in the [source code].
2525

26-
[crate documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
26+
[crates.io documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
27+
[source code]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/lib.rs

docs/source/index.rst

Lines changed: 3 additions & 9 deletions
@@ -37,10 +37,9 @@ community.
3737
:maxdepth: 1
3838
:caption: Links
3939

40-
Issue tracker <https://github.com/apache/arrow-datafusion/issues>
40+
Github and Issue Tracker <https://github.com/apache/arrow-datafusion>
4141
crates.io <https://crates.io/crates/datafusion>
42-
API Docs <https://docs.rs/datafusion/21.1.0/datafusion/>
43-
Github <https://github.com/apache/arrow-datafusion>
42+
API Docs <https://docs.rs/datafusion/latest/datafusion/>
4443
Code of conduct <https://github.com/apache/arrow-datafusion/blob/main/CODE_OF_CONDUCT.md>
4544

4645
.. _toc.guide:
@@ -50,22 +49,17 @@ community.
5049

5150
user-guide/introduction
5251
user-guide/example-usage
53-
user-guide/users
54-
user-guide/comparison
55-
user-guide/integration
56-
user-guide/library
5752
user-guide/cli
5853
user-guide/dataframe
5954
user-guide/expressions
6055
user-guide/sql/index
6156
user-guide/configs
6257
user-guide/faq
63-
Rust Crate Documentation <https://docs.rs/crate/datafusion/>
6458

6559
.. _toc.contributor-guide:
6660

6761
.. toctree::
68-
:maxdepth: 2
62+
:maxdepth: 1
6963
:caption: Contributor Guide
7064

7165
contributor-guide/index

docs/source/user-guide/cli.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
1717
under the License.
1818
-->
1919

20-
# DataFusion Command-line SQL Utility
20+
# `datafusion-cli`
2121

2222
The DataFusion CLI is a command-line interactive SQL utility for executing
2323
queries against any supported data files. It is a convenient way to

docs/source/user-guide/comparison.md

Lines changed: 0 additions & 52 deletions
This file was deleted.

docs/source/user-guide/example-usage.md

Lines changed: 59 additions & 2 deletions
@@ -26,7 +26,7 @@ In this example some simple processing is performed on the [`example.csv`](../..
2626
Add the following to your `Cargo.toml` file:
2727

2828
```toml
29-
datafusion = "11.0"
29+
datafusion = "22"
3030
tokio = "1.0"
3131
```
3232

@@ -81,7 +81,7 @@ async fn main() -> datafusion::error::Result<()> {
8181
+---+--------+
8282
```
8383

84-
# Identifiers and Capitalization
84+
## Identifiers and Capitalization
8585

8686
Please be aware that all unquoted identifiers are effectively made lower-case in SQL, so if your CSV file has column names with capital letters (e.g. `Name`), you must put the column name in double quotes or the examples won't work.
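
To make the quoting rule above concrete, here is a minimal Rust sketch of identifier normalization. It illustrates the rule only and is not DataFusion's actual implementation: `normalize_ident` and `resolves_to` are hypothetical helper names, and the sketch assumes that quoted identifiers keep their case, with `""` acting as an escaped quote.

```rust
/// Hypothetical helper (not a DataFusion API): mimics the normalization
/// rule described above. Unquoted identifiers are lower-cased; identifiers
/// wrapped in double quotes keep their original case.
fn normalize_ident(raw: &str) -> String {
    let quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if quoted {
        // Drop the surrounding quotes and unescape doubled quotes.
        raw[1..raw.len() - 1].replace("\"\"", "\"")
    } else {
        raw.to_lowercase()
    }
}

/// Hypothetical helper: true when the identifier, as written in a query,
/// would match the column name as it appears in the file header.
fn resolves_to(sql_ident: &str, column: &str) -> bool {
    normalize_ident(sql_ident) == column
}
```

Under these assumptions, for a CSV column named `Name`, `resolves_to("Name", "Name")` is false because the unquoted identifier becomes `name`, while `resolves_to("\"Name\"", "Name")` is true.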
8787

@@ -141,3 +141,60 @@ async fn main() -> datafusion::error::Result<()> {
141141
| 1 | 2 |
142142
+---+--------+
143143
```
144+
145+
## Extensibility
146+
147+
DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
148+
149+
- [x] User Defined Functions (UDFs)
150+
- [x] User Defined Aggregate Functions (UDAFs)
151+
- [x] User Defined Table Source (`TableProvider`) for tables
152+
- [x] User Defined `Optimizer` passes (plan rewrites)
153+
- [x] User Defined `LogicalPlan` nodes
154+
- [x] User Defined `ExecutionPlan` nodes
155+
156+
## Rust Version Compatibility
157+
158+
This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
159+
160+
## Optimized Configuration
161+
162+
For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
163+
worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
164+
165+
```toml
166+
[dependencies]
167+
datafusion = { version = "22.0" , features = ["simd"]}
168+
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
169+
snmalloc-rs = "0.2"
170+
171+
[profile.release]
172+
lto = true
173+
codegen-units = 1
174+
```
175+
176+
Then, in `main.rs`, update the memory allocator with the below after your imports:
177+
178+
```rust
179+
use datafusion::prelude::*;
180+
181+
#[global_allocator]
182+
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
183+
184+
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
185+
Ok(())
186+
}
187+
```
188+
189+
Finally, building with the `simd` feature requires the nightly Rust toolchain.
190+
191+
```shell
192+
rustup toolchain install nightly
193+
```
194+
195+
Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally
196+
with `native` or at least `avx2`.
197+
198+
```shell
199+
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --release
200+
```
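
The `#[global_allocator]` attribute used in the allocator step above is a standard Rust mechanism and accepts any type that implements `GlobalAlloc`. As a dependency-free sketch of that mechanism (not the snmalloc setup itself), the following wraps the standard `System` allocator and counts allocations. `CountingAlloc`, `ALLOCATIONS`, and `allocation_count` are hypothetical names chosen for illustration.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper for illustration: forwards every request to the
/// system allocator and counts how many allocations were made.
struct CountingAlloc;

/// Number of `alloc` calls observed so far.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate the actual allocation to the system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Registers the wrapper as the allocator for the whole program.
#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

/// Returns the number of allocations counted so far.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}
```

A program can register only one global allocator, so a declaration like this would replace, not accompany, the `snmalloc_rs::SnMalloc` one. Comparing `allocation_count()` before and after a heap allocation is a quick way to confirm that the registered allocator is actually in use.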

docs/source/user-guide/expressions.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
1717
under the License.
1818
-->
1919

20-
# Expressions
20+
# Expression API
2121

2222
DataFrame methods such as `select` and `filter` accept one or more logical expressions and there are many functions
2323
available for creating logical expressions. These are documented below.

docs/source/user-guide/faq.md

Lines changed: 34 additions & 0 deletions
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
2929
for parallel query execution.
3030

3131
[Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
32+
33+
# How does DataFusion Compare with `XYZ`?
34+
35+
When compared to similar systems, DataFusion typically is:
36+
37+
1. Targeted at developers, rather than end users / data scientists.
38+
2. Designed to be embedded, rather than a complete file-based SQL system.
39+
3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
40+
4. Implemented in `Rust`, rather than `C/C++`.
41+
42+
Here is a comparison with similar projects that may help understand
43+
when DataFusion might be suitable or unsuitable for your needs:
44+
45+
- [DuckDB](https://www.duckdb.org) is an open source, in-process analytic database.
46+
Like DataFusion, it supports very fast execution, both from its custom file format
47+
and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
48+
is primarily used directly by users as a serverless database and query system rather
49+
than as a library for building such database systems.
50+
51+
- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
52+
libraries at the time of writing. Like DataFusion, it is also
53+
written in Rust and uses the Apache Arrow memory model, but unlike
54+
DataFusion it is not designed with as many extension points.
55+
56+
- [Facebook Velox](https://github.com/facebookincubator/velox)
57+
is an execution engine. Like DataFusion, Velox aims to
58+
provide a reusable foundation for building database-like systems. Unlike DataFusion,
59+
it is written in C/C++ and does not include a SQL frontend or planning / optimization
60+
framework.
61+
62+
- [Databend](https://github.com/datafuselabs/databend) is a complete
63+
database system. Like DataFusion it is also written in Rust and
64+
utilizes the Apache Arrow memory model, but unlike DataFusion it
65+
targets end-users rather than developers of other database systems.

docs/source/user-guide/integration.md

Lines changed: 0 additions & 35 deletions
This file was deleted.

docs/source/user-guide/introduction.md

Lines changed: 67 additions & 1 deletion
@@ -17,7 +17,7 @@
1717
under the License.
1818
-->
1919

20-
# Features, and Usecases
20+
# Introduction
2121

2222
DataFusion is a very fast, extensible query engine for building
2323
high-quality data-centric systems in [Rust](http://rustlang.org),
@@ -66,6 +66,72 @@ features, and avoid reimplementing general (but still necessary)
6666
features such as an expression representation, standard optimizations,
6767
execution plans, file format support, etc.
6868

69+
## Known Users
70+
71+
Here are some of the projects known to use DataFusion:
72+
73+
- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine
74+
- [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core
75+
- [CeresDB](https://github.com/CeresDB/ceresdb) Distributed Time-Series Database
76+
- [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
77+
- [CnosDB](https://github.com/cnosdb/cnosdb) Open Source Distributed Time Series Database
78+
- [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
79+
- [Dask SQL](https://github.com/dask-contrib/dask-sql) Distributed SQL query engine in Python
80+
- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion
81+
- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake
82+
- [Flock](https://github.com/flock-lab/flock)
83+
- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) Open Source & Cloud Native Distributed Time Series Database
84+
- [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
85+
- [Kamu](https://github.com/kamu-data/kamu-cli/) Planet-scale streaming data pipeline
86+
- [Parseable](https://github.com/parseablehq/parseable) Log storage and observability platform
87+
- [qv](https://github.com/timvw/qv) Quickly view your data
88+
- [ROAPI](https://github.com/roapi/roapi)
89+
- [Seafowl](https://github.com/splitgraph/seafowl) CDN-friendly analytical database
90+
- [Synnada](https://synnada.ai/) Streaming-first framework for data products
91+
- [Tensorbase](https://github.com/tensorbase/tensorbase)
92+
- [VegaFusion](https://vegafusion.io/) Server-side acceleration for the [Vega](https://vega.github.io/) visualization grammar
93+
- [ZincObserve](https://github.com/zinclabs/zincobserve) Distributed cloud native observability platform
94+
95+
[ballista]: https://github.com/apache/arrow-ballista
96+
[blaze]: https://github.com/blaze-init/blaze
97+
[ceresdb]: https://github.com/CeresDB/ceresdb
98+
[cloudfuse buzz]: https://github.com/cloudfuse-io/buzz-rust
99+
[cnosdb]: https://github.com/cnosdb/cnosdb
100+
[cube store]: https://github.com/cube-js/cube.js/tree/master/rust
101+
[dask sql]: https://github.com/dask-contrib/dask-sql
102+
[datafusion-tui]: https://github.com/datafusion-contrib/datafusion-tui
103+
[delta-rs]: https://github.com/delta-io/delta-rs
104+
[flock]: https://github.com/flock-lab/flock
105+
[kamu]: https://github.com/kamu-data/kamu-cli
106+
[greptime db]: https://github.com/GreptimeTeam/greptimedb
107+
[influxdb iox]: https://github.com/influxdata/influxdb_iox
108+
[parseable]: https://github.com/parseablehq/parseable
109+
[prql-query]: https://github.com/prql/prql-query
110+
[qv]: https://github.com/timvw/qv
111+
[roapi]: https://github.com/roapi/roapi
112+
[seafowl]: https://github.com/splitgraph/seafowl
113+
[synnada]: https://synnada.ai/
114+
[tensorbase]: https://github.com/tensorbase/tensorbase
115+
[vegafusion]: https://vegafusion.io/
116+
[zincobserve]: https://github.com/zinclabs/zincobserve "if you know of another project, please submit a PR to add a link!"
117+
118+
## Integrations and Extensions
119+
120+
There are a number of community projects that extend DataFusion or
121+
provide integrations with other systems.
122+
123+
### Language Bindings
124+
125+
- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c)
126+
- [datafusion-python](https://github.com/apache/arrow-datafusion-python)
127+
- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby)
128+
- [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
129+
130+
### Integrations
131+
132+
- [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
133+
- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue)
134+
69135
## Why DataFusion?
70136

71137
- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
