|
16 | 16 | // under the License. |
17 | 17 | #![warn(missing_docs, clippy::needless_borrow)] |
18 | 18 |
|
19 | | -//! [DataFusion](https://github.com/apache/arrow-datafusion) |
20 | | -//! is an extensible query execution framework that uses |
21 | | -//! [Apache Arrow](https://arrow.apache.org) as its in-memory format. |
| 19 | +//! [DataFusion] is an extensible query engine written in Rust that |
| 20 | +//! uses [Apache Arrow] as its in-memory format. DataFusion's [use |
| 21 | +//! cases] include building very fast database and analytic systems, |
| 22 | +//! customized to particular workloads. |
22 | 23 | //! |
23 | | -//! DataFusion supports both an SQL and a DataFrame API for building logical query plans |
24 | | -//! as well as a query optimizer and execution engine capable of parallel execution |
25 | | -//! against partitioned data sources (CSV and Parquet) using threads. |
| 24 | +//! "Out of the box," DataFusion quickly runs complex [SQL] and |
| 25 | +//! [`DataFrame`] queries using a sophisticated query planner, a columnar, |
| 26 | +//! multi-threaded, vectorized execution engine, and partitioned data |
| 27 | +//! sources (Parquet, CSV, JSON, and Avro). |
26 | 28 | //! |
27 | | -//! Below is an example of how to execute a query against data stored |
28 | | -//! in a CSV file using a [`DataFrame`](dataframe::DataFrame): |
| 29 | +//! DataFusion can also be easily customized to support additional |
| 30 | +//! data sources, query languages, functions, custom operators and |
| 31 | +//! more. |
| 32 | +//! |
| 33 | +//! [DataFusion]: https://arrow.apache.org/datafusion/ |
| 34 | +//! [Apache Arrow]: https://arrow.apache.org |
| 35 | +//! [use cases]: https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases |
| 36 | +//! [SQL]: https://arrow.apache.org/datafusion/user-guide/sql/index.html |
| 37 | +//! [`DataFrame`]: dataframe::DataFrame |
| 38 | +//! |
| 39 | +//! # Examples |
| 40 | +//! |
| 41 | +//! The main entry point for interacting with DataFusion is the |
| 42 | +//! [`SessionContext`]. |
| 43 | +//! |
| 44 | +//! [`SessionContext`]: execution::context::SessionContext |
| 45 | +//! |
| 46 | +//! ## DataFrame |
| 47 | +//! |
| 48 | +//! To execute a query against data stored |
| 49 | +//! in a CSV file using a [`DataFrame`]: |
29 | 50 | //! |
30 | 51 | //! ```rust |
31 | 52 | //! # use datafusion::prelude::*; |
|
64 | 85 | //! # } |
65 | 86 | //! ``` |
66 | 87 | //! |
67 | | -//! and how to execute a query against a CSV using SQL: |
| 88 | +//! ## SQL |
| 89 | +//! |
| 90 | +//! To execute a query against a CSV file using [SQL]: |
68 | 91 | //! |
69 | 92 | //! ``` |
70 | 93 | //! # use datafusion::prelude::*; |
|
100 | 123 | //! # } |
101 | 124 | //! ``` |
102 | 125 | //! |
103 | | -//! ## Parse, Plan, Optimize, Execute |
| 126 | +//! ## More Examples |
| 127 | +//! |
| 128 | +//! There are many additional annotated examples of using DataFusion in the [datafusion-examples] directory. |
| 129 | +//! |
| 130 | +//! [datafusion-examples]: https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples |
| 131 | +//! |
| 132 | +//! ## Customization and Extension |
| 133 | +//! |
| 134 | +//! DataFusion supports extension at many points: |
| 135 | +//! |
| 136 | +//! * read from any datasource ([`TableProvider`]) |
| 137 | +//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`]) |
| 138 | +//! * build your own query language or plans ([`LogicalPlanBuilder`]) |
| 139 | +//! * declare and use user-defined scalar functions ([`ScalarUDF`]) |
| 140 | +//! * declare and use user-defined aggregate functions ([`AggregateUDF`]) |
| 141 | +//! * add custom optimizer rewrite passes ([`OptimizerRule`] and [`PhysicalOptimizerRule`]) |
| 142 | +//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`]) |
| 143 | +//! |
| 144 | +//! You can find examples of each of them in the [datafusion-examples] directory. |
| 145 | +//! |
| 146 | +//! [`TableProvider`]: crate::datasource::TableProvider |
| 147 | +//! [`CatalogProvider`]: crate::catalog::catalog::CatalogProvider |
| 148 | +//! [`LogicalPlanBuilder`]: datafusion_expr::logical_plan::builder::LogicalPlanBuilder |
| 149 | +//! [`ScalarUDF`]: physical_plan::udf::ScalarUDF |
| 150 | +//! [`AggregateUDF`]: physical_plan::udaf::AggregateUDF |
| 151 | +//! [`QueryPlanner`]: execution::context::QueryPlanner |
| 152 | +//! [`OptimizerRule`]: datafusion_optimizer::optimizer::OptimizerRule |
| 153 | +//! [`PhysicalOptimizerRule`]: crate::physical_optimizer::optimizer::PhysicalOptimizerRule |
| 154 | +//! |
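As a toy illustration of the user-defined scalar function extension point, a registry can map function names to implementations. This is a stdlib-only sketch: `ScalarFn`, `add_one`, and the registry shape are invented for illustration and are not DataFusion's real [`ScalarUDF`] API.

```rust
use std::collections::HashMap;

// Invented stand-in for a user-defined scalar function: take a column
// of i64 values and return a new column of the same length.
type ScalarFn = fn(&[i64]) -> Vec<i64>;

// An example "UDF" that adds one to every value in its input column.
fn add_one(args: &[i64]) -> Vec<i64> {
    args.iter().map(|v| v + 1).collect()
}

fn main() {
    // A toy registry keyed by name, standing in for the real
    // ScalarUDF registration machinery.
    let mut registry: HashMap<&str, ScalarFn> = HashMap::new();
    registry.insert("add_one", add_one);

    let out = registry["add_one"](&[1, 2, 3]);
    println!("{:?}", out);
}
```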
| 155 | +//! # Code Organization |
| 156 | +//! |
| 157 | +//! ## Overview Presentations |
| 158 | +//! |
| 159 | +//! The following presentations offer high level overviews of the |
| 160 | +//! different components and how they interact together. |
| 161 | +//! |
| 162 | +//! - [Apr 2023]: The _Apache Arrow DataFusion Architecture_ talks |
| 163 | +//! - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p) |
| 164 | +//! - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30) |
| 165 | +//! - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg) |
| 167 | +//! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165) |
| 168 | +//! - [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934) |
| 169 | +//! - [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ) |
| 170 | +//! |
| 171 | +//! ## Architecture |
104 | 172 | //! |
105 | 173 | //! DataFusion is a fully fledged query engine capable of performing complex operations. |
106 | 174 | //! Specifically, when DataFusion receives an SQL query, there are different steps |
107 | 175 | //! that it passes through until a result is obtained. Broadly, they are: |
108 | 176 | //! |
109 | | -//! 1. The string is parsed to an Abstract syntax tree (AST) using [sqlparser](https://docs.rs/sqlparser/latest/sqlparser/). |
110 | | -//! 2. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical expressions on the AST to logical expressions [`Expr`s](datafusion_expr::Expr). |
111 | | -//! 3. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical nodes on the AST to a [`LogicalPlan`](datafusion_expr::LogicalPlan). |
112 | | -//! 4. [`OptimizerRules`](optimizer::optimizer::OptimizerRule) are applied to the [`LogicalPlan`](datafusion_expr::LogicalPlan) to optimize it. |
113 | | -//! 5. The [`LogicalPlan`](datafusion_expr::LogicalPlan) is converted to an [`ExecutionPlan`](physical_plan::ExecutionPlan) by a [`PhysicalPlanner`](physical_plan::PhysicalPlanner) |
114 | | -//! 6. The [`ExecutionPlan`](physical_plan::ExecutionPlan) is executed against data through the [`SessionContext`](execution::context::SessionContext) |
| 177 | +//! 1. The query string is parsed into an abstract syntax tree (AST) using [sqlparser]. |
| 178 | +//! 2. The planner [`SqlToRel`] converts SQL expressions in the AST to logical [`Expr`]s. |
| 179 | +//! 3. The planner [`SqlToRel`] converts AST nodes to a [`LogicalPlan`]. |
| 180 | +//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it. |
| 181 | +//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a [`PhysicalPlanner`]. |
| 182 | +//! 6. The [`ExecutionPlan`] is executed against data through the [`SessionContext`]. |
115 | 183 | //! |
116 | | -//! With a [`DataFrame`](dataframe::DataFrame) API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`](datafusion_expr::LogicalPlan) directly. |
| 184 | +//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`] directly. |
117 | 185 | //! |
118 | 186 | //! Phases 1-5 are typically cheap when compared to phase 6, and thus DataFusion puts a |
119 | 187 | //! lot of effort into ensuring that phase 6 runs efficiently and without errors. |
120 | 188 | //! |
121 | 189 | //! DataFusion's planning is divided into two main parts: logical planning and physical planning. |
122 | 190 | //! |
123 | | -//! ### Logical plan |
| 191 | +//! ### Logical planning |
124 | 192 | //! |
125 | | -//! Logical planning yields [`logical plans`](datafusion_expr::LogicalPlan) and [`logical expressions`](datafusion_expr::Expr). |
126 | | -//! These are [`Schema`](arrow::datatypes::Schema)-aware traits that represent statements whose result is independent of how it should physically be executed. |
| 193 | +//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`] |
| 194 | +//! expressions, which are [`Schema`]-aware and represent statements |
| 195 | +//! whose result is independent of how they are physically |
| 196 | +//! executed. |
127 | 197 | //! |
128 | | -//! A [`LogicalPlan`](datafusion_expr::LogicalPlan) is a Directed Acyclic Graph (DAG) of other [`LogicalPlan`s](datafusion_expr::LogicalPlan) and each node contains logical expressions ([`Expr`s](logical_expr::Expr)). |
129 | | -//! All of these are located in [`datafusion_expr`](datafusion_expr). |
| 198 | +//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other |
| 199 | +//! [`LogicalPlan`]s, and each node contains [`Expr`]s. All of these |
| 200 | +//! are located in the [`datafusion_expr`] module. |
130 | 201 | //! |
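The node-plus-expression structure can be sketched with a miniature, invented plan type (illustrative only; the real types live in [`datafusion_expr`] and are far richer):

```rust
// A toy expression tree, standing in for datafusion_expr::Expr.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

// A toy plan whose Filter node holds an expression tree, mirroring how
// each real LogicalPlan node contains Exprs.
#[derive(Debug)]
enum Plan {
    Scan { table: String },
    Filter { predicate: Expr, input: Box<Plan> },
}

// Count the plan nodes in the DAG (here a simple chain).
fn node_count(plan: &Plan) -> usize {
    match plan {
        Plan::Scan { .. } => 1,
        Plan::Filter { input, .. } => 1 + node_count(input),
    }
}

fn main() {
    // Roughly: SELECT * FROM users WHERE age > 21
    let plan = Plan::Filter {
        predicate: Expr::Gt(
            Box::new(Expr::Column("age".into())),
            Box::new(Expr::Literal(21)),
        ),
        input: Box::new(Plan::Scan { table: "users".into() }),
    };
    println!("{:?} has {} nodes", plan, node_count(&plan));
}
```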
131 | | -//! ### Physical plan |
| 202 | +//! ### Physical planning |
132 | 203 | //! |
133 | | -//! A Physical plan ([`ExecutionPlan`](physical_plan::ExecutionPlan)) is a plan that can be executed against data. |
134 | | -//! Contrarily to a logical plan, the physical plan has concrete information about how the calculation |
135 | | -//! should be performed (e.g. what Rust functions are used) and how data should be loaded into memory. |
| 204 | +//! An [`ExecutionPlan`] (sometimes referred to as a "physical plan") |
| 205 | +//! is a plan that can be executed against data. Compared to a |
| 206 | +//! logical plan, the physical plan has concrete information about how |
| 207 | +//! calculations should be performed (e.g. what Rust functions are |
| 208 | +//! used) and how data should be loaded into memory. |
136 | 209 | //! |
137 | | -//! [`ExecutionPlan`](physical_plan::ExecutionPlan) uses the Arrow format as its in-memory representation of data, through the [arrow] crate. |
138 | | -//! We recommend going through [its documentation](arrow) for details on how the data is physically represented. |
| 210 | +//! [`ExecutionPlan`]s use the [Apache Arrow] format as their in-memory |
| 211 | +//! representation of data, through the [arrow] crate. The [arrow] |
| 212 | +//! crate documents how the memory is physically represented. |
139 | 213 | //! |
140 | | -//! A [`ExecutionPlan`](physical_plan::ExecutionPlan) is composed by nodes (implement the trait [`ExecutionPlan`](physical_plan::ExecutionPlan)), |
141 | | -//! and each node is composed by physical expressions ([`PhysicalExpr`](physical_plan::PhysicalExpr)) |
142 | | -//! or aggreagate expressions ([`AggregateExpr`](physical_plan::AggregateExpr)). |
143 | | -//! All of these are located in the module [`physical_plan`](physical_plan). |
| 214 | +//! An [`ExecutionPlan`] is composed of nodes (each of which |
| 215 | +//! implements the [`ExecutionPlan`] trait). Each node can contain |
| 216 | +//! physical expressions ([`PhysicalExpr`]) or aggregate expressions |
| 217 | +//! ([`AggregateExpr`]). All of these are located in the |
| 218 | +//! [`physical_plan`] module. |
144 | 219 | //! |
145 | 220 | //! Broadly speaking, |
146 | 221 | //! |
147 | | -//! * an [`ExecutionPlan`](physical_plan::ExecutionPlan) receives a partition number and asynchronously returns |
148 | | -//! an iterator over [`RecordBatch`](arrow::record_batch::RecordBatch) |
149 | | -//! (a node-specific struct that implements [`RecordBatchReader`](arrow::record_batch::RecordBatchReader)) |
150 | | -//! * a [`PhysicalExpr`](physical_plan::PhysicalExpr) receives a [`RecordBatch`](arrow::record_batch::RecordBatch) |
151 | | -//! and returns an [`Array`](arrow::array::Array) |
152 | | -//! * an [`AggregateExpr`](physical_plan::AggregateExpr) receives [`RecordBatch`es](arrow::record_batch::RecordBatch) |
153 | | -//! and returns a [`RecordBatch`](arrow::record_batch::RecordBatch) of a single row(*) |
| 222 | +//! * an [`ExecutionPlan`] receives a partition number and |
| 223 | +//! asynchronously returns an iterator over [`RecordBatch`] (a |
| 224 | +//! node-specific struct that implements [`RecordBatchReader`]) |
| 225 | +//! * a [`PhysicalExpr`] receives a [`RecordBatch`] |
| 226 | +//! and returns an [`Array`] |
| 227 | +//! * an [`AggregateExpr`] receives a series of [`RecordBatch`]es |
| 228 | +//! and returns a [`RecordBatch`] of a single row(*) |
154 | 229 | //! |
155 | 230 | //! (*) Technically, it aggregates the results on each partition and then merges the results into a single partition. |
156 | 231 | //! |
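The partial-aggregate-then-merge pattern in the footnote can be sketched with plain integers (an invented SUM example, not DataFusion's actual accumulator API):

```rust
// Aggregate one partition independently.
fn partial_sum(partition: &[i64]) -> i64 {
    partition.iter().sum()
}

// Merge the per-partition partial results into the final value.
fn merge(partials: &[i64]) -> i64 {
    partials.iter().sum()
}

fn main() {
    // Two partitions of the same logical table.
    let partitions: Vec<Vec<i64>> = vec![vec![1, 2, 3], vec![4, 5]];
    let partials: Vec<i64> = partitions.iter().map(|p| partial_sum(p)).collect();
    println!("sum = {}", merge(&partials));
}
```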
|
173 | 248 | //! * Scan from memory: [`MemoryExec`](physical_plan::memory::MemoryExec) |
174 | 249 | //! * Explain the plan: [`ExplainExec`](physical_plan::explain::ExplainExec) |
175 | 250 | //! |
176 | | -//! ## Customize |
177 | | -//! |
178 | | -//! DataFusion allows users to |
179 | | -//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`](execution::context::QueryPlanner)) |
180 | | -//! * declare and use user-defined scalar functions ([`ScalarUDF`](physical_plan::udf::ScalarUDF)) |
181 | | -//! * declare and use user-defined aggregate functions ([`AggregateUDF`](physical_plan::udaf::AggregateUDF)) |
182 | | -//! |
183 | | -//! You can find examples of each of them in examples section. |
184 | | -//! |
185 | | -//! ## Examples |
186 | | -//! |
187 | | -//! Examples are located in [datafusion-examples directory](https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples) |
188 | | -//! |
189 | | -//! Here's how to run them |
190 | | -//! |
191 | | -//! ```bash |
192 | | -//! git clone https://github.com/apache/arrow-datafusion |
193 | | -//! cd arrow-datafusion |
194 | | -//! # Download test data |
195 | | -//! git submodule update --init |
196 | | -//! |
197 | | -//! cargo run --example csv_sql |
198 | | -//! |
199 | | -//! cargo run --example parquet_sql |
200 | | -//! |
201 | | -//! cargo run --example dataframe |
202 | | -//! |
203 | | -//! cargo run --example dataframe_in_memory |
204 | | -//! |
205 | | -//! cargo run --example simple_udaf |
206 | | -//! |
207 | | -//! cargo run --example simple_udf |
208 | | -//! ``` |
| 251 | +//! Future topics (coming soon): |
| 252 | +//! * Analyzer Rules |
| 253 | +//! * Resource management (memory and disk) |
| 254 | +//! |
| 255 | +//! [sqlparser]: https://docs.rs/sqlparser/latest/sqlparser |
| 256 | +//! [`SqlToRel`]: sql::planner::SqlToRel |
| 257 | +//! [`Expr`]: datafusion_expr::Expr |
| 258 | +//! [`LogicalPlan`]: datafusion_expr::LogicalPlan |
| 259 | +//! [`OptimizerRule`]: optimizer::optimizer::OptimizerRule |
| 260 | +//! [`ExecutionPlan`]: physical_plan::ExecutionPlan |
| 261 | +//! [`PhysicalPlanner`]: physical_plan::PhysicalPlanner |
| 262 | +//! [`Schema`]: arrow::datatypes::Schema |
| 263 | +//! [`datafusion_expr`]: datafusion_expr |
| 264 | +//! [`PhysicalExpr`]: physical_plan::PhysicalExpr |
| 265 | +//! [`AggregateExpr`]: physical_plan::AggregateExpr |
| 266 | +//! [`RecordBatch`]: arrow::record_batch::RecordBatch |
| 267 | +//! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader |
| 268 | +//! [`Array`]: arrow::array::Array |
209 | 269 |
|
210 | 270 | /// DataFusion crate version |
211 | 271 | pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION"); |
|