
Commit e9edd0c

Improve contributor guide (#5921)
Move some content into the code organization section
1 parent 570a1cf commit e9edd0c

File tree

8 files changed: +169 -92 lines changed


datafusion-examples/README.md

Lines changed: 16 additions & 1 deletion
@@ -21,10 +21,25 @@
 
 This crate includes several examples of how to use various DataFusion APIs and help you on your way.
 
-Prerequisites:
+## Prerequisites:
 
 Run `git submodule update --init` to init test files.
 
+## Running Examples
+
+To run the examples, use the `cargo run` command, such as:
+
+```bash
+git clone https://github.com/apache/arrow-datafusion
+cd arrow-datafusion
+# Download test data
+git submodule update --init
+
+# Run the `csv_sql` example:
+# ... use the equivalent for other examples
+cargo run --example csv_sql
+```
+
 ## Single Process
 
 - [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL statement against a local AVRO file

datafusion/common/src/tree_node.rs

Lines changed: 2 additions & 2 deletions
@@ -289,7 +289,7 @@ impl<T> Transformed<T> {
 /// Helper trait for implementing [`TreeNode`] that have children stored as Arc's
 ///
 /// If some trait object, such as `dyn T`, implements this trait,
-/// its related Arc<dyn T> will automatically implement [`TreeNode`]
+/// its related `Arc<dyn T>` will automatically implement [`TreeNode`]
 pub trait DynTreeNode {
     /// Returns all children of the specified TreeNode
     fn arc_children(&self) -> Vec<Arc<Self>>;
@@ -303,7 +303,7 @@ pub trait DynTreeNode {
 }
 
 /// Blanket implementation for Arc for any type that implements
-/// [`DynTreeNode`] (such as Arc<dyn PhysicalExpr>)
+/// [`DynTreeNode`] (such as [`Arc<dyn PhysicalExpr>`])
 impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
     fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion>
     where
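
The blanket implementation documented above is a useful Rust pattern in its own right. Below is a self-contained sketch of the idea with toy `DynNode`/`Walk` traits (hypothetical names, not DataFusion's actual `DynTreeNode`/`TreeNode` definitions): implementing the object-safe trait once gives every `Arc<dyn ...>` the tree-walking API for free.

```rust
use std::sync::Arc;

// Toy stand-in for `DynTreeNode`: object-safe, children stored as Arcs.
trait DynNode {
    fn children(&self) -> Vec<Arc<dyn DynNode>>;
    fn name(&self) -> &str;
}

// Toy stand-in for `TreeNode`: a generic tree-walking API.
trait Walk {
    fn visit(&self, f: &mut dyn FnMut(&str));
}

// Blanket impl: any `Arc<T>` for `T: DynNode + ?Sized` (including
// `Arc<dyn DynNode>`) automatically gets the walking API.
impl<T: DynNode + ?Sized> Walk for Arc<T> {
    fn visit(&self, f: &mut dyn FnMut(&str)) {
        f(self.name());
        for child in self.children() {
            child.visit(f);
        }
    }
}

struct Leaf(&'static str);
impl DynNode for Leaf {
    fn children(&self) -> Vec<Arc<dyn DynNode>> {
        vec![]
    }
    fn name(&self) -> &str {
        self.0
    }
}

fn main() {
    let node: Arc<dyn DynNode> = Arc::new(Leaf("expr"));
    // The trait object gets `visit` without any per-type impl.
    node.visit(&mut |name| println!("{name}"));
}
```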

datafusion/core/src/lib.rs

Lines changed: 132 additions & 72 deletions
@@ -16,16 +16,37 @@
 // under the License.
 #![warn(missing_docs, clippy::needless_borrow)]
 
-//! [DataFusion](https://github.com/apache/arrow-datafusion)
-//! is an extensible query execution framework that uses
-//! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
+//! [DataFusion] is an extensible query engine written in Rust that
+//! uses [Apache Arrow] as its in-memory format. DataFusion's [use
+//! cases] include building very fast database and analytic systems,
+//! customized to particular workloads.
 //!
-//! DataFusion supports both an SQL and a DataFrame API for building logical query plans
-//! as well as a query optimizer and execution engine capable of parallel execution
-//! against partitioned data sources (CSV and Parquet) using threads.
+//! "Out of the box," DataFusion quickly runs complex [SQL] and
+//! [`DataFrame`] queries using a sophisticated query planner, a columnar,
+//! multi-threaded, vectorized execution engine, and partitioned data
+//! sources (Parquet, CSV, JSON, and Avro).
 //!
-//! Below is an example of how to execute a query against data stored
-//! in a CSV file using a [`DataFrame`](dataframe::DataFrame):
+//! DataFusion can also be easily customized to support additional
+//! data sources, query languages, functions, custom operators and
+//! more.
+//!
+//! [DataFusion]: https://arrow.apache.org/datafusion/
+//! [Apache Arrow]: https://arrow.apache.org
+//! [use cases]: https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases
+//! [SQL]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+//! [`DataFrame`]: dataframe::DataFrame
+//!
+//! # Examples
+//!
+//! The main entry point for interacting with DataFusion is the
+//! [`SessionContext`].
+//!
+//! [`SessionContext`]: execution::context::SessionContext
+//!
+//! ## DataFrame
+//!
+//! To execute a query against data stored
+//! in a CSV file using a [`DataFrame`]:
 //!
 //! ```rust
 //! # use datafusion::prelude::*;
@@ -64,7 +85,9 @@
 //! # }
 //! ```
 //!
-//! and how to execute a query against a CSV using SQL:
+//! ## SQL
+//!
+//! To execute a query against a CSV file using [SQL]:
 //!
 //! ```
 //! # use datafusion::prelude::*;
@@ -100,57 +123,109 @@
 //! # }
 //! ```
 //!
-//! ## Parse, Plan, Optimize, Execute
+//! ## More Examples
+//!
+//! There are many additional annotated examples of using DataFusion in the [datafusion-examples] directory.
+//!
+//! [datafusion-examples]: https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples
+//!
+//! ## Customization and Extension
+//!
+//! DataFusion supports extension at many points:
+//!
+//! * read from any datasource ([`TableProvider`])
+//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`])
+//! * build your own query language or plans using the [`LogicalPlanBuilder`]
+//! * declare and use user-defined scalar functions ([`ScalarUDF`])
+//! * declare and use user-defined aggregate functions ([`AggregateUDF`])
+//! * add custom optimizer rewrite passes ([`OptimizerRule`] and [`PhysicalOptimizerRule`])
+//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`])
+//!
+//! You can find examples of each of them in the [datafusion-examples] directory.
+//!
+//! [`TableProvider`]: crate::datasource::TableProvider
+//! [`CatalogProvider`]: crate::catalog::catalog::CatalogProvider
+//! [`LogicalPlanBuilder`]: datafusion_expr::logical_plan::builder::LogicalPlanBuilder
+//! [`ScalarUDF`]: physical_plan::udf::ScalarUDF
+//! [`AggregateUDF`]: physical_plan::udaf::AggregateUDF
+//! [`QueryPlanner`]: execution::context::QueryPlanner
+//! [`OptimizerRule`]: datafusion_optimizer::optimizer::OptimizerRule
+//! [`PhysicalOptimizerRule`]: datafusion::physical_optimizer::optimizer::PhysicalOptimizerRule
+//!
+//! # Code Organization
+//!
+//! ## Overview Presentations
+//!
+//! The following presentations offer high level overviews of the
+//! different components and how they interact together.
+//!
+//! - [Apr 2023]: The Apache Arrow DataFusion Architecture talks
+//!   - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+//!   - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+//!   - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+//! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
+//! - [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+//! - [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+//!
+//! ## Architecture
 //!
 //! DataFusion is a fully fledged query engine capable of performing complex operations.
 //! Specifically, when DataFusion receives an SQL query, there are different steps
 //! that it passes through until a result is obtained. Broadly, they are:
 //!
-//! 1. The string is parsed to an Abstract syntax tree (AST) using [sqlparser](https://docs.rs/sqlparser/latest/sqlparser/).
-//! 2. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical expressions on the AST to logical expressions [`Expr`s](datafusion_expr::Expr).
-//! 3. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical nodes on the AST to a [`LogicalPlan`](datafusion_expr::LogicalPlan).
-//! 4. [`OptimizerRules`](optimizer::optimizer::OptimizerRule) are applied to the [`LogicalPlan`](datafusion_expr::LogicalPlan) to optimize it.
-//! 5. The [`LogicalPlan`](datafusion_expr::LogicalPlan) is converted to an [`ExecutionPlan`](physical_plan::ExecutionPlan) by a [`PhysicalPlanner`](physical_plan::PhysicalPlanner)
-//! 6. The [`ExecutionPlan`](physical_plan::ExecutionPlan) is executed against data through the [`SessionContext`](execution::context::SessionContext)
+//! 1. The string is parsed to an Abstract Syntax Tree (AST) using [sqlparser].
+//! 2. The planner [`SqlToRel`] converts logical expressions in the AST to logical expressions ([`Expr`]s).
+//! 3. The planner [`SqlToRel`] converts logical nodes in the AST to a [`LogicalPlan`].
+//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
+//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a [`PhysicalPlanner`].
+//! 6. The [`ExecutionPlan`] is executed against data through the [`SessionContext`].
 //!
-//! With a [`DataFrame`](dataframe::DataFrame) API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`](datafusion_expr::LogicalPlan) directly.
+//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`] directly.
 //!
 //! Phases 1-5 are typically cheap when compared to phase 6, and thus DataFusion puts a
 //! lot of effort to ensure that phase 6 runs efficiently and without errors.
 //!
 //! DataFusion's planning is divided in two main parts: logical planning and physical planning.
 //!
-//! ### Logical plan
+//! ### Logical planning
 //!
-//! Logical planning yields [`logical plans`](datafusion_expr::LogicalPlan) and [`logical expressions`](datafusion_expr::Expr).
-//! These are [`Schema`](arrow::datatypes::Schema)-aware traits that represent statements whose result is independent of how it should physically be executed.
+//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
+//! expressions, which are [`Schema`]-aware and represent statements
+//! whose result is independent of how it should physically be
+//! executed.
 //!
-//! A [`LogicalPlan`](datafusion_expr::LogicalPlan) is a Directed Acyclic Graph (DAG) of other [`LogicalPlan`s](datafusion_expr::LogicalPlan) and each node contains logical expressions ([`Expr`s](logical_expr::Expr)).
-//! All of these are located in [`datafusion_expr`](datafusion_expr).
+//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
+//! [`LogicalPlan`]s, and each node contains [`Expr`]s. All of these
+//! are located in the [`datafusion_expr`] module.
 //!
-//! ### Physical plan
+//! ### Physical planning
 //!
-//! A Physical plan ([`ExecutionPlan`](physical_plan::ExecutionPlan)) is a plan that can be executed against data.
-//! Contrarily to a logical plan, the physical plan has concrete information about how the calculation
-//! should be performed (e.g. what Rust functions are used) and how data should be loaded into memory.
+//! An [`ExecutionPlan`] (sometimes referred to as a "physical plan")
+//! is a plan that can be executed against data. Compared to a
+//! logical plan, the physical plan has concrete information about how
+//! calculations should be performed (e.g. what Rust functions are
+//! used) and how data should be loaded into memory.
 //!
-//! [`ExecutionPlan`](physical_plan::ExecutionPlan) uses the Arrow format as its in-memory representation of data, through the [arrow] crate.
-//! We recommend going through [its documentation](arrow) for details on how the data is physically represented.
+//! [`ExecutionPlan`]s use the [Apache Arrow] format as their in-memory
+//! representation of data, through the [arrow] crate. The [arrow]
+//! crate documents how the memory is physically represented.
 //!
-//! A [`ExecutionPlan`](physical_plan::ExecutionPlan) is composed by nodes (implement the trait [`ExecutionPlan`](physical_plan::ExecutionPlan)),
-//! and each node is composed by physical expressions ([`PhysicalExpr`](physical_plan::PhysicalExpr))
-//! or aggreagate expressions ([`AggregateExpr`](physical_plan::AggregateExpr)).
-//! All of these are located in the module [`physical_plan`](physical_plan).
+//! An [`ExecutionPlan`] is composed of nodes (which each implement the
+//! [`ExecutionPlan`] trait). Each node can contain physical
+//! expressions ([`PhysicalExpr`]) or aggregate expressions
+//! ([`AggregateExpr`]). All of these are located in the
+//! [`physical_plan`] module.
 //!
 //! Broadly speaking,
 //!
-//! * an [`ExecutionPlan`](physical_plan::ExecutionPlan) receives a partition number and asynchronously returns
-//!   an iterator over [`RecordBatch`](arrow::record_batch::RecordBatch)
-//!   (a node-specific struct that implements [`RecordBatchReader`](arrow::record_batch::RecordBatchReader))
-//! * a [`PhysicalExpr`](physical_plan::PhysicalExpr) receives a [`RecordBatch`](arrow::record_batch::RecordBatch)
-//!   and returns an [`Array`](arrow::array::Array)
-//! * an [`AggregateExpr`](physical_plan::AggregateExpr) receives [`RecordBatch`es](arrow::record_batch::RecordBatch)
-//!   and returns a [`RecordBatch`](arrow::record_batch::RecordBatch) of a single row(*)
+//! * an [`ExecutionPlan`] receives a partition number and
+//!   asynchronously returns an iterator over [`RecordBatch`] (a
+//!   node-specific struct that implements [`RecordBatchReader`])
+//! * a [`PhysicalExpr`] receives a [`RecordBatch`]
+//!   and returns an [`Array`]
+//! * an [`AggregateExpr`] receives a series of [`RecordBatch`]es
+//!   and returns a [`RecordBatch`] of a single row(*)
 //!
 //! (*) Technically, it aggregates the results on each partition and then merges the results into a single partition.
 //!
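
As a sketch of how the six steps above look when driven explicitly rather than through `ctx.sql(...)`: the following minimal example walks parse, plan, optimize, and execute by hand. It is hedged, not definitive: the `SessionState` method names (`create_logical_plan`, `optimize`, `create_physical_plan`) and the `tests/data/example.csv` sample path reflect DataFusion of roughly this era and may differ in other versions.

```rust
use datafusion::error::Result;
use datafusion::physical_plan::collect;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new())
        .await?;

    let state = ctx.state();
    // Steps 1-3: parse the SQL string and plan it into a LogicalPlan
    let logical_plan = state
        .create_logical_plan("SELECT a, MIN(b) FROM example GROUP BY a")
        .await?;
    // Step 4: apply OptimizerRules (e.g. predicate pushdown)
    let optimized_plan = state.optimize(&logical_plan)?;
    // Step 5: convert the LogicalPlan to an ExecutionPlan
    let physical_plan = state.create_physical_plan(&optimized_plan).await?;
    // Step 6: execute, collecting all result RecordBatches
    let batches = collect(physical_plan, ctx.task_ctx()).await?;
    println!("{} batch(es)", batches.len());
    Ok(())
}
```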
@@ -173,39 +248,24 @@
 //! * Scan from memory: [`MemoryExec`](physical_plan::memory::MemoryExec)
 //! * Explain the plan: [`ExplainExec`](physical_plan::explain::ExplainExec)
 //!
-//! ## Customize
-//!
-//! DataFusion allows users to
-//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`](execution::context::QueryPlanner))
-//! * declare and use user-defined scalar functions ([`ScalarUDF`](physical_plan::udf::ScalarUDF))
-//! * declare and use user-defined aggregate functions ([`AggregateUDF`](physical_plan::udaf::AggregateUDF))
-//!
-//! You can find examples of each of them in examples section.
-//!
-//! ## Examples
-//!
-//! Examples are located in [datafusion-examples directory](https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples)
-//!
-//! Here's how to run them
-//!
-//! ```bash
-//! git clone https://github.com/apache/arrow-datafusion
-//! cd arrow-datafusion
-//! # Download test data
-//! git submodule update --init
-//!
-//! cargo run --example csv_sql
-//!
-//! cargo run --example parquet_sql
-//!
-//! cargo run --example dataframe
-//!
-//! cargo run --example dataframe_in_memory
-//!
-//! cargo run --example simple_udaf
-//!
-//! cargo run --example simple_udf
-//! ```
+//! Future topics (coming soon):
+//! * Analyzer Rules
+//! * Resource management (memory and disk)
+//!
+//! [sqlparser]: https://docs.rs/sqlparser/latest/sqlparser
+//! [`SqlToRel`]: sql::planner::SqlToRel
+//! [`Expr`]: datafusion_expr::Expr
+//! [`LogicalPlan`]: datafusion_expr::LogicalPlan
+//! [`OptimizerRule`]: optimizer::optimizer::OptimizerRule
+//! [`ExecutionPlan`]: physical_plan::ExecutionPlan
+//! [`PhysicalPlanner`]: physical_plan::PhysicalPlanner
+//! [`Schema`]: arrow::datatypes::Schema
+//! [`datafusion_expr`]: datafusion_expr
+//! [`PhysicalExpr`]: physical_plan::PhysicalExpr
+//! [`AggregateExpr`]: physical_plan::AggregateExpr
+//! [`RecordBatch`]: arrow::record_batch::RecordBatch
+//! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader
+//! [`Array`]: arrow::array::Array
 
 /// DataFusion crate version
 pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
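
To make the doc changes above concrete: the `DataFrame` example elided in this diff boils down to something like the following minimal sketch (it assumes the repository's `tests/data/example.csv` sample file, downloaded via the git submodule step described earlier):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // The SessionContext is the main entry point mentioned in the docs
    let ctx = SessionContext::new();
    // Read the CSV, then express the query with DataFrame methods
    let df = ctx
        .read_csv("tests/data/example.csv", CsvReadOptions::new())
        .await?;
    let df = df
        .filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?;
    // Execute the plan and print results to stdout
    df.show().await?;
    Ok(())
}
```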

datafusion/expr/src/operator.rs

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ impl Operator {
 /// Return true if the operator is a numerical operator.
 ///
 /// For example, 'Binary(a, +, b)' would be a numerical expression.
-/// PostgresSQL concept: https://www.postgresql.org/docs/7.0/operators2198.htm
+/// PostgreSQL concept: <https://www.postgresql.org/docs/7.0/operators2198.htm>
 pub fn is_numerical_operators(&self) -> bool {
     matches!(
         self,
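
The context here shows `is_numerical_operators` built on the `matches!` macro. A generic illustration of that enum-predicate pattern, using a hypothetical cut-down `Op` enum rather than DataFusion's full `Operator`:

```rust
#[derive(Debug)]
enum Op {
    Plus,
    Minus,
    Multiply,
    Divide,
    Eq,
    And,
}

impl Op {
    /// Return true if the operator is a numerical operator,
    /// mirroring the `matches!`-based predicate shown above.
    fn is_numerical(&self) -> bool {
        matches!(self, Op::Plus | Op::Minus | Op::Multiply | Op::Divide)
    }
}

fn main() {
    assert!(Op::Plus.is_numerical());
    assert!(!Op::Eq.is_numerical());
}
```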

datafusion/physical-expr/src/sort_expr.rs

Lines changed: 1 addition & 1 deletion
@@ -195,7 +195,7 @@ impl PhysicalSortRequirement {
 ///
 /// This function converts `PhysicalSortRequirement` to `PhysicalSortExpr`
 /// for each entry in the input. If required ordering is None for an entry,
-/// the default ordering `ASC, NULLS LAST` is used (see [`into_sort_expr`])
+/// the default ordering `ASC, NULLS LAST` is used (see [`Self::into_sort_expr`])
 pub fn to_sort_exprs(
     requirements: impl IntoIterator<Item = PhysicalSortRequirement>,
 ) -> Vec<PhysicalSortExpr> {
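
The defaulting behavior described in that comment can be pictured with a self-contained toy model; the types below are simplified stand-ins for `PhysicalSortRequirement` and `PhysicalSortExpr`, not the real definitions:

```rust
/// Simplified stand-in for arrow's `SortOptions`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Toy requirement: ordering options may be left unspecified.
struct SortRequirement {
    column: String,
    options: Option<SortOptions>,
}

/// Toy concrete sort expression: options are always known.
struct SortExpr {
    column: String,
    options: SortOptions,
}

/// Convert requirements to concrete sort expressions, filling in
/// `ASC, NULLS LAST` (not descending, nulls last) when unspecified.
fn to_sort_exprs(reqs: impl IntoIterator<Item = SortRequirement>) -> Vec<SortExpr> {
    reqs.into_iter()
        .map(|r| SortExpr {
            column: r.column,
            options: r.options.unwrap_or(SortOptions {
                descending: false,
                nulls_first: false,
            }),
        })
        .collect()
}

fn main() {
    let exprs = to_sort_exprs([SortRequirement {
        column: "a".to_string(),
        options: None,
    }]);
    // The unspecified requirement picked up the ASC, NULLS LAST default
    assert_eq!(
        exprs[0].options,
        SortOptions { descending: false, nulls_first: false }
    );
}
```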

datafusion/sql/src/planner.rs

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ pub struct PlannerContext {
 /// in `PREPARE` statement
 prepare_param_data_types: Vec<DataType>,
 /// Map of CTE name to logical plan of the WITH clause.
-/// Use Arc<LogicalPlan> to allow cheap cloning
+/// Use `Arc<LogicalPlan>` to allow cheap cloning
 ctes: HashMap<String, Arc<LogicalPlan>>,
 /// The query schema of the outer query plan, used to resolve the columns in subquery
 outer_query_schema: Option<DFSchema>,
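
The "cheap cloning" note refers to `Arc`'s reference-counted sharing: cloning an `Arc<LogicalPlan>` copies a pointer and bumps a counter instead of deep-copying the plan tree. A tiny `std`-only illustration of that design choice:

```rust
use std::sync::Arc;

fn main() {
    // Stand-in for a large plan tree that would be expensive to deep-copy
    let plan = Arc::new(vec!["Projection", "Filter", "TableScan"]);

    // `Arc::clone` copies a pointer and increments the refcount;
    // both handles share the same underlying allocation.
    let shared = Arc::clone(&plan);
    assert!(Arc::ptr_eq(&plan, &shared));
    assert_eq!(Arc::strong_count(&plan), 2);
}
```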

docs/source/contributor-guide/architecture.md

Lines changed: 4 additions & 9 deletions
@@ -19,13 +19,8 @@
 
 # Architecture
 
-There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
+DataFusion's code structure and organization are described in the
+[Crate Documentation], to keep it as close to the source as
+possible.
 
-- [Apr 2023]: The Apache Arrow DataFusion Architecture talks series by @alamb
-  - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
-  - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30/edit#slide=id.gbe21b752a6_0_218)
-  - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg/edit?usp=sharing)
-- [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
-- [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+[crate documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
