17 changes: 16 additions & 1 deletion datafusion-examples/README.md
@@ -21,10 +21,25 @@

This crate includes several examples showing how to use various DataFusion APIs to help you on your way.

Prerequisites:
## Prerequisites

Run `git submodule update --init` to initialize the test files.

## Running Examples

To run the examples, use the `cargo run` command, such as:

> **Contributor Author:** Moved from the main lib.rs guide

```bash
git clone https://github.com/apache/arrow-datafusion
cd arrow-datafusion
# Download test data
git submodule update --init

# Run the `csv_sql` example:
# ... use the equivalent for other examples
cargo run --example csv_sql
```

## Single Process

- [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL statement against a local AVRO file
4 changes: 2 additions & 2 deletions datafusion/common/src/tree_node.rs
@@ -289,7 +289,7 @@ impl<T> Transformed<T> {
/// Helper trait for implementing [`TreeNode`] for types that have children stored as `Arc`s
///
/// If some trait object, such as `dyn T`, implements this trait,
/// its related Arc<dyn T> will automatically implement [`TreeNode`]
/// its related `Arc<dyn T>` will automatically implement [`TreeNode`]
> **Contributor Author:** drive-by rustdoc cleanups
pub trait DynTreeNode {
/// Returns all children of the specified TreeNode
fn arc_children(&self) -> Vec<Arc<Self>>;
@@ -303,7 +303,7 @@ pub trait DynTreeNode {
}

/// Blanket implementation for Arc for any type that implements
/// [`DynTreeNode`] (such as Arc<dyn PhysicalExpr>)
/// [`DynTreeNode`] (such as [`Arc<dyn PhysicalExpr>`])
impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion>
where
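The blanket implementation described above is a standard Rust pattern: implement a small helper trait once (including for the trait object itself), and every `Arc<dyn T>` picks up the richer trait for free. A minimal, self-contained sketch of the pattern — the `PhysExpr`, `DynNode`, and `Walk` names below are illustrative stand-ins, not DataFusion's actual API:

```rust
use std::sync::Arc;

// A trait-object hierarchy, analogous to DataFusion's `dyn PhysicalExpr`.
trait PhysExpr {
    fn children(&self) -> Vec<Arc<dyn PhysExpr>>;
}

// The helper trait: describe children once...
trait DynNode {
    fn arc_children(&self) -> Vec<Arc<Self>>;
}

// ...implemented for the trait object itself (`Self` here is `dyn PhysExpr`).
impl DynNode for dyn PhysExpr {
    fn arc_children(&self) -> Vec<Arc<Self>> {
        self.children()
    }
}

// The richer trait, implemented once for every `Arc<T>` where `T: DynNode`
// (including the unsized `dyn PhysExpr`).
trait Walk {
    fn count_nodes(&self) -> usize;
}

impl<T: DynNode + ?Sized> Walk for Arc<T> {
    fn count_nodes(&self) -> usize {
        1 + self
            .arc_children()
            .iter()
            .map(|child| child.count_nodes())
            .sum::<usize>()
    }
}

struct Literal;
impl PhysExpr for Literal {
    fn children(&self) -> Vec<Arc<dyn PhysExpr>> {
        vec![]
    }
}

struct Add(Arc<dyn PhysExpr>, Arc<dyn PhysExpr>);
impl PhysExpr for Add {
    fn children(&self) -> Vec<Arc<dyn PhysExpr>> {
        vec![self.0.clone(), self.1.clone()]
    }
}

fn main() {
    let expr: Arc<dyn PhysExpr> = Arc::new(Add(Arc::new(Literal), Arc::new(Literal)));
    assert_eq!(expr.count_nodes(), 3); // the Add node plus two Literal leaves
}
```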
204 changes: 132 additions & 72 deletions datafusion/core/src/lib.rs
@@ -16,16 +16,37 @@
// under the License.
#![warn(missing_docs, clippy::needless_borrow)]

//! [DataFusion](https://github.com/apache/arrow-datafusion)
//! is an extensible query execution framework that uses
//! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
//! [DataFusion] is an extensible query engine written in Rust that
//! uses [Apache Arrow] as its in-memory format. DataFusion's [use
//! cases] include building very fast database and analytic systems,
//! customized to particular workloads.
//!
//! DataFusion supports both an SQL and a DataFrame API for building logical query plans
//! as well as a query optimizer and execution engine capable of parallel execution
//! against partitioned data sources (CSV and Parquet) using threads.
//! "Out of the box," DataFusion quickly runs complex [SQL] and
//! [`DataFrame`] queries using a sophisticated query planner, a columnar,
//! multi-threaded, vectorized execution engine, and partitioned data
//! sources (Parquet, CSV, JSON, and Avro).
//!
//! Below is an example of how to execute a query against data stored
//! in a CSV file using a [`DataFrame`](dataframe::DataFrame):
//! DataFusion can also be easily customized to support additional
//! data sources, query languages, functions, custom operators and
//! more.
//!
//! [DataFusion]: https://arrow.apache.org/datafusion/
//! [Apache Arrow]: https://arrow.apache.org
//! [use cases]: https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases
//! [SQL]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
//! [`DataFrame`]: dataframe::DataFrame
//!
//! # Examples
//!
//! The main entry point for interacting with DataFusion is the
//! [`SessionContext`].
//!
//! [`SessionContext`]: execution::context::SessionContext
//!
//! ## DataFrame
//!
//! To execute a query against data stored
//! in a CSV file using a [`DataFrame`]:
//!
//! ```rust
//! # use datafusion::prelude::*;
@@ -64,7 +85,9 @@
//! # }
//! ```
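//!
//! The body of that example is collapsed in the diff above; a minimal
//! sketch of such a query (the CSV path and column names are placeholders):
//!
//! ```rust
//! # use datafusion::prelude::*;
//! # #[tokio::main]
//! # async fn main() -> datafusion::error::Result<()> {
//! let ctx = SessionContext::new();
//! // read a CSV file into a DataFrame
//! let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
//! // filter and aggregate, all without writing any SQL
//! let df = df
//!     .filter(col("a").lt_eq(col("b")))?
//!     .aggregate(vec![col("a")], vec![min(col("b"))])?;
//! df.show().await?;
//! # Ok(())
//! # }
//! ```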
//!
//! and how to execute a query against a CSV using SQL:
//! ## SQL
//!
//! To execute a query against a CSV file using [SQL]:
//!
//! ```
//! # use datafusion::prelude::*;
@@ -100,57 +123,109 @@
//! # }
//! ```
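//!
//! The body of that example is likewise collapsed above; a minimal sketch
//! (the table name and CSV path are placeholders):
//!
//! ```rust
//! # use datafusion::prelude::*;
//! # #[tokio::main]
//! # async fn main() -> datafusion::error::Result<()> {
//! let ctx = SessionContext::new();
//! // register the CSV file as a table that SQL can refer to
//! ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
//! let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
//! df.show().await?;
//! # Ok(())
//! # }
//! ```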
//!
//! ## Parse, Plan, Optimize, Execute
//! ## More Examples
//!
//! There are many additional annotated examples of using DataFusion in the [datafusion-examples] directory.
//!
//! [datafusion-examples]: https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples
//!
//! ## Customization and Extension
> **Contributor Author:** I rearranged and expanded the customization points
//!
//! DataFusion supports extension at many points:
//!
//! * read from any datasource ([`TableProvider`])
//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`])
//! * build your own query language or plans using [`LogicalPlanBuilder`]
//! * declare and use user-defined scalar functions ([`ScalarUDF`])
//! * declare and use user-defined aggregate functions ([`AggregateUDF`])
//! * add custom optimizer rewrite passes ([`OptimizerRule`] and [`PhysicalOptimizerRule`])
//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`])
//!
//! You can find examples of each of them in the [datafusion-examples] directory.
//!
//! [`TableProvider`]: crate::datasource::TableProvider
//! [`CatalogProvider`]: crate::catalog::catalog::CatalogProvider
//! [`LogicalPlanBuilder`]: datafusion_expr::logical_plan::builder::LogicalPlanBuilder
//! [`ScalarUDF`]: physical_plan::udf::ScalarUDF
//! [`AggregateUDF`]: physical_plan::udaf::AggregateUDF
//! [`QueryPlanner`]: execution::context::QueryPlanner
//! [`OptimizerRule`]: datafusion_optimizer::optimizer::OptimizerRule
//! [`PhysicalOptimizerRule`]: crate::physical_optimizer::optimizer::PhysicalOptimizerRule
//!
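//! For example, declaring and registering a user-defined scalar function
//! ([`ScalarUDF`]) looks roughly like the sketch below; the `add_one`
//! function, its types, and the query are illustrative:
//!
//! ```rust
//! # use std::sync::Arc;
//! # use datafusion::arrow::array::{ArrayRef, Int64Array};
//! # use datafusion::arrow::datatypes::DataType;
//! # use datafusion::logical_expr::{create_udf, Volatility};
//! # use datafusion::physical_plan::functions::make_scalar_function;
//! # use datafusion::prelude::*;
//! # #[tokio::main]
//! # async fn main() -> datafusion::error::Result<()> {
//! let ctx = SessionContext::new();
//!
//! // the kernel: add 1 to every value of an Int64 input column
//! let add_one = make_scalar_function(|args: &[ArrayRef]| {
//!     let input = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
//!     let result: Int64Array = input.iter().map(|v| v.map(|v| v + 1)).collect();
//!     Ok(Arc::new(result) as ArrayRef)
//! });
//!
//! ctx.register_udf(create_udf(
//!     "add_one",                 // name to use in SQL and expressions
//!     vec![DataType::Int64],     // argument types
//!     Arc::new(DataType::Int64), // return type
//!     Volatility::Immutable,
//!     add_one,
//! ));
//!
//! let df = ctx.sql("SELECT add_one(41)").await?;
//! df.show().await?;
//! # Ok(())
//! # }
//! ```
//!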
//! # Code Organization
//!
//! ## Overview Presentations
//!
//! The following presentations offer high level overviews of the
//! different components and how they interact together.
//!
//! - [Apr 2023]: The Apache Arrow DataFusion Architecture talks
//! - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
//! - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
//! - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
//! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
//! - [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
//! - [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
//!
//! ## Architecture
//!
//! DataFusion is a fully fledged query engine capable of performing complex operations.
//! Specifically, when DataFusion receives an SQL query, it passes through a
//! series of steps until a result is obtained. Broadly, they are:
//!
//! 1. The string is parsed to an Abstract syntax tree (AST) using [sqlparser](https://docs.rs/sqlparser/latest/sqlparser/).
//! 2. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical expressions on the AST to logical expressions [`Expr`s](datafusion_expr::Expr).
//! 3. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical nodes on the AST to a [`LogicalPlan`](datafusion_expr::LogicalPlan).
//! 4. [`OptimizerRules`](optimizer::optimizer::OptimizerRule) are applied to the [`LogicalPlan`](datafusion_expr::LogicalPlan) to optimize it.
//! 5. The [`LogicalPlan`](datafusion_expr::LogicalPlan) is converted to an [`ExecutionPlan`](physical_plan::ExecutionPlan) by a [`PhysicalPlanner`](physical_plan::PhysicalPlanner)
//! 6. The [`ExecutionPlan`](physical_plan::ExecutionPlan) is executed against data through the [`SessionContext`](execution::context::SessionContext)
//! 1. The string is parsed into an abstract syntax tree (AST) using [sqlparser].
//! 2. The planner [`SqlToRel`] converts expressions in the AST into logical expressions ([`Expr`]s).
//! 3. The planner [`SqlToRel`] converts nodes in the AST into a [`LogicalPlan`].
//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a [`PhysicalPlanner`].
//! 6. The [`ExecutionPlan`] is executed against data through the [`SessionContext`].
//!
//! With a [`DataFrame`](dataframe::DataFrame) API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`](datafusion_expr::LogicalPlan) directly.
//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`] directly.
//!
//! Phases 1-5 are typically cheap when compared to phase 6, and thus DataFusion puts a
//! lot of effort into ensuring that phase 6 runs efficiently and without errors.
//!
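//! These steps can also be driven individually; a minimal sketch using the
//! `SessionState` API (the table name and CSV path are placeholders, and
//! method names are as of this version of DataFusion):
//!
//! ```rust
//! # use datafusion::physical_plan::collect;
//! # use datafusion::prelude::*;
//! # #[tokio::main]
//! # async fn main() -> datafusion::error::Result<()> {
//! let ctx = SessionContext::new();
//! ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
//! let state = ctx.state();
//! // steps 1-3: SQL text --> LogicalPlan
//! let logical_plan = state.create_logical_plan("SELECT a FROM example").await?;
//! // step 4: optimize the LogicalPlan
//! let optimized_plan = state.optimize(&logical_plan)?;
//! // step 5: LogicalPlan --> ExecutionPlan
//! let physical_plan = state.create_physical_plan(&optimized_plan).await?;
//! // step 6: execute, collecting the results into RecordBatches
//! let _batches = collect(physical_plan, ctx.task_ctx()).await?;
//! # Ok(())
//! # }
//! ```
//!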
//! DataFusion's planning is divided into two main parts: logical planning and physical planning.
//!
//! ### Logical plan
//! ### Logical planning
//!
//! Logical planning yields [`logical plans`](datafusion_expr::LogicalPlan) and [`logical expressions`](datafusion_expr::Expr).
//! These are [`Schema`](arrow::datatypes::Schema)-aware traits that represent statements whose result is independent of how it should physically be executed.
//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
//! expressions, which are [`Schema`]-aware and represent statements
//! whose results are independent of how they are physically
//! executed.
//!
//! A [`LogicalPlan`](datafusion_expr::LogicalPlan) is a Directed Acyclic Graph (DAG) of other [`LogicalPlan`s](datafusion_expr::LogicalPlan) and each node contains logical expressions ([`Expr`s](logical_expr::Expr)).
//! All of these are located in [`datafusion_expr`](datafusion_expr).
//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
//! [`LogicalPlan`]s, and each node contains [`Expr`]s. All of these
//! are located in the [`datafusion_expr`] module.
//!
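//! For example, the predicate `a < 5 AND b > 10` is itself a small tree of
//! [`Expr`] nodes, built here with the expression API (a sketch):
//!
//! ```rust
//! use datafusion::prelude::*;
//!
//! // col() and lit() create leaf Exprs; lt(), gt() and and() combine them
//! // into a tree of binary expressions
//! let filter = col("a").lt(lit(5)).and(col("b").gt(lit(10)));
//! println!("{filter}");
//! ```
//!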
//! ### Physical plan
//! ### Physical planning
//!
//! A Physical plan ([`ExecutionPlan`](physical_plan::ExecutionPlan)) is a plan that can be executed against data.
//! Contrarily to a logical plan, the physical plan has concrete information about how the calculation
//! should be performed (e.g. what Rust functions are used) and how data should be loaded into memory.
//! An [`ExecutionPlan`] (sometimes referred to as a "physical plan")
//! is a plan that can be executed against data. Compared to a
//! logical plan, the physical plan has concrete information about how
//! calculations should be performed (e.g. what Rust functions are
//! used) and how data should be loaded into memory.
//!
//! [`ExecutionPlan`](physical_plan::ExecutionPlan) uses the Arrow format as its in-memory representation of data, through the [arrow] crate.
//! We recommend going through [its documentation](arrow) for details on how the data is physically represented.
//! [`ExecutionPlan`]s use the [Apache Arrow] format as their in-memory
//! representation of data, through the [arrow] crate. The [arrow]
//! crate documents how the memory is physically represented.
//!
//! A [`ExecutionPlan`](physical_plan::ExecutionPlan) is composed by nodes (implement the trait [`ExecutionPlan`](physical_plan::ExecutionPlan)),
//! and each node is composed by physical expressions ([`PhysicalExpr`](physical_plan::PhysicalExpr))
//! or aggreagate expressions ([`AggregateExpr`](physical_plan::AggregateExpr)).
//! All of these are located in the module [`physical_plan`](physical_plan).
//! An [`ExecutionPlan`] is composed of nodes (each of which implements
//! the [`ExecutionPlan`] trait). Each node can contain physical
//! expressions ([`PhysicalExpr`]) or aggregate expressions
//! ([`AggregateExpr`]). All of these are located in the
//! [`physical_plan`] module.
//!
//! Broadly speaking,
//!
//! * an [`ExecutionPlan`](physical_plan::ExecutionPlan) receives a partition number and asynchronously returns
//! an iterator over [`RecordBatch`](arrow::record_batch::RecordBatch)
//! (a node-specific struct that implements [`RecordBatchReader`](arrow::record_batch::RecordBatchReader))
//! * a [`PhysicalExpr`](physical_plan::PhysicalExpr) receives a [`RecordBatch`](arrow::record_batch::RecordBatch)
//! and returns an [`Array`](arrow::array::Array)
//! * an [`AggregateExpr`](physical_plan::AggregateExpr) receives [`RecordBatch`es](arrow::record_batch::RecordBatch)
//! and returns a [`RecordBatch`](arrow::record_batch::RecordBatch) of a single row(*)
//! * an [`ExecutionPlan`] receives a partition number and
//! asynchronously returns an iterator over [`RecordBatch`] (a
//! node-specific struct that implements [`RecordBatchReader`])
//! * a [`PhysicalExpr`] receives a [`RecordBatch`]
//! and returns an [`Array`]
//! * an [`AggregateExpr`] receives a series of [`RecordBatch`]es
//! and returns a [`RecordBatch`] of a single row (*)
//!
//! (*) Technically, it aggregates the results on each partition and then merges the results into a single partition.
//!
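//! For example, a [`PhysicalExpr`] for `a + 1` can be built and evaluated
//! against a [`RecordBatch`] directly (a sketch; the schema and data are
//! made up):
//!
//! ```rust
//! # use std::sync::Arc;
//! # use datafusion::arrow::array::Int32Array;
//! # use datafusion::arrow::datatypes::{DataType, Field, Schema};
//! # use datafusion::arrow::record_batch::RecordBatch;
//! # use datafusion::logical_expr::Operator;
//! # use datafusion::physical_plan::expressions::{col, BinaryExpr, Literal};
//! # use datafusion::physical_plan::PhysicalExpr;
//! # use datafusion::scalar::ScalarValue;
//! # fn main() -> datafusion::error::Result<()> {
//! let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
//! let batch = RecordBatch::try_new(
//!     schema.clone(),
//!     vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
//! )?;
//! // the physical expression `a + 1`
//! let expr = BinaryExpr::new(
//!     col("a", &schema)?,
//!     Operator::Plus,
//!     Arc::new(Literal::new(ScalarValue::Int32(Some(1)))),
//! );
//! // evaluating yields a ColumnarValue wrapping the array [2, 3, 4]
//! let _value = expr.evaluate(&batch)?;
//! # Ok(())
//! # }
//! ```
//!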
@@ -173,39 +248,24 @@
//! * Scan from memory: [`MemoryExec`](physical_plan::memory::MemoryExec)
//! * Explain the plan: [`ExplainExec`](physical_plan::explain::ExplainExec)
//!
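//! For example, executing a plan over an in-memory table via
//! [`MemoryExec`](physical_plan::memory::MemoryExec) (a sketch with a
//! single made-up batch):
//!
//! ```rust
//! # use std::sync::Arc;
//! # use datafusion::arrow::array::Int32Array;
//! # use datafusion::arrow::datatypes::{DataType, Field, Schema};
//! # use datafusion::arrow::record_batch::RecordBatch;
//! # use datafusion::physical_plan::collect;
//! # use datafusion::physical_plan::memory::MemoryExec;
//! # use datafusion::prelude::*;
//! # #[tokio::main]
//! # async fn main() -> datafusion::error::Result<()> {
//! let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
//! let batch = RecordBatch::try_new(
//!     schema.clone(),
//!     vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
//! )?;
//! // one partition containing one batch
//! let exec = MemoryExec::try_new(&[vec![batch]], schema, None)?;
//! let ctx = SessionContext::new();
//! let _batches = collect(Arc::new(exec), ctx.task_ctx()).await?;
//! # Ok(())
//! # }
//! ```
//!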
//! ## Customize
//!
//! DataFusion allows users to
//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`](execution::context::QueryPlanner))
//! * declare and use user-defined scalar functions ([`ScalarUDF`](physical_plan::udf::ScalarUDF))
//! * declare and use user-defined aggregate functions ([`AggregateUDF`](physical_plan::udaf::AggregateUDF))
//!
//! You can find examples of each of them in examples section.
//!
//! ## Examples
//!
//! Examples are located in [datafusion-examples directory](https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples)
> **Contributor Author:** I moved the link to the other examples up with the other examples, and then moved the details on running closer to the examples themselves in the datafusion-examples directory
//!
//! Here's how to run them
//!
//! ```bash
//! git clone https://github.com/apache/arrow-datafusion
//! cd arrow-datafusion
//! # Download test data
//! git submodule update --init
//!
//! cargo run --example csv_sql
//!
//! cargo run --example parquet_sql
//!
//! cargo run --example dataframe
//!
//! cargo run --example dataframe_in_memory
//!
//! cargo run --example simple_udaf
//!
//! cargo run --example simple_udf
//! ```
//! Future topics (coming soon):
//! * Analyzer Rules
//! * Resource management (memory and disk)
//!
//! [sqlparser]: https://docs.rs/sqlparser/latest/sqlparser
//! [`SqlToRel`]: sql::planner::SqlToRel
//! [`Expr`]: datafusion_expr::Expr
//! [`LogicalPlan`]: datafusion_expr::LogicalPlan
//! [`OptimizerRule`]: optimizer::optimizer::OptimizerRule
//! [`ExecutionPlan`]: physical_plan::ExecutionPlan
//! [`PhysicalPlanner`]: physical_plan::PhysicalPlanner
//! [`Schema`]: arrow::datatypes::Schema
//! [`datafusion_expr`]: datafusion_expr
//! [`PhysicalExpr`]: physical_plan::PhysicalExpr
//! [`AggregateExpr`]: physical_plan::AggregateExpr
//! [`RecordBatch`]: arrow::record_batch::RecordBatch
//! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader
//! [`Array`]: arrow::array::Array
/// DataFusion crate version
pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
2 changes: 1 addition & 1 deletion datafusion/expr/src/operator.rs
@@ -113,7 +113,7 @@ impl Operator {
/// Return true if the operator is a numerical operator.
///
/// For example, 'Binary(a, +, b)' would be a numerical expression.
/// PostgresSQL concept: https://www.postgresql.org/docs/7.0/operators2198.htm
/// PostgreSQL concept: <https://www.postgresql.org/docs/7.0/operators2198.htm>
pub fn is_numerical_operators(&self) -> bool {
matches!(
self,
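A quick illustration of this predicate (`Operator` is in the `datafusion_expr` crate):

```rust
use datafusion_expr::Operator;

fn main() {
    // arithmetic operators are "numerical"
    assert!(Operator::Plus.is_numerical_operators());
    assert!(Operator::Multiply.is_numerical_operators());
    // boolean and comparison operators are not
    assert!(!Operator::And.is_numerical_operators());
}
```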
2 changes: 1 addition & 1 deletion datafusion/physical-expr/src/sort_expr.rs
@@ -195,7 +195,7 @@ impl PhysicalSortRequirement {
///
/// This function converts `PhysicalSortRequirement` to `PhysicalSortExpr`
/// for each entry in the input. If the required ordering is None for an entry,
/// default ordering `ASC, NULLS LAST` if given (see [`into_sort_expr`])
/// the default ordering `ASC, NULLS LAST` is used (see [`Self::into_sort_expr`]).
pub fn to_sort_exprs(
requirements: impl IntoIterator<Item = PhysicalSortRequirement>,
) -> Vec<PhysicalSortExpr> {
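A sketch of the conversion and its default ordering; this assumes the `datafusion::physical_expr` re-export and a made-up one-column schema:

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::error::Result;
use datafusion::physical_expr::{PhysicalSortExpr, PhysicalSortRequirement};
use datafusion::physical_plan::expressions::col;

fn main() -> Result<()> {
    let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]);
    // no explicit options: the requirement only asks for *some* ordering on `a`
    let requirement = PhysicalSortRequirement::new(col("a", &schema)?, None);
    // the conversion fills in the default: ASC, NULLS LAST
    let sort_exprs: Vec<PhysicalSortExpr> =
        PhysicalSortRequirement::to_sort_exprs(vec![requirement]);
    assert!(!sort_exprs[0].options.descending);
    assert!(!sort_exprs[0].options.nulls_first);
    Ok(())
}
```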
2 changes: 1 addition & 1 deletion datafusion/sql/src/planner.rs
@@ -80,7 +80,7 @@ pub struct PlannerContext {
/// in `PREPARE` statement
prepare_param_data_types: Vec<DataType>,
/// Map of CTE name to logical plan of the WITH clause.
/// Use Arc<LogicalPlan> to allow cheap cloning
/// Use `Arc<LogicalPlan>` to allow cheap cloning
ctes: HashMap<String, Arc<LogicalPlan>>,
/// The query schema of the outer query plan, used to resolve columns in subqueries
outer_query_schema: Option<DFSchema>,
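The `Arc<LogicalPlan>` makes each reuse of a CTE an O(1) pointer copy rather than a deep copy of the plan tree. A minimal illustration with a stand-in type (not the real `LogicalPlan`):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// stand-in for datafusion_expr::LogicalPlan
struct Plan {
    description: String,
}

fn main() {
    let mut ctes: HashMap<String, Arc<Plan>> = HashMap::new();
    ctes.insert(
        "t".to_string(),
        Arc::new(Plan { description: "WITH t AS (...)".to_string() }),
    );

    // each reference to the CTE bumps a refcount; the plan is not deep-copied
    let first_use = Arc::clone(&ctes["t"]);
    let second_use = Arc::clone(&ctes["t"]);
    assert_eq!(Arc::strong_count(&ctes["t"]), 3);
    assert_eq!(first_use.description, second_use.description);
}
```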
13 changes: 4 additions & 9 deletions docs/source/contributor-guide/architecture.md
@@ -19,13 +19,8 @@

# Architecture

There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
DataFusion's code structure and organization are described in the
[Crate Documentation], to keep it as close to the source as
possible.

- [Apr 2023]: The Apache Arrow DataFusion Architecture talks series by @alamb
> **Contributor Author:** Moved to the main library docs
- _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
- _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30/edit#slide=id.gbe21b752a6_0_218)
- _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg/edit?usp=sharing)
- [February 2021]: How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
- [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
- [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
[crate documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization