
Commit e9edd0c

Improve contributor guide (#5921)
Move some content into the code organization section
1 parent 570a1cf commit e9edd0c

File tree

8 files changed: +169 -92 lines changed


datafusion-examples/README.md

Lines changed: 16 additions & 1 deletion
@@ -21,10 +21,25 @@
 
 This crate includes several examples of how to use various DataFusion APIs and help you on your way.
 
-Prerequisites:
+## Prerequisites:
 
 Run `git submodule update --init` to init test files.
 
+## Running Examples
+
+To run the examples, use the `cargo run` command, such as:
+
+```bash
+git clone https://github.com/apache/arrow-datafusion
+cd arrow-datafusion
+# Download test data
+git submodule update --init
+
+# Run the `csv_sql` example:
+# ... use the equivalent for other examples
+cargo run --example csv_sql
+```
+
 ## Single Process
 
 - [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL statement against a local AVRO file

datafusion/common/src/tree_node.rs

Lines changed: 2 additions & 2 deletions
@@ -289,7 +289,7 @@ impl<T> Transformed<T> {
 /// Helper trait for implementing [`TreeNode`] that have children stored as Arc's
 ///
 /// If some trait object, such as `dyn T`, implements this trait,
-/// its related Arc<dyn T> will automatically implement [`TreeNode`]
+/// its related `Arc<dyn T>` will automatically implement [`TreeNode`]
 pub trait DynTreeNode {
     /// Returns all children of the specified TreeNode
     fn arc_children(&self) -> Vec<Arc<Self>>;
@@ -303,7 +303,7 @@ pub trait DynTreeNode {
 }
 
 /// Blanket implementation for Arc for any type that implements
-/// [`DynTreeNode`] (such as Arc<dyn PhysicalExpr>)
+/// [`DynTreeNode`] (such as [`Arc<dyn PhysicalExpr>`])
 impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
     fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion>
     where
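
The blanket implementation documented above is a useful Rust pattern in its own right. Below is a self-contained sketch of the idea with toy `DynNode`/`Walk` traits (hypothetical names, not DataFusion's actual `DynTreeNode`/`TreeNode` definitions): implementing the object-safe trait once gives every `Arc<dyn ...>` the tree-walking API for free.

```rust
use std::sync::Arc;

// Toy stand-in for `DynTreeNode`: object-safe, children stored as Arcs.
trait DynNode {
    fn children(&self) -> Vec<Arc<dyn DynNode>>;
    fn name(&self) -> &str;
}

// Toy stand-in for `TreeNode`: a generic tree-walking API.
trait Walk {
    fn visit(&self, f: &mut dyn FnMut(&str));
}

// Blanket impl: any `Arc<T>` for `T: DynNode + ?Sized` (including
// `Arc<dyn DynNode>`) automatically gets the walking API.
impl<T: DynNode + ?Sized> Walk for Arc<T> {
    fn visit(&self, f: &mut dyn FnMut(&str)) {
        f(self.name());
        for child in self.children() {
            child.visit(f);
        }
    }
}

struct Leaf(&'static str);
impl DynNode for Leaf {
    fn children(&self) -> Vec<Arc<dyn DynNode>> {
        vec![]
    }
    fn name(&self) -> &str {
        self.0
    }
}

fn main() {
    let node: Arc<dyn DynNode> = Arc::new(Leaf("expr"));
    // The trait object gets `visit` without any per-type impl.
    node.visit(&mut |name| println!("{name}"));
}
```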

datafusion/core/src/lib.rs

Lines changed: 132 additions & 72 deletions
@@ -16,16 +16,37 @@
 // under the License.
 #![warn(missing_docs, clippy::needless_borrow)]
 
-//! [DataFusion](https://github.com/apache/arrow-datafusion)
-//! is an extensible query execution framework that uses
-//! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
+//! [DataFusion] is an extensible query engine written in Rust that
+//! uses [Apache Arrow] as its in-memory format. DataFusion's [use
+//! cases] include building very fast database and analytic systems,
+//! customized to particular workloads.
 //!
-//! DataFusion supports both an SQL and a DataFrame API for building logical query plans
-//! as well as a query optimizer and execution engine capable of parallel execution
-//! against partitioned data sources (CSV and Parquet) using threads.
+//! "Out of the box," DataFusion quickly runs complex [SQL] and
+//! [`DataFrame`] queries using a sophisticated query planner, a columnar,
+//! multi-threaded, vectorized execution engine, and partitioned data
+//! sources (Parquet, CSV, JSON, and Avro).
 //!
-//! Below is an example of how to execute a query against data stored
-//! in a CSV file using a [`DataFrame`](dataframe::DataFrame):
+//! DataFusion can also be easily customized to support additional
+//! data sources, query languages, functions, custom operators and
+//! more.
+//!
+//! [DataFusion]: https://arrow.apache.org/datafusion/
+//! [Apache Arrow]: https://arrow.apache.org
+//! [use cases]: https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases
+//! [SQL]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+//! [`DataFrame`]: dataframe::DataFrame
+//!
+//! # Examples
+//!
+//! The main entry point for interacting with DataFusion is the
+//! [`SessionContext`].
+//!
+//! [`SessionContext`]: execution::context::SessionContext
+//!
+//! ## DataFrame
+//!
+//! To execute a query against data stored
+//! in a CSV file using a [`DataFrame`]:
 //!
 //! ```rust
 //! # use datafusion::prelude::*;
@@ -64,7 +85,9 @@
 //! # }
 //! ```
 //!
-//! and how to execute a query against a CSV using SQL:
+//! ## SQL
+//!
+//! To execute a query against a CSV file using [SQL]:
 //!
 //! ```
 //! # use datafusion::prelude::*;
@@ -100,57 +123,109 @@
 //! # }
 //! ```
 //!
-//! ## Parse, Plan, Optimize, Execute
+//! ## More Examples
+//!
+//! There are many additional annotated examples of using DataFusion in the [datafusion-examples] directory.
+//!
+//! [datafusion-examples]: https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples
+//!
+//! ## Customization and Extension
+//!
+//! DataFusion supports extension at many points:
+//!
+//! * read from any datasource ([`TableProvider`])
+//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`])
+//! * build your own query language or plans using the [`LogicalPlanBuilder`]
+//! * declare and use user-defined scalar functions ([`ScalarUDF`])
+//! * declare and use user-defined aggregate functions ([`AggregateUDF`])
+//! * add custom optimizer rewrite passes ([`OptimizerRule`] and [`PhysicalOptimizerRule`])
+//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`])
+//!
+//! You can find examples of each of them in the [datafusion-examples] directory.
+//!
+//! [`TableProvider`]: crate::datasource::TableProvider
+//! [`CatalogProvider`]: crate::catalog::catalog::CatalogProvider
+//! [`LogicalPlanBuilder`]: datafusion_expr::logical_plan::builder::LogicalPlanBuilder
+//! [`ScalarUDF`]: physical_plan::udf::ScalarUDF
+//! [`AggregateUDF`]: physical_plan::udaf::AggregateUDF
+//! [`QueryPlanner`]: execution::context::QueryPlanner
+//! [`OptimizerRule`]: datafusion_optimizer::optimizer::OptimizerRule
+//! [`PhysicalOptimizerRule`]: datafusion::physical_optimizer::optimizer::PhysicalOptimizerRule
+//!
+//! # Code Organization
+//!
+//! ## Overview Presentations
+//!
+//! The following presentations offer high level overviews of the
+//! different components and how they interact together.
+//!
+//! - [Apr 2023]: The Apache Arrow DataFusion Architecture talks
+//!   - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+//!   - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+//!   - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+//! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
+//! - [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+//! - [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+//!
+//! ## Architecture
 //!
 //! DataFusion is a fully fledged query engine capable of performing complex operations.
 //! Specifically, when DataFusion receives an SQL query, there are different steps
 //! that it passes through until a result is obtained. Broadly, they are:
 //!
-//! 1. The string is parsed to an Abstract syntax tree (AST) using [sqlparser](https://docs.rs/sqlparser/latest/sqlparser/).
-//! 2. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical expressions on the AST to logical expressions [`Expr`s](datafusion_expr::Expr).
-//! 3. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical nodes on the AST to a [`LogicalPlan`](datafusion_expr::LogicalPlan).
-//! 4. [`OptimizerRules`](optimizer::optimizer::OptimizerRule) are applied to the [`LogicalPlan`](datafusion_expr::LogicalPlan) to optimize it.
-//! 5. The [`LogicalPlan`](datafusion_expr::LogicalPlan) is converted to an [`ExecutionPlan`](physical_plan::ExecutionPlan) by a [`PhysicalPlanner`](physical_plan::PhysicalPlanner)
-//! 6. The [`ExecutionPlan`](physical_plan::ExecutionPlan) is executed against data through the [`SessionContext`](execution::context::SessionContext)
+//! 1. The string is parsed to an Abstract Syntax Tree (AST) using [sqlparser].
+//! 2. The planner [`SqlToRel`] converts logical expressions in the AST to logical expressions ([`Expr`]s).
+//! 3. The planner [`SqlToRel`] converts logical nodes in the AST to a [`LogicalPlan`].
+//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
+//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a [`PhysicalPlanner`].
+//! 6. The [`ExecutionPlan`] is executed against data through the [`SessionContext`].
 //!
-//! With a [`DataFrame`](dataframe::DataFrame) API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`](datafusion_expr::LogicalPlan) directly.
+//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`] directly.
 //!
 //! Phases 1-5 are typically cheap when compared to phase 6, and thus DataFusion puts a
 //! lot of effort to ensure that phase 6 runs efficiently and without errors.
 //!
 //! DataFusion's planning is divided in two main parts: logical planning and physical planning.
 //!
-//! ### Logical plan
+//! ### Logical planning
 //!
-//! Logical planning yields [`logical plans`](datafusion_expr::LogicalPlan) and [`logical expressions`](datafusion_expr::Expr).
-//! These are [`Schema`](arrow::datatypes::Schema)-aware traits that represent statements whose result is independent of how it should physically be executed.
+//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
+//! expressions, which are [`Schema`]-aware and represent statements
+//! whose result is independent of how it should physically be
+//! executed.
 //!
-//! A [`LogicalPlan`](datafusion_expr::LogicalPlan) is a Directed Acyclic Graph (DAG) of other [`LogicalPlan`s](datafusion_expr::LogicalPlan) and each node contains logical expressions ([`Expr`s](logical_expr::Expr)).
-//! All of these are located in [`datafusion_expr`](datafusion_expr).
+//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
+//! [`LogicalPlan`]s, and each node contains [`Expr`]s. All of these
+//! are located in the [`datafusion_expr`] module.
 //!
-//! ### Physical plan
+//! ### Physical planning
 //!
-//! A Physical plan ([`ExecutionPlan`](physical_plan::ExecutionPlan)) is a plan that can be executed against data.
-//! Contrarily to a logical plan, the physical plan has concrete information about how the calculation
-//! should be performed (e.g. what Rust functions are used) and how data should be loaded into memory.
+//! An [`ExecutionPlan`] (sometimes referred to as a "physical plan")
+//! is a plan that can be executed against data. Compared to a
+//! logical plan, the physical plan has concrete information about how
+//! calculations should be performed (e.g. what Rust functions are
+//! used) and how data should be loaded into memory.
 //!
-//! [`ExecutionPlan`](physical_plan::ExecutionPlan) uses the Arrow format as its in-memory representation of data, through the [arrow] crate.
-//! We recommend going through [its documentation](arrow) for details on how the data is physically represented.
+//! [`ExecutionPlan`]s use the [Apache Arrow] format as their in-memory
+//! representation of data, through the [arrow] crate. The [arrow]
+//! crate documents how the memory is physically represented.
 //!
-//! A [`ExecutionPlan`](physical_plan::ExecutionPlan) is composed by nodes (implement the trait [`ExecutionPlan`](physical_plan::ExecutionPlan)),
-//! and each node is composed by physical expressions ([`PhysicalExpr`](physical_plan::PhysicalExpr))
-//! or aggreagate expressions ([`AggregateExpr`](physical_plan::AggregateExpr)).
-//! All of these are located in the module [`physical_plan`](physical_plan).
+//! An [`ExecutionPlan`] is composed of nodes (which each implement the
+//! [`ExecutionPlan`] trait). Each node can contain physical
+//! expressions ([`PhysicalExpr`]) or aggregate expressions
+//! ([`AggregateExpr`]). All of these are located in the
+//! [`physical_plan`] module.
 //!
 //! Broadly speaking,
 //!
-//! * an [`ExecutionPlan`](physical_plan::ExecutionPlan) receives a partition number and asynchronously returns
-//!   an iterator over [`RecordBatch`](arrow::record_batch::RecordBatch)
-//!   (a node-specific struct that implements [`RecordBatchReader`](arrow::record_batch::RecordBatchReader))
-//! * a [`PhysicalExpr`](physical_plan::PhysicalExpr) receives a [`RecordBatch`](arrow::record_batch::RecordBatch)
-//!   and returns an [`Array`](arrow::array::Array)
-//! * an [`AggregateExpr`](physical_plan::AggregateExpr) receives [`RecordBatch`es](arrow::record_batch::RecordBatch)
-//!   and returns a [`RecordBatch`](arrow::record_batch::RecordBatch) of a single row(*)
+//! * an [`ExecutionPlan`] receives a partition number and
+//!   asynchronously returns an iterator over [`RecordBatch`] (a
+//!   node-specific struct that implements [`RecordBatchReader`])
+//! * a [`PhysicalExpr`] receives a [`RecordBatch`]
+//!   and returns an [`Array`]
+//! * an [`AggregateExpr`] receives a series of [`RecordBatch`]es
+//!   and returns a [`RecordBatch`] of a single row(*)
 //!
 //! (*) Technically, it aggregates the results on each partition and then merges the results into a single partition.
 //!
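
As a sketch of how the six steps above look when driven explicitly rather than through `ctx.sql(...)`: the following minimal example walks parse, plan, optimize, and execute by hand. It is hedged, not definitive: the `SessionState` method names (`create_logical_plan`, `optimize`, `create_physical_plan`) and the `tests/data/example.csv` sample path reflect DataFusion of roughly this era and may differ in other versions.

```rust
use datafusion::error::Result;
use datafusion::physical_plan::collect;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new())
        .await?;

    let state = ctx.state();
    // Steps 1-3: parse the SQL string and plan it into a LogicalPlan
    let logical_plan = state
        .create_logical_plan("SELECT a, MIN(b) FROM example GROUP BY a")
        .await?;
    // Step 4: apply OptimizerRules (e.g. predicate pushdown)
    let optimized_plan = state.optimize(&logical_plan)?;
    // Step 5: convert the LogicalPlan to an ExecutionPlan
    let physical_plan = state.create_physical_plan(&optimized_plan).await?;
    // Step 6: execute, collecting all result RecordBatches
    let batches = collect(physical_plan, ctx.task_ctx()).await?;
    println!("{} batch(es)", batches.len());
    Ok(())
}
```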
@@ -173,39 +248,24 @@
 //! * Scan from memory: [`MemoryExec`](physical_plan::memory::MemoryExec)
 //! * Explain the plan: [`ExplainExec`](physical_plan::explain::ExplainExec)
 //!
-//! ## Customize
-//!
-//! DataFusion allows users to
-//! * extend the planner to use user-defined logical and physical nodes ([`QueryPlanner`](execution::context::QueryPlanner))
-//! * declare and use user-defined scalar functions ([`ScalarUDF`](physical_plan::udf::ScalarUDF))
-//! * declare and use user-defined aggregate functions ([`AggregateUDF`](physical_plan::udaf::AggregateUDF))
-//!
-//! You can find examples of each of them in examples section.
-//!
-//! ## Examples
-//!
-//! Examples are located in [datafusion-examples directory](https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples)
-//!
-//! Here's how to run them
-//!
-//! ```bash
-//! git clone https://github.com/apache/arrow-datafusion
-//! cd arrow-datafusion
-//! # Download test data
-//! git submodule update --init
-//!
-//! cargo run --example csv_sql
-//!
-//! cargo run --example parquet_sql
-//!
-//! cargo run --example dataframe
-//!
-//! cargo run --example dataframe_in_memory
-//!
-//! cargo run --example simple_udaf
-//!
-//! cargo run --example simple_udf
-//! ```
+//! Future topics (coming soon):
+//! * Analyzer Rules
+//! * Resource management (memory and disk)
+//!
+//! [sqlparser]: https://docs.rs/sqlparser/latest/sqlparser
+//! [`SqlToRel`]: sql::planner::SqlToRel
+//! [`Expr`]: datafusion_expr::Expr
+//! [`LogicalPlan`]: datafusion_expr::LogicalPlan
+//! [`OptimizerRule`]: optimizer::optimizer::OptimizerRule
+//! [`ExecutionPlan`]: physical_plan::ExecutionPlan
+//! [`PhysicalPlanner`]: physical_plan::PhysicalPlanner
+//! [`Schema`]: arrow::datatypes::Schema
+//! [`datafusion_expr`]: datafusion_expr
+//! [`PhysicalExpr`]: physical_plan::PhysicalExpr
+//! [`AggregateExpr`]: physical_plan::AggregateExpr
+//! [`RecordBatch`]: arrow::record_batch::RecordBatch
+//! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader
+//! [`Array`]: arrow::array::Array
 
 /// DataFusion crate version
 pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
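
To make the doc changes above concrete: the `DataFrame` example elided in this diff boils down to something like the following minimal sketch (it assumes the repository's `tests/data/example.csv` sample file, downloaded via the git submodule step described earlier):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // The SessionContext is the main entry point mentioned in the docs
    let ctx = SessionContext::new();
    // Read the CSV, then express the query with DataFrame methods
    let df = ctx
        .read_csv("tests/data/example.csv", CsvReadOptions::new())
        .await?;
    let df = df
        .filter(col("a").lt_eq(col("b")))?
        .aggregate(vec![col("a")], vec![min(col("b"))])?;
    // Execute the plan and print results to stdout
    df.show().await?;
    Ok(())
}
```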

datafusion/expr/src/operator.rs

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ impl Operator {
 /// Return true if the operator is a numerical operator.
 ///
 /// For example, 'Binary(a, +, b)' would be a numerical expression.
-/// PostgresSQL concept: https://www.postgresql.org/docs/7.0/operators2198.htm
+/// PostgreSQL concept: <https://www.postgresql.org/docs/7.0/operators2198.htm>
 pub fn is_numerical_operators(&self) -> bool {
     matches!(
         self,
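
The context here shows `is_numerical_operators` built on the `matches!` macro. A generic illustration of that enum-predicate pattern, using a hypothetical cut-down `Op` enum rather than DataFusion's full `Operator`:

```rust
#[derive(Debug)]
enum Op {
    Plus,
    Minus,
    Multiply,
    Divide,
    Eq,
    And,
}

impl Op {
    /// Return true if the operator is a numerical operator,
    /// mirroring the `matches!`-based predicate shown above.
    fn is_numerical(&self) -> bool {
        matches!(self, Op::Plus | Op::Minus | Op::Multiply | Op::Divide)
    }
}

fn main() {
    assert!(Op::Plus.is_numerical());
    assert!(!Op::Eq.is_numerical());
}
```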

datafusion/physical-expr/src/sort_expr.rs

Lines changed: 1 addition & 1 deletion
@@ -195,7 +195,7 @@ impl PhysicalSortRequirement {
 ///
 /// This function converts `PhysicalSortRequirement` to `PhysicalSortExpr`
 /// for each entry in the input. If required ordering is None for an entry,
-/// the default ordering `ASC, NULLS LAST` is used (see [`into_sort_expr`])
+/// the default ordering `ASC, NULLS LAST` is used (see [`Self::into_sort_expr`])
 pub fn to_sort_exprs(
     requirements: impl IntoIterator<Item = PhysicalSortRequirement>,
 ) -> Vec<PhysicalSortExpr> {
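
The defaulting behavior described in that comment can be pictured with a self-contained toy model; the types below are simplified stand-ins for `PhysicalSortRequirement` and `PhysicalSortExpr`, not the real definitions:

```rust
/// Simplified stand-in for arrow's `SortOptions`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Toy requirement: ordering options may be left unspecified.
struct SortRequirement {
    column: String,
    options: Option<SortOptions>,
}

/// Toy concrete sort expression: options are always known.
struct SortExpr {
    column: String,
    options: SortOptions,
}

/// Convert requirements to concrete sort expressions, filling in
/// `ASC, NULLS LAST` (not descending, nulls last) when unspecified.
fn to_sort_exprs(reqs: impl IntoIterator<Item = SortRequirement>) -> Vec<SortExpr> {
    reqs.into_iter()
        .map(|r| SortExpr {
            column: r.column,
            options: r.options.unwrap_or(SortOptions {
                descending: false,
                nulls_first: false,
            }),
        })
        .collect()
}

fn main() {
    let exprs = to_sort_exprs([SortRequirement {
        column: "a".to_string(),
        options: None,
    }]);
    // The unspecified requirement picked up the ASC, NULLS LAST default
    assert_eq!(
        exprs[0].options,
        SortOptions { descending: false, nulls_first: false }
    );
}
```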

datafusion/sql/src/planner.rs

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ pub struct PlannerContext {
 /// in `PREPARE` statement
 prepare_param_data_types: Vec<DataType>,
 /// Map of CTE name to logical plan of the WITH clause.
-/// Use Arc<LogicalPlan> to allow cheap cloning
+/// Use `Arc<LogicalPlan>` to allow cheap cloning
 ctes: HashMap<String, Arc<LogicalPlan>>,
 /// The query schema of the outer query plan, used to resolve the columns in subquery
 outer_query_schema: Option<DFSchema>,
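
The "cheap cloning" note refers to `Arc`'s reference-counted sharing: cloning an `Arc<LogicalPlan>` copies a pointer and bumps a counter instead of deep-copying the plan tree. A tiny `std`-only illustration of that design choice:

```rust
use std::sync::Arc;

fn main() {
    // Stand-in for a large plan tree that would be expensive to deep-copy
    let plan = Arc::new(vec!["Projection", "Filter", "TableScan"]);

    // `Arc::clone` copies a pointer and increments the refcount;
    // both handles share the same underlying allocation.
    let shared = Arc::clone(&plan);
    assert!(Arc::ptr_eq(&plan, &shared));
    assert_eq!(Arc::strong_count(&plan), 2);
}
```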

docs/source/contributor-guide/architecture.md

Lines changed: 4 additions & 9 deletions
@@ -19,13 +19,8 @@
 
 # Architecture
 
-There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
+DataFusion's code structure and organization are described in the
+[Crate Documentation], to keep it as close to the source as
+possible.
 
-- [Apr 2023]: The Apache Arrow DataFusion Architecture talks series by @alamb
-  - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
-  - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30/edit#slide=id.gbe21b752a6_0_218)
-  - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg/edit?usp=sharing)
-- [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
-- [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+[crate documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
