DataFusion is an in-memory query engine that uses Apache Arrow as its memory model. It supports executing SQL queries against CSV and Parquet files, as well as querying in-memory data directly.
DataFusion can be used as a library by adding the following to your `Cargo.toml` file:

```toml
[dependencies]
datafusion = "2.0.0-SNAPSHOT"
```
DataFusion includes a simple command-line interactive SQL utility. See the CLI reference for more information.
- SQL Parser
- SQL Query Planner
- Query Optimizer
  - Projection push down
  - Predicate push down
  - Type coercion
- Parallel query execution
- Projection
- Filter (WHERE)
- Limit
- Aggregate
- UDFs (user-defined functions)
- UDAFs (user-defined aggregate functions)
- Common math functions
- String functions
  - Length
  - Concatenate
- Common date/time functions
  - Basic date functions
  - Basic time functions
  - Basic timestamp functions
- Nested functions
  - Array of columns
- Sorting
- Nested types
  - Lists
- Subqueries
- Joins
- CSV
- Parquet primitive types
- Parquet nested types
This library currently supports the following SQL constructs:

- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
- `SELECT ... FROM ...` together with any expression
- `ALIAS` to name an expression
- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
- most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`
- `WHERE` to filter
- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
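For illustration, the constructs above compose into queries such as the following (`example` is a hypothetical registered table, not part of DataFusion):

```sql
SELECT c1, MIN(CAST(c2 AS float)) AS min_c2
FROM example
WHERE c3 >= 0 AND sqrt(c4) > 1
GROUP BY c1
ORDER BY min_c2 DESC NULLS FIRST;
```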
DataFusion uses Arrow, and thus the Arrow type system, for query execution. The SQL types from sqlparser-rs are mapped to Arrow types according to the following table:

| SQL Data Type | Arrow DataType |
| ------------- | -------------- |
| `CHAR` | `Utf8` |
| `VARCHAR` | `Utf8` |
| `UUID` | Not yet supported |
| `CLOB` | Not yet supported |
| `BINARY` | Not yet supported |
| `VARBINARY` | Not yet supported |
| `DECIMAL` | `Float64` |
| `FLOAT` | `Float32` |
| `SMALLINT` | `Int16` |
| `INT` | `Int32` |
| `BIGINT` | `Int64` |
| `REAL` | `Float64` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `DATE` | `Date64(DateUnit::Day)` |
| `TIME` | `Time64(TimeUnit::Millisecond)` |
| `TIMESTAMP` | `Date64(DateUnit::Millisecond)` |
| `INTERVAL` | Not yet supported |
| `REGCLASS` | Not yet supported |
| `TEXT` | Not yet supported |
| `BYTEA` | Not yet supported |
| `CUSTOM` | Not yet supported |
| `ARRAY` | Not yet supported |
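The mapping above can be sketched as a simple lookup. This is a self-contained model for illustration only: `ArrowType` and `sql_to_arrow` below are stand-ins, not DataFusion's actual API, and the sketch drops the date/time unit parameters (`DateUnit::Day`, `TimeUnit::Millisecond`).

```rust
// Simplified stand-in for Arrow's `DataType` enum.
#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,
    Int16,
    Int32,
    Int64,
    Float32,
    Float64,
    Boolean,
    Date64, // unit parameter omitted in this sketch
    Time64, // unit parameter omitted in this sketch
}

// Maps a SQL type name to its Arrow counterpart per the table above;
// `None` stands for "not yet supported".
fn sql_to_arrow(sql_type: &str) -> Option<ArrowType> {
    match sql_type {
        "CHAR" | "VARCHAR" => Some(ArrowType::Utf8),
        "SMALLINT" => Some(ArrowType::Int16),
        "INT" => Some(ArrowType::Int32),
        "BIGINT" => Some(ArrowType::Int64),
        "FLOAT" => Some(ArrowType::Float32),
        "DECIMAL" | "REAL" | "DOUBLE" => Some(ArrowType::Float64),
        "BOOLEAN" => Some(ArrowType::Boolean),
        "DATE" | "TIMESTAMP" => Some(ArrowType::Date64),
        "TIME" => Some(ArrowType::Time64),
        // UUID, CLOB, BINARY, INTERVAL, ... are not yet supported.
        _ => None,
    }
}

fn main() {
    assert_eq!(sql_to_arrow("INT"), Some(ArrowType::Int32));
    assert_eq!(sql_to_arrow("UUID"), None);
}
```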
This section describes how you can get started developing DataFusion.
DataFusion's memory layout is Arrow's columnar format. Because a column's type is only known at runtime, DataFusion, like the Arrow crate, uses dynamic typing throughout most of its code. Thus, a central aspect of DataFusion's query engine is keeping track of an expression's *datatype*.

Arrow's columnar format natively supports the notion of null values, and so does DataFusion. Like types, DataFusion keeps track of an expression's *nullability* throughout planning and execution.
Arrow's Rust implementation has a `Field` that contains information about a column:

- name
- datatype
- nullability

A `Schema` is essentially a vector of fields.
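As a mental model, this can be sketched with plain structs. Note these are simplified stand-ins: Arrow's real `Field` and `Schema` live in the `arrow` crate and carry more information (e.g. metadata).

```rust
// Simplified model of Arrow's Field: a column's metadata.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    datatype: String, // stands in for Arrow's `DataType` enum
    nullable: bool,
}

// Simplified model of Arrow's Schema: essentially a vector of fields.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    // Look up a column's metadata by name, as planners do constantly.
    fn field(&self, name: &str) -> Option<&Field> {
        self.fields.iter().find(|f| f.name == name)
    }
}

fn main() {
    let schema = Schema {
        fields: vec![
            Field { name: "id".into(), datatype: "Int32".into(), nullable: false },
            Field { name: "note".into(), datatype: "Utf8".into(), nullable: true },
        ],
    };
    let note = schema.field("note").unwrap();
    assert!(note.nullable);
    assert_eq!(note.datatype, "Utf8");
}
```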
When a query is sent to DataFusion, it passes through a number of steps before a result is obtained. Broadly, they are:

1. The string is parsed to an abstract syntax tree (AST). We use sqlparser for this.
2. The AST is converted to a logical plan (`src/sql`).
3. The logical plan is optimized to a new logical plan (`src/optimizer`).
4. The logical plan is converted to a physical plan (`src/physical_plan/planner`).
5. The physical plan is executed (`src/execution/context.rs`).

Phases 1-4 are typically cheap/fast compared to phase 5, and thus DataFusion puts a lot of effort into ensuring that phase 5 runs without errors.
A logical plan is a representation of the plan without details of how it is executed. In general:

- given a data schema and a logical plan, the resulting schema is known.
- given data and a logical plan, the result is well defined, irrespective of how it is computed.

A logical plan is composed of nodes (called `LogicalPlan`), and each node is composed of logical expressions (called `Expr`). All of these are located in `src/logical_plan/mod.rs`.
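The two properties above can be illustrated with a toy plan. This is a deliberately tiny model for intuition only; DataFusion's real `LogicalPlan` and `Expr` enums are far richer.

```rust
// Toy logical plan: each node is a relational operation over its input.
enum LogicalPlan {
    // Scan a table with the given column names.
    Scan { columns: Vec<String> },
    // Keep only a subset of the input's columns.
    Projection { input: Box<LogicalPlan>, columns: Vec<String> },
    // Filter rows; the schema passes through unchanged.
    Filter { input: Box<LogicalPlan> },
}

// Given the input schema and the plan, the output schema is known
// without executing anything -- the first property above.
fn output_schema(plan: &LogicalPlan) -> Vec<String> {
    match plan {
        LogicalPlan::Scan { columns } => columns.clone(),
        LogicalPlan::Projection { columns, .. } => columns.clone(),
        LogicalPlan::Filter { input } => output_schema(input),
    }
}

fn main() {
    // SELECT a FROM t WHERE ...  as a plan tree.
    let plan = LogicalPlan::Projection {
        input: Box::new(LogicalPlan::Filter {
            input: Box::new(LogicalPlan::Scan {
                columns: vec!["a".into(), "b".into(), "c".into()],
            }),
        }),
        columns: vec!["a".into()],
    };
    assert_eq!(output_schema(&plan), vec!["a".to_string()]);
}
```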
A physical plan is a plan that can be executed. In contrast to a logical plan, the physical plan has specific information about how the calculation should be performed (e.g. which actual Rust functions are used).

A physical plan is composed of nodes (which implement the trait `ExecutionPlan`), and each node is composed of physical expressions (which implement the trait `PhysicalExpr`) or aggregate expressions (which implement the trait `AggregateExpr`). All of these are located in `src/physical_plan`.

Physical expressions are evaluated against `RecordBatch`es (a group of `Array`s and a `Schema`).
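A minimal sketch of that idea follows. The trait shape and the single-column "batch" are illustrative assumptions, not DataFusion's actual `PhysicalExpr` or Arrow's `RecordBatch`.

```rust
// Toy "record batch": a single Int64 column for simplicity.
struct RecordBatch {
    column: Vec<i64>,
}

// A physical expression knows how to compute itself against a batch.
trait PhysicalExpr {
    fn evaluate(&self, batch: &RecordBatch) -> Vec<i64>;
}

// Concrete expression: multiply the column by a literal.
struct MultiplyBy(i64);

impl PhysicalExpr for MultiplyBy {
    fn evaluate(&self, batch: &RecordBatch) -> Vec<i64> {
        batch.column.iter().map(|v| v * self.0).collect()
    }
}

fn main() {
    let batch = RecordBatch { column: vec![1, 2, 3] };
    let expr: Box<dyn PhysicalExpr> = Box::new(MultiplyBy(10));
    assert_eq!(expr.evaluate(&batch), vec![10, 20, 30]);
}
```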
DataFusion is written in Rust and uses the standard Rust toolkit:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test
- etc.
Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function.
- In `src/physical_plan/functions`, add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr` mapping the built-in to the implementation
  - tests to the function.
- In `tests/sql.rs`, add a new test where the function is called through SQL against well known data and returns the expected result.
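The checklist boils down to a registry pattern: an enum of built-ins, name parsing, and dispatch to an implementation. A self-contained sketch follows; the names echo DataFusion's, but the code is illustrative and not taken from the codebase.

```rust
use std::str::FromStr;

// Illustrative enum of built-in scalar functions (the "new variant" step).
#[derive(Debug, PartialEq)]
enum BuiltinScalarFunction {
    Sqrt,
    Abs,
}

// Parse the SQL-facing name into the enum (the `FromStr` step).
impl FromStr for BuiltinScalarFunction {
    type Err = String;
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(BuiltinScalarFunction::Sqrt),
            "abs" => Ok(BuiltinScalarFunction::Abs),
            other => Err(format!("unknown function: {}", other)),
        }
    }
}

// Map the built-in to its implementation (the `create_physical_expr` step).
fn implementation(f: &BuiltinScalarFunction) -> fn(f64) -> f64 {
    match f {
        BuiltinScalarFunction::Sqrt => f64::sqrt,
        BuiltinScalarFunction::Abs => f64::abs,
    }
}

fn main() {
    let f: BuiltinScalarFunction = "sqrt".parse().unwrap();
    assert_eq!(f, BuiltinScalarFunction::Sqrt);
    assert_eq!(implementation(&f)(9.0), 3.0);
}
```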
Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr`.
- In `src/physical_plan/aggregates`, add:
  - a new variant to `BuiltinAggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests to the function.
- In `tests/sql.rs`, add a new test where the function is called through SQL against well known data and returns the expected result.
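Conceptually, an `Accumulator` folds batches of values into a running state and then produces the final aggregate. The sketch below is a simplified stand-in: DataFusion's real trait works on Arrow arrays and also supports merging partial states from parallel partitions.

```rust
// Illustrative accumulator: ingests values batch by batch, then
// produces the final aggregate value.
trait Accumulator {
    fn update_batch(&mut self, values: &[i64]);
    fn evaluate(&self) -> i64;
}

// A SUM accumulator over Int64 values.
struct Sum {
    total: i64,
}

impl Accumulator for Sum {
    fn update_batch(&mut self, values: &[i64]) {
        self.total += values.iter().sum::<i64>();
    }
    fn evaluate(&self) -> i64 {
        self.total
    }
}

fn main() {
    let mut acc = Sum { total: 0 };
    acc.update_batch(&[1, 2, 3]);
    acc.update_batch(&[10]);
    assert_eq!(acc.evaluate(), 16);
}
```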