
DataFusion

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.

Using DataFusion as a library

DataFusion can be used as a library by adding the following to your Cargo.toml file.

[dependencies]
datafusion = "2.0.0-SNAPSHOT"

Using DataFusion as a binary

DataFusion includes a simple command-line interactive SQL utility. See the CLI reference for more information.

Status

General

  • SQL Parser
  • SQL Query Planner
  • Query Optimizer
  • Projection push down
  • Predicate push down
  • Type coercion
  • Parallel query execution

SQL Support

  • Projection
  • Filter (WHERE)
  • Limit
  • Aggregate
  • UDFs (user-defined functions)
  • UDAFs (user-defined aggregate functions)
  • Common math functions
  • String functions
    • Length
    • Concatenate
  • Common date/time functions
    • Basic date functions
    • Basic time functions
    • Basic timestamp functions
  • Nested functions
    • Array of columns
  • Sorting
  • Nested types
  • Lists
  • Subqueries
  • Joins

Data Sources

  • CSV
  • Parquet primitive types
  • Parquet nested types

Supported SQL

This library currently supports the following SQL constructs:

  • CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...'; to register a table's locations
  • SELECT ... FROM ... together with any expression
  • ALIAS to name an expression
  • CAST to change types, including e.g. Timestamp(Nanosecond, None)
  • Most mathematical unary and binary expressions, such as +, /, sqrt, tan, >=
  • WHERE to filter
  • GROUP BY together with one of the following aggregations: MIN, MAX, COUNT, SUM, AVG
  • ORDER BY together with an expression and optional ASC or DESC and also optional NULLS FIRST or NULLS LAST
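
To see several of these constructs together, here is a hedged sketch (the table name, file path, and columns are made up; the DDL mirrors the CREATE EXTERNAL TABLE form above, and the sql entry point is assumed to accept DDL as the CLI does):

use datafusion::error::Result;
use datafusion::prelude::*;

// Illustrative only: register a Parquet file, then run a query that
// combines WHERE, GROUP BY, CAST, an alias, and ORDER BY.
async fn example(ctx: &mut ExecutionContext) -> Result<()> {
    ctx.sql("CREATE EXTERNAL TABLE sales STORED AS PARQUET LOCATION 'sales.parquet'")?;
    let df = ctx.sql(
        "SELECT region, MIN(CAST(amount AS BIGINT)) AS min_amount \
         FROM sales \
         WHERE amount >= 10 \
         GROUP BY region \
         ORDER BY region ASC NULLS LAST",
    )?;
    let results = df.collect().await?;
    println!("{:?}", results);
    Ok(())
}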

Supported Data Types

DataFusion uses Arrow, and thus the Arrow type system, for query execution. The SQL types from sqlparser-rs are mapped to Arrow types according to the following table:

| SQL Data Type | Arrow DataType                |
| ------------- | ----------------------------- |
| CHAR          | Utf8                          |
| VARCHAR       | Utf8                          |
| UUID          | Not yet supported             |
| CLOB          | Not yet supported             |
| BINARY        | Not yet supported             |
| VARBINARY     | Not yet supported             |
| DECIMAL       | Float64                       |
| FLOAT         | Float32                       |
| SMALLINT      | Int16                         |
| INT           | Int32                         |
| BIGINT        | Int64                         |
| REAL          | Float64                       |
| DOUBLE        | Float64                       |
| BOOLEAN       | Boolean                       |
| DATE          | Date64(DateUnit::Day)         |
| TIME          | Time64(TimeUnit::Millisecond) |
| TIMESTAMP     | Date64(DateUnit::Millisecond) |
| INTERVAL      | Not yet supported             |
| REGCLASS      | Not yet supported             |
| TEXT          | Not yet supported             |
| BYTEA         | Not yet supported             |
| CUSTOM        | Not yet supported             |
| ARRAY         | Not yet supported             |

Developer's guide

This section describes how you can get started developing DataFusion.

Core concepts

Dynamic typing

DataFusion's memory layout is Arrow's columnar format. Because a column's type is only known at runtime, DataFusion, like the Arrow crate itself, uses dynamic typing throughout most of its code. Thus, a central aspect of DataFusion's query engine is keeping track of an expression's datatype.
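
To illustrate what this looks like in practice, the helper below (hypothetical, not DataFusion code) matches on a column's DataType and downcasts the dynamically typed ArrayRef to a concrete array before operating on it:

use arrow::array::{Array, ArrayRef, Float64Array};
use arrow::datatypes::DataType;

// An ArrayRef is an Arc<dyn Array>: the concrete array type is only
// known at runtime, so we match on the DataType and then downcast.
fn sum_column(column: &ArrayRef) -> Option<f64> {
    match column.data_type() {
        DataType::Float64 => {
            let array = column.as_any().downcast_ref::<Float64Array>()?;
            Some(
                (0..array.len())
                    .filter(|&i| !array.is_null(i))
                    .map(|i| array.value(i))
                    .sum(),
            )
        }
        // A real kernel would add match arms for the other types here.
        _ => None,
    }
}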

Nullability

Arrow's columnar format natively supports the notion of null values, and so does DataFusion. Like types, DataFusion keeps track of an expression's nullability throughout planning and execution.
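
For example, Arrow arrays can be built directly from optional values, with None recorded as a null in the validity bitmap that every kernel must account for:

use arrow::array::{Array, Int32Array};

fn main() {
    // The second slot is null.
    let a = Int32Array::from(vec![Some(1), None, Some(3)]);
    assert_eq!(a.null_count(), 1);
    assert!(a.is_null(1));
}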

Field and Schema

Arrow's Rust implementation has a Field that contains information about a column:

  • name
  • datatype
  • nullability

A Schema is essentially a vector of fields.
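
A small arrow-rs example:

use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // A Field: column name, datatype, and whether the column may contain nulls.
    let name = Field::new("name", DataType::Utf8, false);
    let age = Field::new("age", DataType::Int32, true);

    // A Schema is essentially a vector of Fields.
    let schema = Schema::new(vec![name, age]);
    println!("{:?}", schema);
}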

parse, plan, optimize, execute

When a query is sent to DataFusion, it passes through several steps before a result is obtained. Broadly, they are:

  1. The string is parsed into an abstract syntax tree (AST); we use sqlparser for this.
  2. The AST is converted to a logical plan (src/sql)
  3. The logical plan is optimized to a new logical plan (src/optimizer)
  4. The logical plan is converted to a physical plan (src/physical_plan/planner)
  5. The physical plan is executed (src/execution/context.rs)

Phases 1-4 are typically cheap/fast when compared to phase 5, and thus DataFusion puts a lot of effort into ensuring that phase 5 runs without errors.
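
ExecutionContext also exposes these phases individually, which is handy for inspecting intermediate plans. A sketch, assuming the method names found in snapshots of this era (they may shift between versions):

use datafusion::error::Result;
use datafusion::execution::context::ExecutionContext;

async fn run(ctx: &mut ExecutionContext, sql: &str) -> Result<()> {
    // Steps 1-2: SQL string -> logical plan (parsing happens inside).
    let logical_plan = ctx.create_logical_plan(sql)?;
    // Step 3: logical plan -> optimized logical plan.
    let optimized_plan = ctx.optimize(&logical_plan)?;
    // Step 4: logical plan -> physical plan.
    let physical_plan = ctx.create_physical_plan(&optimized_plan)?;
    // Step 5: execute, collecting all result batches.
    let results = ctx.collect(physical_plan).await?;
    println!("{:?}", results);
    Ok(())
}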

Logical plan

A logical plan is a representation of a query without the details of how it should be executed. In general,

  • given a data schema and a logical plan, the resulting schema is known.
  • given data and a logical plan, we agree on the result, irrespective of how it is computed.

A logical plan is composed of nodes (called LogicalPlan), and each node is composed of logical expressions (called Expr). All of these are located in src/logical_plan/mod.rs.
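
Logical expressions can also be built programmatically. Assuming the expression helpers in src/logical_plan (col and lit), the predicate of a WHERE clause such as a > 5 corresponds to an Expr tree like this:

use datafusion::logical_plan::{col, lit};

fn main() {
    // Builds the logical expression tree for: a > 5
    let predicate = col("a").gt(lit(5));
    println!("{:?}", predicate);
}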

Physical plan

A physical plan is a plan that can be executed. Unlike a logical plan, a physical plan contains specific information about how the calculation should be performed (e.g. which actual Rust functions are used).

A physical plan is composed of nodes (which implement the ExecutionPlan trait), and each node is composed of physical expressions (which implement the PhysicalExpr trait) or aggregate expressions (which implement the AggregateExpr trait). All of these are located in src/physical_plan.

Physical expressions are evaluated against a RecordBatch (a group of Arrays plus a Schema).
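
For reference, a RecordBatch pairs a Schema with equal-length columns; a small arrow-rs example:

use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    // Physical expressions receive batches like this one and produce new Arrays.
    let batch = RecordBatch::try_new(schema, vec![column])?;
    println!("{} rows", batch.num_rows());
    Ok(())
}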

Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolchain:

  • cargo build
  • cargo fmt to format the code
  • cargo test to test
  • etc.

How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

  • Add the actual implementation of the function:
    • here for string functions
    • here for math functions
    • here for datetime functions
    • create a new module here for other functions
  • In src/physical_plan/functions, add:
    • a new variant to BuiltinScalarFunction
    • a new entry to FromStr with the name of the function as called by SQL
    • a new line in return_type with the expected return type of the function, given an incoming type
    • a new line in signature with the signature of the function (number and types of its arguments)
    • a new line in create_physical_expr mapping the built-in to the implementation
    • tests for the function.
  • In tests/sql.rs, add a new test where the function is called through SQL against well-known data and returns the expected result.
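
To make the shape of such an implementation concrete, here is a sketch of a hypothetical reverse string function in the style of the existing string kernels (the function, its placement, and the exact error plumbing are assumptions for illustration, not existing DataFusion code):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringArray, StringBuilder};
use datafusion::error::{DataFusionError, Result};

/// Hypothetical kernel: reverses each string in the first argument,
/// returning a new array of the same length and propagating nulls.
pub fn reverse(args: &[ArrayRef]) -> Result<ArrayRef> {
    let strings = args[0]
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| DataFusionError::Internal("reverse expects a Utf8 array".to_string()))?;
    let mut builder = StringBuilder::new(strings.len());
    for i in 0..strings.len() {
        if strings.is_null(i) {
            builder.append_null()?;
        } else {
            builder.append_value(&strings.value(i).chars().rev().collect::<String>())?;
        }
    }
    Ok(Arc::new(builder.finish()))
}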

How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

  • Add the actual implementation of an Accumulator and AggregateExpr:
    • here for string functions
    • here for math functions
    • here for datetime functions
    • create a new module here for other functions
  • In src/physical_plan/aggregates, add:
    • a new variant to BuiltinAggregateFunction
    • a new entry to FromStr with the name of the function as called by SQL
    • a new line in return_type with the expected return type of the function, given an incoming type
    • a new line in signature with the signature of the function (number and types of its arguments)
    • a new line in create_aggregate_expr mapping the built-in to the implementation
    • tests for the function.
  • In tests/sql.rs, add a new test where the function is called through SQL against well-known data and returns the expected result.
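
As with scalar functions, it helps to see the general shape of an accumulator. The standalone struct below sketches a hypothetical PRODUCT aggregate and only mirrors the lifecycle of the Accumulator trait (update per input, merge partial states, evaluate a final value); the actual trait and its exact method signatures live in src/physical_plan:

/// Hypothetical accumulator for a PRODUCT aggregate over f64 values.
/// A real implementation would implement the Accumulator trait and
/// exchange Arrow/DataFusion scalar types rather than plain f64s.
#[derive(Debug)]
struct ProductAccumulator {
    product: f64,
}

impl ProductAccumulator {
    fn new() -> Self {
        Self { product: 1.0 }
    }

    /// Fold one input value into the running state (nulls are skipped).
    fn update(&mut self, value: Option<f64>) {
        if let Some(v) = value {
            self.product *= v;
        }
    }

    /// Combine the partial state produced by another partition; this is
    /// what allows the aggregate to run under parallel execution.
    fn merge(&mut self, other_state: f64) {
        self.product *= other_state;
    }

    /// Produce the final value once all input has been seen.
    fn evaluate(&self) -> f64 {
        self.product
    }
}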