Code for benchmarking the DataFusion query engine over time: https://alamb.github.io/datafusion-benchmarking/
This repo contains scripts used to run benchmarks against the DataFusion query engine, specifically using the `datafusion-cli` binary. The benchmarks are designed to measure the performance of various queries and operations in DataFusion over time.
It is hugely inspired by Mike McCandless's https://benchmarks.mikemccandless.com/. The goal is to spot any long-term regressions (or gains!) in DataFusion's performance that might otherwise accidentally slip past the committers, hopefully avoiding the fate of the boiling frog.
This is purposely not in the actual DataFusion repo so that it can be run
against any DataFusion version, including the latest master branch and
historical releases.
Because building and benchmarking DataFusion takes a long time, the scripts separate the build, execution, and reporting phases, saving all intermediate results as files. This makes it easy to re-run benchmarks or analyze results without rebuilding.
The directory structure is as follows:
- `data`: benchmark input data (symlink to `datafusion/data`)
- `queries`: query scripts (new copy)
- `results`: output directory for benchmark results
- `builds`: directory of binaries, named with the format `datafusion-cli@<revision>@<revision_timestamp>`
- `scripts`: scripts for manually running benchmarks (not part of this report)
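As a sketch, the `datafusion-cli@<revision>@<revision_timestamp>` naming convention can be assembled in shell. The revision and timestamp values below are made-up placeholders; in practice they would come from git metadata (e.g. `git rev-parse --short HEAD`):

```shell
# Construct a build artifact name in the
# datafusion-cli@<revision>@<revision_timestamp> format.
# revision and revision_timestamp are placeholder values for illustration.
revision="abc1234"
revision_timestamp="2024-06-01T12-00-00"
name="datafusion-cli@${revision}@${revision_timestamp}"
echo "${name}"
```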
1. (DONE) Add `events.json` that annotates major PRs in the commit history
2. Add both raw times and normalized times
3. Automate parallel builds of multiple DataFusion versions
4. Rerun benchmarks on a dedicated machine (ec2 metal?)
5. Rerun the benchmarks on a regular basis (cron job?)
- Add clickbench extended queries
- Add tpch queries
- Add h2o.ai benchmarks
- Add sql planner benchmarks
```shell
git clone git@github.com:apache/datafusion.git
```
Build `datafusion-cli` for the desired version(s) of DataFusion using the `build_datafusion.sh` script, which will:

- Check out the specified version of DataFusion.
- Build `datafusion-cli` using `cargo build --release`.
- Copy the built binary to the `builds` directory with a specific naming convention.
Example usage:

```shell
git clone git@github.com:apache/datafusion.git
./build_datafusion.sh 47.0.0
```
Example building `datafusion-cli` for version 48:

```shell
DATAFUSION_DIR=/home/alamb/arrow-datafusion2 ./build_datafusion_cli.sh 48.0.0
```
The `./benchmark.py` script can be used to run benchmarks for the `datafusion-cli` binaries in `builds`. Results are left in the `results` directory, with each benchmark's results stored in a separate CSV file.
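Such per-benchmark CSV files can be summarized with standard shell tools. In this sketch, the column names (`query`, `elapsed_ms`) and the data are assumptions for illustration only, not the script's actual schema; check the real CSV header before adapting it:

```shell
# Summarize a hypothetical results CSV: mean elapsed time per query.
# Column names (query, elapsed_ms) and values are assumed for illustration.
cat > /tmp/example_results.csv <<'EOF'
query,elapsed_ms
q1,120
q1,100
q2,50
EOF

# Skip the header, then average the second column per query name
tail -n +2 /tmp/example_results.csv \
  | awk -F, '{ sum[$1] += $2; n[$1]++ } END { for (q in sum) printf "%s,%.1f\n", q, sum[q] / n[q] }' \
  | sort
```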
You can then produce reports using the provided report generator script:

```shell
# Run analysis on the results (outputs to docs/ directory)
./report.py

# Or specify a custom results directory
./report.py --results-dir results
```
TODO: create a cron job or similar to automate the daily builds and tests. Ideally it will automatically build datafusion-cli for all commits in the last day and run the benchmarks, storing the results for later analysis.
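One possible shape for that automation is a crontab entry; the schedule, paths, and the idea of building only the latest revision are all assumptions, not an implemented workflow:

```
# Hypothetical crontab entry: every day at 02:00, build the latest commit
# and run the benchmarks (paths and arguments are placeholders)
0 2 * * * cd /path/to/datafusion-benchmarking && ./build_datafusion_cli.sh "$(git -C datafusion rev-parse --short HEAD)" && ./benchmark.py
```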
Next, run `builds.sh` (remaining builds), then generate the benchmark results, then do some charting and analysis along with starting to automate the daily builds.
The commands used to build the `datafusion-cli` binary have changed over time. To build older versions, use the `build_datafusion_cli_old` script:

```shell
DATAFUSION_DIR=/home/alamb/arrow-datafusion2 ./build_datafusion_cli_old.sh 45.0.0 > 45.0.0.log 2>&1 &
DATAFUSION_DIR=/home/alamb/arrow-datafusion3 ./build_datafusion_cli_old.sh 44.0.0 > 44.0.0.log 2>&1 &
DATAFUSION_DIR=/home/alamb/arrow-datafusion4 ./build_datafusion_cli_old.sh 43.0.0 > 43.0.0.log 2>&1 &
```
Here are some possibly useful commands to run manually.

Find one git commit per day:

- Dump the git log to a csv file:

```shell
cd datafusion
echo "revision,time,url" > ../commits.csv
git log --pretty=format:"%h,%ci,https://github.com/apache/datafusion/commit/%h" >> ../commits.csv
cd ..
```
Now use SQL to find one commit per day (the `ORDER BY time DESC` inside `first_value` keeps each day's latest commit):

```sql
SELECT revision, day, time
FROM (
  SELECT revision, day, time, first_value(revision) OVER (PARTITION BY day ORDER BY time DESC) as first_rev, url
  FROM (select *, date_bin('1 day', time) as day from 'commits.csv')
)
WHERE first_rev = revision
ORDER BY time DESC;
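As a quick sanity check without `datafusion-cli`, the same "one commit per day" selection can be approximated with `sort` and `awk` on the `revision,time,url` CSV. The sample rows below are fabricated for illustration:

```shell
# Approximate the window-function query with sort/awk:
# sort by time descending, keep the first (i.e. latest) row seen per day.
# The sample data is made up for illustration only.
cat > /tmp/commits_sample.csv <<'EOF'
revision,time,url
aaa111,2024-06-02 09:00:00 +0000,u1
bbb222,2024-06-02 17:30:00 +0000,u2
ccc333,2024-06-01 11:00:00 +0000,u3
EOF

tail -n +2 /tmp/commits_sample.csv \
  | sort -t, -k2,2r \
  | awk -F, '{ day = substr($2, 1, 10); if (!(day in seen)) { seen[day] = 1; print $1 "," day } }'
```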
Here is how to use `datafusion-cli` to generate the commands to build `datafusion-cli` for each commit in the last day:

```shell
datafusion-cli --format csv -c "SELECT './build_datafusion_cli.sh ' || revision FROM (select revision, day, time, first_value(revision) OVER (PARTITION BY day ORDER BY time DESC) as first_rev, url FROM (select *, date_bin('1 day', time) as day from 'commits.csv')) WHERE first_rev = revision ORDER by time DESC;"
```
Here is how to use `datafusion-cli` to generate the commands to build `datafusion-cli` for all commits:

```shell
datafusion-cli --format=csv -c "SELECT 'DATAFUSION_DIR=/home/alamb/arrow-datafusion ./build_datafusion_cli.sh ' || revision from 'commits.csv' order by time desc"
```
Run the ClickBench queries with DataFusion and output the results to a CSV file:

```shell
./run_clickbench.py --output-dir /path/to/output/dir
```