Shell script to collect benchmarks for multiple versions #15144

Open · wants to merge 17 commits into base: main
3 changes: 2 additions & 1 deletion .github/workflows/dev.yml
@@ -44,8 +44,9 @@ jobs:
#
# ignore subproject CHANGELOG.md because they are machine generated
npx prettier@2.7.1 --write \
'{datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md' \
'{benchmarks,datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md' \
'!datafusion/CHANGELOG.md' \
'!benchmarks/venv/**/*.md' \
README.md \
CONTRIBUTING.md
git diff --exit-code
94 changes: 71 additions & 23 deletions benchmarks/README.md
@@ -23,19 +23,18 @@
This crate contains benchmarks based on popular public data sets and
open source benchmark suites, to help with performance and scalability
testing of DataFusion.


## Other engines

The benchmarks measure changes to DataFusion itself, rather than
its performance against other engines. For competitive benchmarking,
DataFusion is included in the benchmark setups for several popular
benchmarks that compare performance with other engines. For example:

* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
- [ClickBench][clickbench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
- [H2o.ai `db-benchmark`][h2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)

[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
[clickbench]: https://github.com/ClickHouse/ClickBench/tree/main
[h2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark

# Running the benchmarks

@@ -65,42 +64,54 @@
Create / download a specific dataset (TPCH)
```shell
./bench.sh data tpch
```

Data is placed in the `data` subdirectory.

## Running benchmarks

Run benchmark for TPC-H dataset

```shell
./bench.sh run tpch
```

or for TPC-H dataset scale 10

```shell
./bench.sh run tpch10
```

To run for specific query, for example Q21
To run a specific query, for example Q21

```shell
./bench.sh run tpch10 21
```
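
The single-query form generalizes to a small loop when you want a handful of queries. A dry-run sketch (the `echo` stands in for the real invocation, and the query numbers are arbitrary examples):

```shell
# Dry-run sketch: print the bench.sh invocation for several TPC-H queries.
# Replace `echo` with the real command inside a DataFusion checkout.
for q in 1 5 21; do
  echo "./bench.sh run tpch10 $q"
done
```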

## Benchmark with modified configurations
## Benchmarks with modified configurations

### Select join algorithm

The benchmark runs with `prefer_hash_join == true` by default, which enforces the HASH join algorithm.
To run the TPCH benchmarks with a join algorithm other than HASH:

```shell
PREFER_HASH_JOIN=false ./bench.sh run tpch
```
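
To compare the two join strategies back to back, one option is a small loop over both settings. A dry-run sketch (the `echo` stands in for the real command, which needs a DataFusion checkout):

```shell
# Dry-run sketch: print one bench.sh invocation per PREFER_HASH_JOIN setting
# so the two join strategies can be benchmarked back to back.
for prefer in true false; do
  echo "PREFER_HASH_JOIN=$prefer ./bench.sh run tpch"
done
```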

### Configure with environment variables
Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided environment variables are
also considered by the benchmarks.
The following configuration runs the TPCH benchmark with datafusion configured to *not* repartition join keys.

DataFusion [configuration settings](https://datafusion.apache.org/user-guide/configs.html)
can be provided as environment variables to the benchmarks.

For example, to run the TPCH benchmark with DataFusion configured to _not_ repartition join keys:

```shell
DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch
```

You might want to adjust the results location to avoid overwriting previous results.
Environment configuration that was picked up by datafusion is logged at `info` level.
To verify that datafusion picked up your configuration, run the benchmarks with `RUST_LOG=info` or higher.
Environment configuration that was picked up by DataFusion is logged at `info` level.
To verify that DataFusion picked up your configuration, run the benchmarks with `RUST_LOG=info` or higher.
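
As a sketch of the mechanism: settings are passed as `DATAFUSION_`-prefixed environment variables, so a child process such as `bench.sh` simply inherits them. The snippet below only shows that inheritance; it does not run any benchmark:

```shell
# Sketch: export a DataFusion setting plus RUST_LOG, then list what a child
# process (e.g. bench.sh) would inherit. Illustrates the mechanism only.
export DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false
export RUST_LOG=info
env | grep -E '^(DATAFUSION_|RUST_LOG)'
```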

## Comparing performance of main and a branch

@@ -195,7 +206,7 @@
cargo run --release --bin dfbench -- tpch --iterations 3 --path ./data --format

See the help for more details.

### Different features
### Benchmarking with different DataFusion crate features

You can enable `mimalloc` or `snmalloc` (to use either the mimalloc or snmalloc allocator) as features by passing them in as `--features`. For example:

@@ -313,7 +324,7 @@
FLAGS:

# Writing a new benchmark

## Creating or downloading data outside of the benchmark
## Creating or downloading data outside the benchmark

If you want to create or download the data with Rust as part of running the benchmark, see the next
section on adding a benchmark subcommand and add code to create or download data as part of its
@@ -365,7 +376,7 @@
Your benchmark should create and use an instance of `BenchmarkRun` defined in `b
- When all cases are done, call the `BenchmarkRun`'s `maybe_write_json` method, giving it the value
of the `--output` structopt field on `RunOpt`.

# Benchmarks
# Benchmark Descriptions

The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience.

@@ -402,9 +413,9 @@
done dropping runtime in 83.531417ms

## ClickBench

The ClickBench[1] benchmarks are widely cited in the industry and
The ClickBench[1][1] benchmarks are widely cited in the industry and
focus on grouping / aggregation / filtering. This runner uses the
scripts and queries from [2].
scripts and queries from [2][2].

[1]: https://github.com/ClickHouse/ClickBench
[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
@@ -419,7 +430,7 @@
logs.

Example

dfbench parquet-filter --path ./data --scale-factor 1.0
dfbench parquet-filter --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `--scale-factor`
@@ -451,6 +462,7 @@
Iteration 2 returned 1781686 rows in 1947 ms
```

## Sort

Test performance of sorting large datasets

This test sorts a synthetic dataset generated during the
@@ -474,22 +486,27 @@
Additionally, an optional `--limit` flag is available for the sort benchmark. Wh
See [`sort_tpch.rs`](src/sort_tpch.rs) for more details.

### Sort TPCH Benchmark Example Runs

1. Run all queries with default setting:

```bash
cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
```

2. Run a specific query:

```bash
cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
```

3. Run all queries as TopK queries on presorted data:

```bash
cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
```

4. Run all queries with `bench.sh` script:

```bash
./bench.sh run sort_tpch
```
@@ -524,116 +541,147 @@
When the memory limit is exceeded, the aggregation intermediate results will be

External aggregation benchmarks run several aggregation queries with different memory limits, on TPCH `lineitem` table. Queries can be found in [`external_aggr.rs`](src/bin/external_aggr.rs).

This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.
This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.

### External Aggregation Example Runs

1. Run all queries with predefined memory limits:

```bash
# Under 'benchmarks/' directory
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
```

2. Run a query with specific memory limit:

```bash
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
```

3. Run all queries with `bench.sh` script:

```bash
./bench.sh data external_aggr
./bench.sh run external_aggr
```
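
The `--memory-limit` values such as `30M` are human-readable sizes. A hypothetical sketch of turning such a string into bytes, assuming binary (1024-based) multipliers; this is an illustration only, not the benchmark's actual parser:

```shell
# Sketch: convert a human-readable size such as the benchmark's
# `--memory-limit 30M` into bytes. Illustration only; not the real parser.
limit=30M
case "$limit" in
  *K) bytes=$(( ${limit%K} * 1024 )) ;;
  *M) bytes=$(( ${limit%M} * 1024 * 1024 )) ;;
  *G) bytes=$(( ${limit%G} * 1024 * 1024 * 1024 )) ;;
  *)  bytes=$limit ;;
esac
echo "$bytes"
```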


## h2o benchmarks for groupby

### Generate data for h2o benchmarks

There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.

1. Generate small data (1e7 rows)

```bash
./bench.sh data h2o_small
```


2. Generate medium data (1e8 rows)

```bash
./bench.sh data h2o_medium
```


3. Generate large data (1e9 rows)

```bash
./bench.sh data h2o_big
```
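
The three sizes map to the row counts listed above; a tiny sketch that just prints that mapping:

```shell
# Sketch: the three h2o groupby dataset sizes and their row counts
# (per the list above).
for pair in small:1e7 medium:1e8 big:1e9; do
  echo "h2o_${pair%%:*} -> ${pair#*:} rows"
done
```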

### Run h2o benchmarks

There are three options for running h2o benchmarks: `small`, `medium`, and `big`.

1. Run small data benchmark

```bash
./bench.sh run h2o_small
```

2. Run medium data benchmark

```bash
./bench.sh run h2o_medium
```

3. Run large data benchmark

```bash
./bench.sh run h2o_big
```

4. Run a specific query with a specific data path

For example, to run query 1 with the small data generated above:

```bash
cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
```
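
To run the first few queries in sequence, the same invocation can be looped over query numbers. A dry-run sketch (`echo` stands in for the real command, and the data path is the small-set file from the example above):

```shell
# Dry-run sketch: print dfbench invocations for the first three h2o groupby
# queries against the small dataset. Replace `echo` with the real command.
data=./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv
for q in 1 2 3; do
  echo "cargo run --release --bin dfbench -- h2o --path $data --query $q"
done
```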

## h2o benchmarks for join

### Generate data for h2o benchmarks

There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.

1. Generate small data (4 table files, the largest is 1e7 rows)

```bash
./bench.sh data h2o_small_join
```


2. Generate medium data (4 table files, the largest is 1e8 rows)

```bash
./bench.sh data h2o_medium_join
```

3. Generate large data (4 table files, the largest is 1e9 rows)

```bash
./bench.sh data h2o_big_join
```

### Run h2o benchmarks

There are three options for running h2o benchmarks: `small`, `medium`, and `big`.

1. Run small data benchmark

```bash
./bench.sh run h2o_small_join
```

2. Run medium data benchmark

```bash
./bench.sh run h2o_medium_join
```

3. Run large data benchmark

```bash
./bench.sh run h2o_big_join
```

4. Run a specific query with specific join data paths; the paths must include the 4 table files.

For example, to run query 1 with the small data generated above:

```bash
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
```
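
`--join-paths` takes the four table files as one comma-separated list. A small sketch that assembles that argument from the file names shown in the example above:

```shell
# Sketch: build the comma-separated --join-paths argument from the four
# join table files listed in the example above.
dir=./benchmarks/data/h2o
files="J1_1e7_NA_0.csv J1_1e7_1e1_0.csv J1_1e7_1e4_0.csv J1_1e7_1e7_NA.csv"
paths=$(for f in $files; do printf '%s/%s,' "$dir" "$f"; done)
paths=${paths%,}  # drop the trailing comma
echo "$paths"
```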

### Collect Benchmarks

Collect benchmarks for the current `main` branch and the 5 previous releases.

```bash
./collect_bench.sh [benchmark_name](optional)
```

Note: `benchmark_name` can be any benchmark defined in `bench.sh`. It defaults to `all`, just like `bench.sh`.
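
A dry-run sketch of the general shape of such a collection loop: one benchmark run per git ref. The ref names here are placeholders, and `collect_bench.sh`'s actual logic may differ:

```shell
# Dry-run sketch: one bench.sh run per git ref (main plus previous releases).
# Ref names are placeholders; the real script's logic may differ.
for ref in main rel-1 rel-2 rel-3 rel-4 rel-5; do
  echo "checkout $ref; ./bench.sh run tpch"
done
```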

[1]: http://www.tpc.org/tpch/
[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page