Description
Is your feature request related to a problem or challenge?
The ClickBench benchmark has excellent coverage for aggregate / grouping
We have used the ClickBench benchmark, run via `bench.sh`, for important work improving aggregates such as #6988 and #7064. However, there are some important optimizations, like #8849 and #7191 from @avantgardnerio, that target use cases the ClickBench benchmark does not cover.
For example, @jayzhan211 's change in #8849 (comment) makes certain realistic queries
Details on `bench.sh`
$ ./benchmarks/bench.sh --help
Orchestrates running benchmarks against DataFusion checkouts
Usage:
./benchmarks/bench.sh data [benchmark]
./benchmarks/bench.sh run [benchmark]
./benchmarks/bench.sh compare <branch1> <branch2>
**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/andrewlamb/Software/arrow-datafusion/benchmarks/data
./bench.sh data
# Run the 'tpch' benchmark on the datafusion checkout in /source/arrow-datafusion
DATAFASION_DIR=/source/arrow-datafusion ./bench.sh run tpch
**********
* Commands
**********
data: Generates data needed for benchmarking
run: Runs the named benchmark
compare: Comares results from benchmark runs
**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch: TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table
tpch_mem: TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table
tpch10_mem: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
parquet: Benchmark of parquet reader's filtering speed
sort: Benchmark of sorting speed
clickbench_1: ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR directory to store datasets
CARGO_COMMAND command that runs the benchmark binary
DATAFASION_DIR directory to use (default /Users/andrewlamb/Software/arrow-datafusion/benchmarks/..)
Describe the solution you'd like
I would like to add a new benchmark to `bench.sh` that uses the same dataset but has different queries than the existing ClickBench benchmarks:
$ ./benchmarks/bench.sh run clickbench_extended
The new queries should be
- realistic (we can write an English sentence explaining the quantity they compute and how it might be used)
- reflect some query pattern
Here is an example from #8849 (comment)
Query: Distinct counts
Query Explanation: Data exploration: understand the qualities of the data in hits.parquet
Query Properties: multiple count distinct aggregates on string datatypes
❯ SELECT
COUNT(DISTINCT "SearchPhrase"),
COUNT(DISTINCT "MobilePhone"),
COUNT(DISTINCT "MobilePhoneModel")
FROM 'hits.parquet';
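To make the semantics of the example concrete, here is a minimal sketch that runs the same query shape on a tiny in-memory table. It uses Python's built-in `sqlite3` purely as a stand-in for DataFusion. The table name `hits`, the `distinct_counts` helper, and the rows are invented for illustration and are not taken from `hits.parquet`.

```python
import sqlite3

# Toy stand-in for hits.parquet: these rows are invented for illustration.
ROWS = [
    ("cheap flights", 0, ""),
    ("cheap flights", 1, "iPhone"),
    ("weather", 1, "iPhone"),
    ("", 2, "iPhone"),
    ("news", 0, ""),
]

# Same shape as the query above, but reading from a table named "hits"
# instead of the parquet file.
QUERY = """
SELECT
    COUNT(DISTINCT "SearchPhrase"),
    COUNT(DISTINCT "MobilePhone"),
    COUNT(DISTINCT "MobilePhoneModel")
FROM hits;
"""


def distinct_counts(rows):
    """Load rows into an in-memory SQLite table and return the three distinct counts."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'CREATE TABLE hits ('
            '"SearchPhrase" TEXT, "MobilePhone" INTEGER, "MobilePhoneModel" TEXT)'
        )
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    print(distinct_counts(ROWS))  # (4, 3, 2)
```

Note that the empty string counts as its own distinct value, so each aggregate has to track every distinct string it has seen in its column. That per-column bookkeeping is the property this query is meant to exercise.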
Describe alternatives you've considered
No response
Additional context
No response