Repository for reproducible benchmarking of database-like operations in a single-node environment.
For the benchmark report see h2oai.github.io/db-benchmark.
The benchmark focuses mainly on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Benchmarked tasks:
- groupby
- join
- sort
- read
Benchmarked solutions:

- dask
- data.table
- dplyr
- DataFrames.jl
- pandas
- (py)datatable
- spark
- modin (for status see #38)
- cudf (for status see #44)
- clickhouse (for status see #73)
To reproduce the benchmark (an end-to-end sketch follows this list):

- edit `path.env` and set the `julia` and `java` paths
- if a solution uses python, create a new `virtualenv` as `$solution/py-$solution`; for example, for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution (if needed, activate each `virtualenv`)
- edit `run.conf` to define the solutions and tasks to benchmark
- generate data; for `groupby` use `Rscript groupby-datagen.R 1e7 1e2 0 0` to create `G1_1e7_1e2_0_0.csv`, re-save to binary data where needed, then create a `data` directory and keep all data files there
- edit `data.csv` to define the data sizes to benchmark using the `active` flag
- start the benchmark with `./run.sh`
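
For concreteness, a minimal end-to-end sketch of the steps above for the `pandas` solution. It assumes `groupby-datagen.R` writes its CSV into the working directory; the pip install line is illustrative, and `path.env`, `run.conf`, and `data.csv` are still edited by hand as described above.

```bash
# create the per-solution virtualenv and install the solution into it
virtualenv pandas/py-pandas --python=/usr/bin/python3.6
source pandas/py-pandas/bin/activate
python -m pip install --upgrade pandas   # illustrative install step
deactivate

# generate a 1e7-row groupby dataset
Rscript groupby-datagen.R 1e7 1e2 0 0    # creates G1_1e7_1e2_0_0.csv
mkdir -p data
mv G1_1e7_1e2_0_0.csv data/              # keep all data files in data/

# run solutions/tasks enabled in run.conf on sizes marked active in data.csv
./run.sh
```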
Example environment:

- setting up an r3-8xlarge instance (244 GB RAM, 32 cores): Amazon EC2 for beginners
- (outdated) full reproduce script on clean Ubuntu 16.04: repro.sh
Notes:

- Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory when running the benchmark script, which results in the process being killed by the OS. Lastly, we set a timeout for each single benchmark script; once the timeout value is reached, the script is terminated (see the sketch below).
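
For illustration, one way such a per-script timeout can be enforced is GNU coreutils `timeout`; this is a hedged sketch, not necessarily the mechanism used by `./run.sh`, and both the script name and the limit are example values.

```bash
# kill the benchmark script if it exceeds the time limit
# (3600 s and the script path are examples, not the benchmark's settings)
timeout 3600 ./pandas/groupby-pandas.py
if [ $? -eq 124 ]; then   # GNU timeout exits with 124 on timeout
  echo "benchmark script terminated: timeout reached"
fi
```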