db-benchmark

Repository for reproducible benchmarking of database-like operations in a single-node environment.
For the benchmark report see h2oai.github.io/db-benchmark.
The benchmark is focused mainly on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Tasks

  • groupby
  • join
  • sort
  • read

Solutions

Reproduce

  • edit path.env and set the julia and java paths
  • if a solution uses python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution (if needed, activate each virtualenv first)
  • edit run.conf to define the solutions and tasks to benchmark (an illustrative snippet follows this list)
  • generate data; for groupby use Rscript groupby-datagen.R 1e7 1e2 0 0 to create G1_1e7_1e2_0_0.csv, re-save to binary data where needed, then create a data directory and keep all data files there
  • edit data.csv to define the data sizes to benchmark using the active flag (see the example below)
  • start the benchmark with ./run.sh (an end-to-end sketch of these steps follows)
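
Assuming run.conf is sourced as shell by run.sh, a hypothetical snippet for the solution/task selection could look as follows; the variable names RUN_SOLUTIONS and RUN_TASKS are assumptions for illustration, so defer to the run.conf shipped in the repository:

    # hypothetical run.conf: select solutions and tasks to benchmark
    # (variable names RUN_SOLUTIONS/RUN_TASKS are assumed, not confirmed here)
    export RUN_SOLUTIONS="data.table pandas"
    export RUN_TASKS="groupby join"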
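
data.csv lists the generated data cases together with an active flag that enables or disables each size for a run. An illustrative layout, with column names assumed rather than taken from the repository; here active=1 enables the 1e7 case while the 1e8 case stays disabled:

    task,data,active
    groupby,G1_1e7_1e2_0_0,1
    groupby,G1_1e8_1e2_0_0,0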
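
Putting the steps together, a minimal end-to-end sketch for a single python solution (pandas), assuming python3.6 and Rscript are on the PATH and that installing pandas from PyPI is sufficient; install steps differ per solution:

    # create the pandas virtualenv and install the solution inside it
    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    source pandas/py-pandas/bin/activate
    python -m pip install pandas
    deactivate

    # generate the groupby dataset G1_1e7_1e2_0_0.csv and move it to data/
    Rscript groupby-datagen.R 1e7 1e2 0 0
    mkdir -p data && mv G1_1e7_1e2_0_0.csv data/

    # after editing run.conf and data.csv, launch the benchmark
    ./run.sh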

Example environment

Acknowledgment

  • Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all of them. Some solutions might also run out of memory while running the benchmark script, which results in the process being killed by the OS. Lastly, we added a timeout for a single benchmark script to run; once the timeout value is reached, the script is terminated.
