db-benchmark

Repository for reproducible benchmarking of database-like operations in a single-node environment.
For the benchmark report see h2oai.github.io/db-benchmark.
The benchmark is focused mainly on portability and reproducibility, and is meant to compare scalability in both data volume and data complexity.

Tasks

  • groupby
  • join
  • sort
  • read

Solutions

Reproduce

  • edit path.env and set the julia and java paths
  • if a solution uses python, create a new virtualenv as $solution/py-$solution; for example, for pandas use virtualenv pandas/py-pandas --python=/usr/bin/python3.6
  • install every solution (if needed, activate each virtualenv first)
  • edit run.conf to define the solutions and tasks to benchmark (an illustrative snippet follows this list)
  • generate data; for groupby use Rscript groupby-datagen.R 1e7 1e2 0 0 to create G1_1e7_1e2_0_0.csv, re-save to binary data where needed, then create a data directory and keep all data files there
  • edit data.csv to define the data sizes to benchmark using the active flag (see the example below)
  • start the benchmark with ./run.sh (an end-to-end sketch of these steps follows)
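
Assuming run.conf is sourced as shell by run.sh, a hypothetical snippet for the solution/task selection could look as follows; the variable names RUN_SOLUTIONS and RUN_TASKS are assumptions for illustration, so defer to the run.conf shipped in the repository:

    # hypothetical run.conf: select solutions and tasks to benchmark
    # (variable names RUN_SOLUTIONS/RUN_TASKS are assumed, not confirmed here)
    export RUN_SOLUTIONS="data.table pandas"
    export RUN_TASKS="groupby join"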
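
data.csv lists the generated data cases together with an active flag that enables or disables each size for a run. An illustrative layout, with column names assumed rather than taken from the repository; here active=1 enables the 1e7 case while the 1e8 case stays disabled:

    task,data,active
    groupby,G1_1e7_1e2_0_0,1
    groupby,G1_1e8_1e2_0_0,0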
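
Putting the steps together, a minimal end-to-end sketch for a single python solution (pandas), assuming python3.6 and Rscript are on the PATH and that installing pandas from PyPI is sufficient; install steps differ per solution:

    # create the pandas virtualenv and install the solution inside it
    virtualenv pandas/py-pandas --python=/usr/bin/python3.6
    source pandas/py-pandas/bin/activate
    python -m pip install pandas
    deactivate

    # generate the groupby dataset G1_1e7_1e2_0_0.csv and move it to data/
    Rscript groupby-datagen.R 1e7 1e2 0 0
    mkdir -p data && mv G1_1e7_1e2_0_0.csv data/

    # after editing run.conf and data.csv, launch the benchmark
    ./run.sh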

Example environment

Acknowledgment

  • Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all of them. Some solutions might also run out of memory while running the benchmark script, which results in the process being killed by the OS. Lastly, we added a timeout for a single benchmark script to run; once the timeout value is reached, the script is terminated.
