Commit 85bf945 (1 parent: d4f0275), in a fork of h2oai/db-benchmark. This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. Showing 9 changed files with 180 additions and 92 deletions.
File: index.Rmd (new file, 77 lines)
---
title: "Database-like ops benchmark"
output:
  html_document:
    self_contained: no
    includes:
      in_header: ga.html
---
This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against the very latest versions of these packages and is updated automatically. We provide this as a service to both developers of these packages and to users. We hope to add joins and updates, with a focus on ordered operations, which are hard to achieve in (unordered) SQL. We hope to add more solutions over time, although the most interesting candidates do not yet seem mature enough. See [README.md](https://github.com/h2oai/db-benchmark/blob/master/README.md) for detailed status.

We limit the scope to what can be achieved on a single machine. Laptop-size memory (8GB) and server-size memory (250GB) are in scope. Out-of-memory processing using local disk such as NVMe is in scope. Multi-node systems such as Spark running in single-machine mode are in scope, too. Machines are getting bigger: the EC2 X1 instance has 2TB of RAM, and a 1TB NVMe disk costs under $300. If you can perform the task on a single machine, then perhaps you should. To our knowledge, nobody has yet compared this software in this way and published the results.

We also include the syntax being timed alongside the timing. This way you can immediately see whether you perform these tasks, and whether the timing differences matter to you. A 10x difference may be irrelevant if that is just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.
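For instance, one of the timed groupby questions, written in two of the benchmarked solutions (a sketch assuming a table `DT`/`df` with grouping column `id1` and numeric measure `v1`, as in the benchmark's G1 datasets):

```r
# data.table: sum v1 within each id1 group
DT[, .(v1 = sum(v1)), by = id1]

# dplyr: the same aggregation
df %>% group_by(id1) %>% summarise(v1 = sum(v1))
```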
The first task, and the initial motivation for this page (we have been asked many times to do so), was to update the benchmark designed and run by [Matt Dowle](https://twitter.com/MattDowle) (creator of [data.table](https://github.com/Rdatatable/data.table)) in 2014, published [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). The methodology and reproducible code can be obtained there. The exact code of this report and the benchmark scripts can be found at [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark), created by [Jan Gorecki](https://github.com/jangorecki) and funded by [H2O.ai](https://www.h2o.ai). In case of questions or feedback, feel free to file an issue there.
```{r opts, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
```
```{r render}
report_name = "index"
# render with: Rscript -e 'rmarkdown::render("index.Rmd", output_dir="public")'
# output_dir must be "public" because benchplot hardcodes that path
```
```{r init, child="rmarkdown_child/init.Rmd"}
```
## Groupby {.tabset .tabset-fade .tabset-pills}
The plot below presents a single input data size and the _basic_ set of questions, sketched just after this paragraph. Complete results of the _groupby_ task benchmark can be found in the [h2oai.github.io/db-benchmark/groupby.html](./groupby.html) report.
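For context, the _basic_ question group consists of simple aggregations of roughly this kind (data.table syntax; the authoritative list lives in the benchmark scripts, so treat this as a sketch):

```r
DT[, .(v1 = sum(v1)), by = id1]                                 # sum v1 by id1
DT[, .(v1 = sum(v1)), by = .(id1, id2)]                         # sum v1 by id1:id2
DT[, .(v1 = sum(v1), v3 = mean(v3)), by = id3]                  # sum v1, mean v3 by id3
DT[, lapply(.SD, mean), by = id4, .SDcols = paste0("v", 1:3)]   # mean v1:v3 by id4
DT[, lapply(.SD, sum), by = id6, .SDcols = paste0("v", 1:3)]    # sum v1:v3 by id6
```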
```{r filter_task_groupby}
dt_task = lld[task=="groupby" & question_group=="basic"]
```
### 0.5 GB
```{r o_groupby1_plot}
fn = "1e7_1e2_0_0"
fnam = paste0("groupby.", fn, ".png")
# remove any stale plot so benchplot regenerates it
unlink(file.path("public", report_name, "plots", fnam))
benchplot(1e7, data=paste0("G1_", fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public", report_name, "plots"))
```
![](public/index/plots/groupby.1e7_1e2_0_0.png)
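Each plot chunk follows the same pattern: delete the stale PNG, call `benchplot` to regenerate it, then embed the image. The `fn` string encodes the input data scenario; below is a hypothetical helper sketching the assumed `{rows}_{cardinality}_{NA%}_{sorted}` encoding (the field meanings are inferred, not confirmed by this commit):

```r
# parse_fn is a hypothetical helper, not part of the benchmark code;
# the field meanings below are an assumption about the naming scheme.
parse_fn = function(fn) {
  p = strsplit(fn, "_", fixed=TRUE)[[1L]]
  list(rows   = as.numeric(p[1L]),   # 1e7 rows here, i.e. the 0.5 GB case
       k      = as.numeric(p[2L]),   # grouping cardinality parameter
       na_pct = as.integer(p[3L]),   # assumed percentage of NAs
       sorted = as.integer(p[4L]))   # assumed pre-sorted flag
}
parse_fn("1e7_1e2_0_0")
```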
### 5 GB
```{r o_groupby2_plot}
fn = "1e8_1e2_0_0"
fnam = paste0("groupby.", fn, ".png")
unlink(file.path("public", report_name, "plots", fnam))
benchplot(1e8, data=paste0("G1_", fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public", report_name, "plots"))
```
![](public/index/plots/groupby.1e8_1e2_0_0.png)
### 50 GB {.active}
```{r o_groupby3_plot}
fn = "1e9_1e2_0_0"
fnam = paste0("groupby.", fn, ".png")
unlink(file.path("public", report_name, "plots", fnam))
benchplot(1e9, data=paste0("G1_", fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public", report_name, "plots"))
```
![](public/index/plots/groupby.1e9_1e2_0_0.png)
```{r environment, child="rmarkdown_child/environment.Rmd"}
```

------

```{r timetaken, child="rmarkdown_child/timetaken.Rmd"}
```

```{r status, child="rmarkdown_child/status.Rmd"}
```
File: rmarkdown_child/environment.Rmd (new file, 11 lines)
## Environment configuration

Listed solutions were run using the following language versions:

- R 3.5.1
- Python 3.6
- Julia 1.0.2

```{r environment_hardware}
# look up this run's machine in nodenames.csv and print its hardware as a Component/Value table
as.data.table(na.omit(fread("../nodenames.csv")[lld_nodename, on="nodename", t(.SD)]), keep.rownames=TRUE)[rn!="nodename", .(Component=rn, Value=V1)][, kk(.SD)]
```
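The chain above transposes the matched hardware row into a two-column Component/Value table. A minimal standalone sketch of the same idiom on toy data (`kk` is a table-formatting helper sourced from report.R and omitted here):

```r
library(data.table)
# toy stand-in for one row of nodenames.csv
hw = data.table(nodename="mr-0xc8", cpu="Xeon E5", memory="250GB")
# t(.SD) yields a matrix whose rownames are the column names;
# keep.rownames=TRUE turns them into the 'rn' column
as.data.table(hw[1L, t(.SD)], keep.rownames=TRUE)[rn!="nodename", .(Component=rn, Value=V1)]
#    Component   Value
# 1:       cpu Xeon E5
# 2:    memory   250GB
```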
File: rmarkdown_child/init.Rmd (new file, 19 lines)
```{r init_source_data}
source("../report.R", chdir=TRUE)
source("../helpers.R", chdir=TRUE)
source("../report-code.R", chdir=TRUE)
source("../benchplot.R", chdir=TRUE)
ld = time_logs(path="..")
# keep only timings from the most recent run of each benchmark script
lld = ld[script_recent==TRUE]
```

```{r init_validation}
lld_nodename = as.character(unique(lld$nodename))
if (length(lld_nodename)>1L)
  stop(sprintf("There are multiple different 'nodename' values to be presented on a single report '%s'", report_name))
lld_unfinished = lld[is.na(script_time_sec)]
if (nrow(lld_unfinished)) {
  warning(sprintf("Missing solution finish timestamp in logs.csv for '%s' (still running or launcher script killed): %s", paste(unique(lld_unfinished$task), collapse=", "), paste(unique(lld_unfinished$solution), collapse=", ")))
}
```
File: rmarkdown_child/status.Rmd (new file, 6 lines)
Report was generated on: `r format(Sys.time(), usetz=TRUE)`.

```{r status_set_success}
# append this report's name to the status file to mark a successful render
cat(paste0(report_name, "\n"), file=get_report_status_file(), append=TRUE)
```
File: rmarkdown_child/timetaken.Rmd (new file, 15 lines)
```{r timetaken_text_items}
lld_script_time = lld[, .(n_script_time_sec=uniqueN(script_time_sec), script_time_sec=unique(script_time_sec)), .(solution, task, data)]
if (nrow(lld_script_time[n_script_time_sec>1L]))
  stop(sprintf("There are multiple different 'script_time_sec' values for a single solution+task+data on report '%s'", report_name))
if (report_name=="index") {
  what_bench = "Benchmark"
  hours_took = lld_script_time[, round(sum(script_time_sec, na.rm=TRUE)/60/60, 1)]
} else {
  what_bench = paste(tools::toTitleCase(report_name), "benchmark")
  hours_took = lld_script_time[task==report_name, round(sum(script_time_sec, na.rm=TRUE)/60/60, 1)]
}
```

`r what_bench` run took around `r hours_took` hours.