reorg of reports, re-use chunks
jangorecki committed Jan 12, 2019
1 parent d4f0275 commit 85bf945
Showing 9 changed files with 180 additions and 92 deletions.
2 changes: 1 addition & 1 deletion benchplot.R
@@ -57,7 +57,7 @@ if (!interactive()) browser = function(...) stop("some new exception in timings
# .interactive default interactive(), when TRUE it will print some output to console and open png after finished
# by.nsolutions default FALSE, when TRUE it will generate png filename as 'groupby.[nsolutions].[in_rows].png' so scaling of benchplot can be easily compared for various number of solutions
# fnam fixed filename if do not want to generate from pattern
benchplot = function(.nrow=Inf, task="groupby", data, timings, code, colors, cutoff="spark", cutoff.after=0.2, .interactive=interactive(), by.nsolutions=FALSE, fnam=NULL, path="public/plots") {
benchplot = function(.nrow=Inf, task="groupby", data, timings, code, colors, cutoff="spark", cutoff.after=0.2, .interactive=interactive(), by.nsolutions=FALSE, fnam=NULL, path="public/dev/plots") {
stopifnot(c("task","time_sec_1","time_sec_2","question","question_group","solution","in_rows","out_rows","out_cols","version","git","batch") %in% names(timings))
stopifnot(is.character(task), length(task)==1L, !is.na(task))
if (!is.data.table(colors)) stop("argument colors must be data.table of solutions and colors assigned to each")
110 changes: 35 additions & 75 deletions groupby.Rmd
@@ -1,36 +1,34 @@
---
title: "Single-node data aggregation benchmark"
title: "Aggregation benchmark"
output:
html_document:
self_contained: no
includes:
in_header: ga.html
---

This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against the very latest versions of these packages and updates automatically. We provide this as a service to both the developers of these packages and their users. We hope to add joins and updates, with a focus on ordered operations which are hard to achieve in (unordered) SQL. We hope to add more solutions over time, although the most interesting ones do not yet seem mature enough. See [README.md](https://github.com/h2oai/db-benchmark/blob/master/README.md) for detailed status.
This page presents the results of the [h2oai.github.io/db-benchmark](./index.html) _groupby_ task benchmark for various data sizes and data characteristics (cardinality, percentage of missing values, pre-sorted input). There are 10 different questions run for each input dataset, categorized into two groups. _Basic_ questions refers to the set of 5 questions designed by [Matt Dowle](https://twitter.com/MattDowle) (creator of [data.table](https://github.com/Rdatatable/data.table)) in 2014 [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). _Advanced_ questions are 5 new questions meant to cover more complex queries, which are also less obvious to optimize.

We limit the scope to what can be achieved on a single machine. Laptop-size memory (8GB) and server-size memory (250GB) are in scope. Out-of-memory processing using local disk such as NVMe is in scope. Multi-node systems such as Spark running in single-machine mode are in scope, too. Machines are getting bigger: an EC2 X1 has 2TB of RAM, and a 1TB NVMe disk is under $300. If you can perform the task on a single machine, then perhaps you should. To our knowledge, nobody has yet compared this software in this way and published the results, too.
```{r opts, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
```

We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks, and whether the timing differences matter to you. A 10x difference may be irrelevant if that's just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.
```{r render}
report_name = "groupby"
# Rscript -e 'rmarkdown::render("groupby.Rmd", output_dir="public")' # output_dir must be 'public' as that path is hardcoded in benchplot
```

Because we have been asked many times to do so, the first task, and the initial motivation for this page, was to update the benchmark designed and run by [Matt Dowle](https://twitter.com/MattDowle) (creator of [data.table](https://github.com/Rdatatable/data.table)) in 2014 [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). The methodology and reproducible code can be obtained there. The exact code of this report and the benchmark script can be found at [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark), created by [Jan Gorecki](https://github.com/jangorecki) and funded by [H2O.ai](https://www.h2o.ai). In case of questions or feedback, feel free to file an issue there.
```{r init, child="rmarkdown_child/init.Rmd"}
```

```{r init, echo=FALSE}
# rm -rf public && Rscript -e 'rmarkdown::render("index.Rmd", output_dir="public")' # has to be output_dir='public' as there is hardcode in benchplot for that path
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
source("report.R")
report_status_file = get_report_status_file()
ld = time_logs()
source("helpers.R")
source("report-code.R")
source("benchplot.R") # also creates 'code' for groupby
```{r links_plots}
link = function(x) sprintf("[%s](%s/%s.png)", x, "plots", gsub("G1_", "groupby.", x, fixed=TRUE))
```

## Groupby {.tabset .tabset-fade .tabset-pills}

```{r filter_task}
dt = ld[task=="groupby" & script_recent==TRUE & question_group=="basic"]
```{r filter_task_groupby}
dt_task = lld[task=="groupby" & question_group=="basic"]
by_data = function(dt, .in_rows, .task) {
dt = dt[in_rows==as.character(.in_rows)]
if (!nrow(dt)) return(invisible(NULL))
@@ -47,102 +45,64 @@ by_data = function(dt, .in_rows, .task) {

Timings below are presented for a single dataset case having random order, no NAs (missing values) and a particular cardinality factor (group size of question 1, `k=100`). To see timings for other cases, scroll down to the full timings table. If a solution is missing from the timings table for a particular data size, refer to the benchplot for the reason and check its speed on a smaller data size tab.

### 0.5GB
### 0.5 GB

```{r o_groupby1_plot}
for (fn in c("1e7_1e2_0_0","1e7_1e1_0_0","1e7_2e0_0_0","1e7_1e2_0_1")) {
fnam = paste0("groupby.",fn,".png")
unlink(fnam)
benchplot(1e7, data=paste0("G1_",fn), timings=dt, code=groupby.code, colors=solution.colors, fnam=fnam, cutoff="spark")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e7, data=paste0("G1_",fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
}
```
![](public/plots/groupby.1e7_1e2_0_0.png)
![](public/groupby/plots/groupby.1e7_1e2_0_0.png)
 
Plots of all cases can be found at `r dt[in_rows=="1e7", paste(link(unique(data)), collapse=", ")]`. First run timings are shown below.
Plots of all cases can be found at `r dt_task[in_rows=="1e7", paste(link(unique(data)), collapse=", ")]`. First run timings are shown below.

```{r o_groupby1_table}
by_data(dt, "1e7", "groupby")
by_data(dt_task, "1e7", "groupby")
```

### 5GB
### 5 GB

```{r o_groupby2_plot}
for (fn in c("1e8_1e2_0_0","1e8_1e1_0_0","1e8_2e0_0_0","1e8_1e2_0_1")) {
fnam = paste0("groupby.",fn,".png")
unlink(fnam)
benchplot(1e8, data=paste0("G1_",fn), timings=dt, code=groupby.code, colors=solution.colors, fnam=fnam, cutoff="spark")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e8, data=paste0("G1_",fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
}
```
![](public/plots/groupby.1e8_1e2_0_0.png)
![](public/groupby/plots/groupby.1e8_1e2_0_0.png)
 
Plots of all cases can be found at `r dt[in_rows=="1e8", paste(link(unique(data)), collapse=", ")]`. First run timings are shown below.
Plots of all cases can be found at `r dt_task[in_rows=="1e8", paste(link(unique(data)), collapse=", ")]`. First run timings are shown below.

```{r o_groupby2_table}
by_data(dt, "1e8", "groupby")
by_data(dt_task, "1e8", "groupby")
```

### 50GB {.active}
### 50 GB {.active}

```{r o_groupby3_plot}
for (fn in c("1e9_1e2_0_0","1e9_1e1_0_0","1e9_2e0_0_0","1e9_1e2_0_1")) {
fnam = paste0("groupby.",fn,".png")
unlink(fnam)
benchplot(1e9, data=paste0("G1_",fn), timings=dt, code=groupby.code, colors=solution.colors, fnam=fnam, cutoff="spark")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e9, data=paste0("G1_",fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
}
```
![](public/plots/groupby.1e9_1e2_0_0.png)
![](public/groupby/plots/groupby.1e9_1e2_0_0.png)
 
Plots of all cases can be found at `r dt[in_rows=="1e9", paste(link(unique(data)), collapse=", ")]`. First run timings are shown below.
Plots of all cases can be found at `r dt_task[in_rows=="1e9", paste(link(unique(data)), collapse=", ")]`. First run timings are shown below.

```{r o_groupby3_table}
by_data(dt, "1e9", "groupby")
by_data(dt_task, "1e9", "groupby")
```

## Environment configuration

Listed solutions were run using the following language versions:
- R 3.5.1
- python 3.6
- Julia 1.0.2

```{r logs}
recent_l = dt[script_recent==TRUE, .(unq_nodename=uniqueN(nodename), nodename=nodename[1L], unq_script_time_sec=uniqueN(script_time_sec), script_time_sec=script_time_sec[1L]), .(solution, task, data)]
if (nrow(recent_l[unq_script_time_sec>1]))
stop("There are multiple different 'script_time_sec' for solution+task+data run")
if (nrow(recent_l[unq_nodename>1]))
stop("There are multiple different 'nodename' for same solution+task+data run")
```

```{r hardware}
as.data.table(na.omit(fread("nodenames.csv")[as.character(unique(recent_l$nodename)), on="nodename", t(.SD)]), keep.rownames=TRUE)[rn!="nodename", .(Component=rn, Value=V1)][, kk(.SD)]
#kB_to_GB = function(x) {
# nx = nchar(x)
# if (!identical(substring(x, nx-1, nx), "kB")) stop("unexpected units of memory returned from 'grep ^MemTotal /proc/meminfo', expects 'kB'")
# sprintf("%.2f GB", as.numeric(trimws(gsub("kB", "", x)))/1024^2)
#}
#fread(
# cmd="lscpu | grep '^Model name:\\|^CPU(s):' && grep ^MemTotal /proc/meminfo",
# sep=":", header=FALSE
#)[V1=="MemTotal", `:=`(V1="Memory", V2=kB_to_GB(V2))
# ][, .(Component=V1, Value=V2)
# ][, kk(.SD)]
```{r environment, child="rmarkdown_child/environment.Rmd"}
```

------

```{r total_task_time}
unfinished = recent_l[is.na(script_time_sec)]
if (nrow(unfinished)) {
warning(sprintf("Missing solution finish timestamp in logs.csv for '%s' (still running or killed): %s", "groupby", paste(unique(unfinished$solution), collapse=", ")))
hours_took = "at least "
} else hours_took = ""
hours_took = paste0(hours_took, recent_l[, round(sum(script_time_sec)/60/60, 1)])
```{r timetaken, child="rmarkdown_child/timetaken.Rmd"}
```

Benchmark run took around `r hours_took` hours.

```{r set_success_state}
cat("groupby\n", file=report_status_file, append=TRUE)
```{r status, child="rmarkdown_child/status.Rmd"}
```

Report was generated on: `r format(Sys.time(), usetz=TRUE)`.
77 changes: 77 additions & 0 deletions index.Rmd
@@ -0,0 +1,77 @@
---
title: "Database-like ops benchmark"
output:
html_document:
self_contained: no
includes:
in_header: ga.html
---

This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against the very latest versions of these packages and updates automatically. We provide this as a service to both the developers of these packages and their users. We hope to add joins and updates, with a focus on ordered operations which are hard to achieve in (unordered) SQL. We hope to add more solutions over time, although the most interesting ones do not yet seem mature enough. See [README.md](https://github.com/h2oai/db-benchmark/blob/master/README.md) for detailed status.

We limit the scope to what can be achieved on a single machine. Laptop-size memory (8GB) and server-size memory (250GB) are in scope. Out-of-memory processing using local disk such as NVMe is in scope. Multi-node systems such as Spark running in single-machine mode are in scope, too. Machines are getting bigger: an EC2 X1 has 2TB of RAM, and a 1TB NVMe disk is under $300. If you can perform the task on a single machine, then perhaps you should. To our knowledge, nobody has yet compared this software in this way and published the results, too.

We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks, and whether the timing differences matter to you. A 10x difference may be irrelevant if that's just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.

Because we have been asked many times to do so, the first task, and the initial motivation for this page, was to update the benchmark designed and run by [Matt Dowle](https://twitter.com/MattDowle) (creator of [data.table](https://github.com/Rdatatable/data.table)) in 2014 [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). The methodology and reproducible code can be obtained there. The exact code of this report and the benchmark script can be found at [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark), created by [Jan Gorecki](https://github.com/jangorecki) and funded by [H2O.ai](https://www.h2o.ai). In case of questions or feedback, feel free to file an issue there.

```{r opts, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
```

```{r render}
report_name = "index"
# Rscript -e 'rmarkdown::render("index.Rmd", output_dir="public")' # output_dir must be 'public' as that path is hardcoded in benchplot
```

```{r init, child="rmarkdown_child/init.Rmd"}
```

## Groupby {.tabset .tabset-fade .tabset-pills}

The plot below presents just a single input dataset and the _basic_ set of questions. Complete results of the _groupby_ task benchmark can be found in the [h2oai.github.io/db-benchmark/groupby.html](./groupby.html) report.

```{r filter_task_groupby}
dt_task = lld[task=="groupby" & question_group=="basic"]
```

### 0.5 GB

```{r o_groupby1_plot}
fn = "1e7_1e2_0_0"
fnam = paste0("groupby.",fn,".png")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e7, data=paste0("G1_",fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
```
![](public/index/plots/groupby.1e7_1e2_0_0.png)

### 5 GB

```{r o_groupby2_plot}
fn = "1e8_1e2_0_0"
fnam = paste0("groupby.",fn,".png")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e8, data=paste0("G1_",fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
```
![](public/index/plots/groupby.1e8_1e2_0_0.png)

### 50 GB {.active}

```{r o_groupby3_plot}
fn = "1e9_1e2_0_0"
fnam = paste0("groupby.",fn,".png")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e9, data=paste0("G1_",fn), timings=dt_task, code=groupby.code, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
```
![](public/index/plots/groupby.1e9_1e2_0_0.png)

```{r environment, child="rmarkdown_child/environment.Rmd"}
```

------

```{r timetaken, child="rmarkdown_child/timetaken.Rmd"}
```

```{r status, child="rmarkdown_child/status.Rmd"}
```
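
The render comments embedded in the chunks above imply that the reports are built from the repository root with a fixed output directory. A hedged sketch of the build commands, assuming the `rmarkdown` package is installed and that `benchplot` keeps its hardcoded `public` path:

```shell
# Build the landing page and the groupby report.
# output_dir must remain 'public' because benchplot writes plots
# under public/<report_name>/plots relative to the repository root.
Rscript -e 'rmarkdown::render("index.Rmd",   output_dir="public")'
Rscript -e 'rmarkdown::render("groupby.Rmd", output_dir="public")'
```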
20 changes: 10 additions & 10 deletions report.R
Expand Up @@ -10,23 +10,23 @@ get_report_solutions = function() {

# load ----

load_time = function() {
fread("time.csv")[
load_time = function(path=getwd()) {
fread(file.path(path, "time.csv"))[
!is.na(batch) &
in_rows %in% c(1e7, 1e8, 1e9) &
solution %in% get_report_solutions()
][order(timestamp)]
}
load_logs = function() {
fread("logs.csv")[
load_logs = function(path=getwd()) {
fread(file.path(path, "logs.csv"))[
!is.na(batch) &
nzchar(solution) &
solution %in% get_report_solutions() &
action %in% c("start","finish")
][order(timestamp)]
}
load_questions = function() {
fread("questions.csv")
load_questions = function(path=getwd()) {
fread(file.path(path, "questions.csv"))
}

# clean ----
@@ -147,10 +147,10 @@ transform = function(ld) {

# all ----

time_logs = function() {
d = model_time(clean_time(load_time()))
l = model_logs(clean_logs(load_logs()))
q = model_questions(clean_questions(load_questions()))
time_logs = function(path=getwd()) {
d = model_time(clean_time(load_time(path=path)))
l = model_logs(clean_logs(load_logs(path=path)))
q = model_questions(clean_questions(load_questions(path=path)))

lq = merge_logs_questions(l, q)
ld = merge_time_logsquestions(d, lq)
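
The new `path=` arguments thread one idea through all three loaders: resolve `time.csv`, `logs.csv` and `questions.csv` against an explicit directory instead of the working directory, so a child document knitted from `rmarkdown_child/` can pass `path=".."`. A minimal sketch of the pattern (illustrative only, not the full loader logic):

```r
library(data.table)

# Resolve the input file against an explicit directory,
# defaulting to the current working directory.
load_time = function(path = getwd()) {
  fread(file.path(path, "time.csv"))
}

# From a child document one level below the repository root:
# ld = time_logs(path = "..")
```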
11 changes: 11 additions & 0 deletions rmarkdown_child/environment.Rmd
@@ -0,0 +1,11 @@

## Environment configuration

Listed solutions were run using the following language versions:
- R 3.5.1
- python 3.6
- Julia 1.0.2

```{r environment_hardware}
as.data.table(na.omit(fread("../nodenames.csv")[lld_nodename, on="nodename", t(.SD)]), keep.rownames=TRUE)[rn!="nodename", .(Component=rn, Value=V1)][, kk(.SD)]
```
19 changes: 19 additions & 0 deletions rmarkdown_child/init.Rmd
@@ -0,0 +1,19 @@

```{r init_source_data}
source("../report.R", chdir=TRUE)
source("../helpers.R", chdir=TRUE)
source("../report-code.R", chdir=TRUE)
source("../benchplot.R", chdir=TRUE)
ld = time_logs(path="..")
lld = ld[script_recent==TRUE]
```

```{r init_validation}
lld_nodename = as.character(unique(lld$nodename))
if (length(lld_nodename)>1L)
stop(sprintf("There are multiple different 'nodename' to be presented on single report '%s'", report_name))
lld_unfinished = lld[is.na(script_time_sec)]
if (nrow(lld_unfinished)) {
warning(sprintf("Missing solution finish timestamp in logs.csv for '%s' (still running or launcher script killed): %s", paste(unique(lld_unfinished$task), collapse=","), paste(unique(lld_unfinished$solution), collapse=", ")))
}
```
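
One contract worth noting, assuming the chunk names shown above: each parent report must define `report_name` before including this shared child, because `init_validation` (and the plot paths built via `file.path("public", report_name, "plots")`) reference it. A sketch of the parent side:

```r
# In groupby.Rmd (the parent), before the child include:
report_name = "groupby"
# then include the shared setup:
# ```{r init, child="rmarkdown_child/init.Rmd"}
# ```
```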
6 changes: 6 additions & 0 deletions rmarkdown_child/status.Rmd
@@ -0,0 +1,6 @@

Report was generated on: `r format(Sys.time(), usetz=TRUE)`.

```{r status_set_success}
cat(paste0(report_name,"\n"), file=get_report_status_file(), append=TRUE)
```
15 changes: 15 additions & 0 deletions rmarkdown_child/timetaken.Rmd
@@ -0,0 +1,15 @@

```{r timetaken_text_items}
lld_script_time = lld[, .(n_script_time_sec=uniqueN(script_time_sec), script_time_sec=unique(script_time_sec)), .(solution, task, data)]
if (nrow(lld_script_time[n_script_time_sec>1L]))
stop(sprintf("There are multiple different 'script_time_sec' for single solution+task+data on report '%s'", report_name))
if (report_name=="index") {
what_bench = "Benchmark"
hours_took = lld_script_time[, round(sum(script_time_sec, na.rm=TRUE)/60/60, 1)]
} else {
what_bench = paste(tools::toTitleCase(report_name), "benchmark")
hours_took = lld_script_time[task==report_name, round(sum(script_time_sec, na.rm=TRUE)/60/60, 1)]
}
```

`r what_bench` run took around `r hours_took` hours.