rework report to single page with more tabsets
jangorecki committed Oct 28, 2019
1 parent 3d8a9d2 commit 03a27bb
Showing 10 changed files with 151 additions and 404 deletions.
155 changes: 0 additions & 155 deletions groupby.Rmd

This file was deleted.

14 changes: 8 additions & 6 deletions history.Rmd
@@ -7,16 +7,14 @@ output:
includes:
in_header: ga.html
---
```{r render, include=FALSE}
# Rscript -e 'rmarkdown::render("history.Rmd", output_dir="public")' # has to be output_dir='public' as there is hardcode in benchplot for that path
```

```{r opts, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
```

```{r render}
report_name = "history"
# Rscript -e 'rmarkdown::render("history.Rmd", output_dir="public")' # has to be output_dir='public' as there is hardcode in benchplot for that path
```

```{r init}
library(lattice)
source("report.R")
@@ -49,6 +47,10 @@ p = sapply(setNames(nm=as.character(unique(ld$solution))), simplify = FALSE, fun
sapply(seq_along(p), function(i) print(p[[i]], split=c(1, i, 1, length(p)), more=i!=length(p))) -> nul
```

------

Report was generated on: `r format(Sys.time(), usetz=TRUE)`.

```{r status, child="rmarkdown_child/status.Rmd"}
```{r status_set_success}
cat("history\n", file=get_report_status_file(), append=TRUE)
```
164 changes: 133 additions & 31 deletions index.Rmd
@@ -6,61 +6,161 @@ output:
includes:
in_header: ga.html
---
```{r render, include=FALSE}
# Rscript -e 'rmarkdown::render("index.Rmd", output_dir="public")' # has to be output_dir='public' as there is hardcode in benchplot for that path
```

This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against the very latest versions of these packages and updates automatically. We provide this as a service to both developers of these packages and to users.

We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks or not, and if the timing differences matter to you or not. A 10x difference may be irrelevant if that's just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have. Use this page to navigate to _task_ reports; as of now we have _groupby_ and _join_ tasks.
We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks or not, and if the timing differences matter to you or not. A 10x difference may be irrelevant if that's just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.

```{r opts, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
```

```{r render}
report_name = "index"
# Rscript -e 'rmarkdown::render("index.Rmd", output_dir="public")' # has to be output_dir='public' as there is hardcode in benchplot for that path
```{r helpers}
loop_benchplot = function(dt_task, report_name, code, exceptions, colors, data_namev, q_groupv) {
  path = file.path("public", report_name, "plots")
  for (data_name in data_namev) {
    in_rows = strsplit(data_name, "_", fixed=TRUE)[[1L]][2L]
    for (q_group in q_groupv) {
      benchplot(as.numeric(in_rows), task=report_name, data=data_name, timings=dt_task[question_group==q_group], code=code, exceptions=exceptions, colors=colors, fnam=paste(data_name, q_group, "png", sep="."), path=path, .interactive=FALSE)
    }
  }
}
link = function(data_name, q_group, report_name) {
  fnam = sprintf("%s.%s.png", data_name, q_group)
  path = file.path(report_name, "plots")
  sprintf("[%s](%s)", fnam, file.path(path, fnam))
}
hours_took = function(lld) {
  lld_script_time = lld[, .(n_script_time_sec=uniqueN(script_time_sec), script_time_sec=unique(script_time_sec)), .(solution, task, data)]
  if (nrow(lld_script_time[n_script_time_sec>1L]))
    stop("There are multiple different 'script_time_sec' for single solution+task+data on report 'index'")
  lld_script_time[, round(sum(script_time_sec, na.rm=TRUE)/60/60, 1)]
}
```
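The helpers above depend on a naming convention: the second `_`-separated field of a data name is the row count, and each plot file is named `<data_name>.<q_group>.png`. A minimal base R sketch of both steps follows. `link()` is copied from the chunk above, the `strsplit` line is lifted from `loop_benchplot()`, and the data name is one used later in this report; nothing here calls `benchplot()`.

```r
# link() as defined in the helpers chunk of this commit
link = function(data_name, q_group, report_name) {
  fnam = sprintf("%s.%s.png", data_name, q_group)
  path = file.path(report_name, "plots")
  sprintf("[%s](%s)", fnam, file.path(path, fnam))
}

data_name = "G1_1e7_1e2_0_0"

# same parsing step as in loop_benchplot(): take the second "_"-separated field
in_rows = strsplit(data_name, "_", fixed=TRUE)[[1L]][2L]
n_rows = as.numeric(in_rows)

# markdown link to the plot for the "basic" question group
md_link = link(data_name, q_group="basic", report_name="groupby")

print(n_rows)
print(md_link)
```

Note that `link()` builds a path relative to the report name, while `loop_benchplot()` writes the plots under `public/<report_name>/plots`.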

```{r init}
source("report.R", chdir=TRUE)
source("helpers.R", chdir=TRUE)
source("report-code.R", chdir=TRUE)
source("benchplot.R", chdir=TRUE)
ld = time_logs()
lld = ld[script_recent==TRUE]
lld_nodename = as.character(unique(lld$nodename))
if (length(lld_nodename)>1L)
  stop(sprintf("There are multiple different 'nodename' to be presented on single report '%s'", report_name))
lld_unfinished = lld[is.na(script_time_sec)]
if (nrow(lld_unfinished)) {
  warning(sprintf("Missing solution finish timestamp in logs.csv for '%s' (still running or launcher script killed): %s", paste(unique(lld_unfinished$task), collapse=","), paste(unique(lld_unfinished$solution), collapse=", ")))
}
```

```{r init, child="rmarkdown_child/init.Rmd"}
```{r report_groupby}
in_rows = c("1e7","1e8","1e9")
k_na_sort = c("1e2_0_0","1e1_0_0","2e0_0_0","1e2_0_1")
data_name = paste("G1", paste(rep(in_rows, each=length(k_na_sort)), k_na_sort, sep="_"), sep="_")
dt_groupby = lld[task=="groupby"][substr(data,1,2)=="G1"]
loop_benchplot(dt_groupby, report_name="groupby", code=groupby.code, exceptions=groupby.exceptions, colors=solution.colors, data_namev=data_name, q_groupv=c("basic","advanced"))
```
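To see which data names the `report_groupby` chunk passes to `loop_benchplot()`, the `paste`/`rep` expression can be evaluated on its own. The sketch below reuses the exact vectors from the chunk; the `outer()` call that lists the resulting plot file names is an illustrative addition, not part of the commit.

```r
in_rows = c("1e7","1e8","1e9")
k_na_sort = c("1e2_0_0","1e1_0_0","2e0_0_0","1e2_0_1")

# each row count is repeated once per k_na_sort case; k_na_sort is recycled
data_name = paste("G1", paste(rep(in_rows, each=length(k_na_sort)), k_na_sort, sep="_"), sep="_")

# illustrative only: one file per data name and question group,
# following the fnam pattern used in loop_benchplot()
q_groupv = c("basic","advanced")
fnam = paste0(as.vector(outer(data_name, q_groupv, paste, sep=".")), ".png")

print(data_name)
print(length(fnam))
```

This gives 3 row counts x 4 cases = 12 data names, and with two question groups 24 plot files.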

```{r report_join}
in_rows = c("1e7","1e8")
k_na_sort = c("NA_0_0")
data_name = paste("J1", paste(rep(in_rows, each=length(k_na_sort)), k_na_sort, sep="_"), sep="_")
dt_join = lld[task=="join"]
loop_benchplot(dt_join, report_name="join", code=join.code, exceptions=join.exceptions, colors=solution.colors, data_namev=data_name, q_groupv=c("basic"))
```

## Task {.tabset .tabset-fade .tabset-pills}

The plot below presents the chosen task for a single input data size and the _basic_ set of questions. Follow the link for detailed reports.
### groupby {.tabset .tabset-fade .tabset-pills}

### groupby {.active}
The timings below are for a single dataset case with random order, no NAs (missing values) and a particular cardinality factor (group size question 1, `k=100`). To see timings for other cases, click on the links below. If a solution is missing from the timings table for a particular data size, refer to the benchplot for the reason and check its speed on a smaller data size tab.

Full _groupby_ report available at [h2oai.github.io/db-benchmark/groupby.html](./groupby.html).
#### 0.5 GB {.tabset .tabset-fade .tabset-pills}

```{r o_groupby_plot}
dt_task = lld[task=="groupby" & question_group=="basic"]
fn = "1e9_1e2_0_0"
fnam = paste0("groupby.",fn,".png")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e9, task="groupby", data=paste0("G1_",fn), timings=dt_task, code=groupby.code, exceptions=groupby.exceptions, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
```
![](public/index/plots/groupby.1e9_1e2_0_0.png)
All data cases can be found at `r dt_groupby[in_rows=="1e7", .(q_grp_links=paste(link(unique(data), q_group=question_group, report_name="groupby"), collapse=", ")), by=question_group][, paste(q_grp_links, collapse=", ")]`.

### join
##### basic {.active}

Full _join_ report available at [h2oai.github.io/db-benchmark/join.html](./join.html).
![](public/groupby/plots/G1_1e7_1e2_0_0.basic.png)

```{r o_join_plot}
dt_task = lld[task=="join" & question_group=="basic"]
fn = "1e8_NA_0_0"
fnam = paste0("join.",fn,".png")
unlink(file.path("public",report_name,"plots", fnam))
benchplot(1e8, task="join", data=paste0("J1_",fn), timings=dt_task, code=join.code, exceptions=join.exceptions, colors=solution.colors, fnam=fnam, path=file.path("public",report_name,"plots"))
```
![](public/index/plots/join.1e8_NA_0_0.png)
##### advanced

![](public/groupby/plots/G1_1e7_1e2_0_0.advanced.png)

#### 5 GB {.tabset .tabset-fade .tabset-pills}

All data cases can be found at `r dt_groupby[in_rows=="1e8", .(q_grp_links=paste(link(unique(data), q_group=question_group, report_name="groupby"), collapse=", ")), by=question_group][, paste(q_grp_links, collapse=", ")]`.

##### basic {.active}

![](public/groupby/plots/G1_1e8_1e2_0_0.basic.png)

##### advanced

![](public/groupby/plots/G1_1e8_1e2_0_0.advanced.png)

#### 50 GB {.active .tabset .tabset-fade .tabset-pills}

All data cases can be found at `r dt_groupby[in_rows=="1e9", .(q_grp_links=paste(link(unique(data), q_group=question_group, report_name="groupby"), collapse=", ")), by=question_group][, paste(q_grp_links, collapse=", ")]`.

##### basic {.active}

![](public/groupby/plots/G1_1e9_1e2_0_0.basic.png)

##### advanced

![](public/groupby/plots/G1_1e9_1e2_0_0.advanced.png)

### join {.tabset .tabset-fade .tabset-pills}

The timings below are for datasets with random order and no NAs (missing values). The data size on each tab corresponds to the LHS dataset of the join, while the RHS datasets have the following sizes: _small_ (LHS/1e6), _medium_ (LHS/1e3), _big_ (LHS).
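
As a worked example of those RHS sizes, the sketch below computes the row counts for the two LHS sizes used on the join tabs (`1e7` rows on the 0.6 GB tab and `1e8` rows on the 6 GB tab, per the plot file names). The variable names are illustrative and not part of the benchmark scripts.

```r
# LHS row counts, labelled by the tab they appear on
lhs_rows = c("0.6 GB"=1e7, "6 GB"=1e8)

# RHS row counts per the definitions: small = LHS/1e6, medium = LHS/1e3, big = LHS
rhs_rows = rbind(
  small  = lhs_rows / 1e6,
  medium = lhs_rows / 1e3,
  big    = lhs_rows
)

print(rhs_rows)
```

So for the 0.6 GB tab the _small_ RHS table has only 10 rows, while the _big_ one matches the LHS at 1e7 rows.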

#### 0.6 GB {.tabset .tabset-fade .tabset-pills}

##### basic {.active}

![](public/join/plots/J1_1e7_NA_0_0.basic.png)

<!--
##### advanced
![](public/join/plots/J1_1e7_NA_0_0.advanced.png)
-->

#### 6 GB {.active .tabset .tabset-fade .tabset-pills}

##### basic {.active}

![](public/join/plots/J1_1e8_NA_0_0.basic.png)

<!--
##### advanced
![](public/join/plots/J1_1e8_NA_0_0.advanced.png)
-->

---

## Notes

- You are welcome to run this benchmark yourself! All scripts related to setting up the environment, data and benchmark are in the [repository](https://github.com/h2oai/db-benchmark).
- Data used to generate plots on this website can be obtained from [time.csv](./time.csv) (together with [logs.csv](./logs.csv)). See [report.R](https://github.com/h2oai/db-benchmark/blob/master/report.R) for a quick introduction on how to work with them.
- We ensure that calculations are not deferred by solution.
- We also tested that answers produced by different solutions match each other; for details see [answers-validation.R](https://github.com/h2oai/db-benchmark/blob/master/answers-validation.R).
- ClickHouse queries were made against `mergetree` table engine, see [#91](https://github.com/h2oai/db-benchmark/issues/91) for details.

## Environment configuration

- R 3.6.0
- python 3.6
- Julia 1.0.2

```{r environment, child="rmarkdown_child/environment.Rmd"}
```{r environment_hardware}
as.data.table(na.omit(fread("nodenames.csv")[lld_nodename, on="nodename", t(.SD)]), keep.rownames=TRUE)[rn!="nodename", .(Component=rn, Value=V1)][, kk(.SD)]
```

------
@@ -69,14 +169,16 @@ benchplot(1e8, task="join", data=paste0("J1_",fn), timings=dt_task, code=join.co

We limit the scope to what can be achieved on a single machine. Laptop size memory (8GB) and server size memory (250GB) are in scope. Out-of-memory using local disk such as NVMe is in scope. Multi-node systems such as Spark running in single machine mode are in scope, too. Machines are getting bigger: EC2 X1 has 2TB RAM and 1TB NVMe disk is under $300. If you can perform the task on a single machine, then perhaps you should. To our knowledge, nobody has yet compared this software in this way and published results too.

## Why this project
## Why db-benchmark?

Because we have been asked many times to do so, the first task and initial motivation for this page was to update the benchmark designed and run by [Matt Dowle](https://twitter.com/MattDowle) (creator of [data.table](https://github.com/Rdatatable/data.table)) in 2014 [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). The methodology and reproducible code can be obtained there. The exact code of this report and benchmark script can be found at [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark) created by [Jan Gorecki](https://github.com/jangorecki) funded by [H2O.ai](https://www.h2o.ai). In case of questions/feedback, feel free to file an issue there.

------

```{r timetaken, child="rmarkdown_child/timetaken.Rmd"}
```
Benchmark run took around `r hours_took(lld)` hours.
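
The figure above comes from `hours_took()`, which requires exactly one `script_time_sec` per solution+task+data before summing. The sketch below is a base R rework of that idea, not the data.table code from the commit. The function name `hours_took_base`, the solution names and the timings are all hypothetical.

```r
# hypothetical toy log: several rows per script, each repeating the script's total time
lld = data.frame(
  solution = c("solA", "solA", "solB"),
  task = c("groupby", "groupby", "groupby"),
  data = c("G1_1e7_1e2_0_0", "G1_1e7_1e2_0_0", "G1_1e7_1e2_0_0"),
  script_time_sec = c(5400, 5400, 1800)
)

# base R stand-in for hours_took(): one script time per solution+task+data, then sum
hours_took_base = function(lld) {
  per_script = unique(lld[, c("solution", "task", "data", "script_time_sec")])
  # a key appearing twice here means it had two different script times
  if (anyDuplicated(per_script[, c("solution", "task", "data")]))
    stop("There are multiple different 'script_time_sec' for single solution+task+data")
  round(sum(per_script$script_time_sec, na.rm=TRUE)/60/60, 1)
}

hours = hours_took_base(lld)
print(hours)
```

Deduplicating before summing matters because each script's total time is repeated on every row logged for that script.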

Report was generated on: `r format(Sys.time(), usetz=TRUE)`.

```{r status, child="rmarkdown_child/status.Rmd"}
```{r status_set_success}
cat("index\n", file=get_report_status_file(), append=TRUE)
```
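
Each report appends its own name to a shared status file as its last step, so a later step can check which reports rendered to completion. A minimal sketch of that pattern follows. `get_report_status_file()` is defined outside this diff, so a temporary file stands in for it here, and the completeness check at the end is an assumption about how the file could be used.

```r
# hypothetical stand-in for get_report_status_file()
status_file = tempfile("report-status-")

# what the last chunk of history.Rmd and index.Rmd does on success
cat("history\n", file=status_file, append=TRUE)
cat("index\n", file=status_file, append=TRUE)

# illustrative check: did every expected report finish?
finished = readLines(status_file)
expected = c("history", "index")
all_done = all(expected %in% finished)

print(finished)
print(all_done)
```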