---
title: "Database-like ops benchmark"
output:
  html_document:
    self_contained: no
    includes:
      in_header: ga.html
---
```{r render, include=FALSE}
# Rscript -e 'rmarkdown::render("index.Rmd", output_dir="public")' # output_dir must be "public" because benchplot hardcodes that path
```
This page benchmarks various database-like tools popular in open-source data science. It runs regularly against the very latest versions of these packages and updates automatically. We provide this as a service to both the developers of these packages and their users.
We also show the syntax being timed alongside each timing, so you can immediately see whether the tasks are ones you actually perform, and whether the timing differences matter to you. A 10x difference may be irrelevant if that is just 1s vs 0.1s at your data size. The intention is that you click the tab for the size of data you have.
```{r opts, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, cache=FALSE)
```
```{r helpers}
loop_benchplot = function(dt_task, report_name, code, exceptions, colors, data_namev, q_groupv) {
  message("starting loop_benchplot for task ", report_name)
  path = file.path("public", report_name, "plots")
  for (data_name in data_namev) {
    in_rows = strsplit(data_name, "_", fixed=TRUE)[[1L]][2L]
    for (q_group in q_groupv) {
      message("loop_benchplot ", report_name, " for ", data_name, " ", q_group)
      benchplot(as.numeric(in_rows), task=report_name, data=data_name, timings=dt_task[question_group==q_group], code=code, exceptions=exceptions, colors=colors, fnam=paste(data_name, q_group, "png", sep="."), path=path, .interactive=FALSE)
    }
  }
}
link = function(data_name, q_group, report_name) {
  fnam = sprintf("%s.%s.png", data_name, q_group)
  path = file.path(report_name, "plots")
  sprintf("[%s](%s)", fnam, file.path(path, fnam))
}
hours_took = function(lld) {
  lld_script_time = lld[, .(n_script_time_sec=uniqueN(script_time_sec), script_time_sec=unique(script_time_sec)), .(solution, task, data)]
  if (nrow(lld_script_time[n_script_time_sec>1L]))
    stop("There are multiple different 'script_time_sec' for single solution+task+data on report 'index'")
  lld_script_time[, round(sum(script_time_sec, na.rm=TRUE)/60/60, 1)]
}
```
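
For context, the `hours_took` helper above just sums per-script wall times and converts to hours. A hypothetical standalone illustration of that arithmetic (made-up timings, run outside the report):

```r
# Hypothetical illustration of the hours_took() arithmetic: per-script
# wall times in seconds are summed and converted to hours, rounded to 1dp.
script_time_sec = c(3600, 5400, 1800)  # made-up timings for three scripts
hours = round(sum(script_time_sec, na.rm=TRUE) / 60 / 60, 1)
hours
#> [1] 3
```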
```{r init}
source("report.R", chdir=TRUE)
source("helpers.R", chdir=TRUE)
source("report-code.R", chdir=TRUE)
source("benchplot.R", chdir=TRUE)
ld = time_logs()
lld = ld[script_recent==TRUE]
lld_nodename = as.character(unique(lld$nodename))
if (length(lld_nodename) > 1L)
  stop("There are multiple different 'nodename' to be presented on single report 'index'")
lld_unfinished = lld[is.na(script_time_sec)]
if (nrow(lld_unfinished)) {
  warning(sprintf("Missing solution finish timestamp in logs.csv for '%s' (still running or launcher script killed): %s", paste(unique(lld_unfinished$task), collapse=","), paste(unique(lld_unfinished$solution), collapse=", ")))
}
```
```{r report_groupby, message=FALSE}
in_rows = c("1e7","1e8","1e9")
k_na_sort = c("1e2_0_0","1e1_0_0","2e0_0_0","1e2_0_1")
data_name = paste("G1", paste(rep(in_rows, each=length(k_na_sort)), k_na_sort, sep="_"), sep="_")
dt_groupby = lld[task=="groupby"][substr(data,1,2)=="G1"]
loop_benchplot(dt_groupby, report_name="groupby", code=groupby.code, exceptions=groupby.exceptions, colors=solution.colors, data_namev=data_name, q_groupv=c("basic","advanced"))
```
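
The data case names built in the chunk above follow the pattern `G1_<rows>_<k>_<NA>_<sorted>` (the reading suggested by the text below: `1e2` is the group cardinality `k=100`, then the NA fraction, then a sorted flag). A hypothetical standalone run of that construction:

```r
# Hypothetical standalone expansion of the groupby data case names,
# mirroring the chunk above. paste() recycles k_na_sort over the
# repeated row counts, giving 3 x 4 = 12 case names.
in_rows   = c("1e7", "1e8", "1e9")
k_na_sort = c("1e2_0_0", "1e1_0_0", "2e0_0_0", "1e2_0_1")
data_name = paste("G1", paste(rep(in_rows, each=length(k_na_sort)), k_na_sort, sep="_"), sep="_")
head(data_name, 4)
#> [1] "G1_1e7_1e2_0_0" "G1_1e7_1e1_0_0" "G1_1e7_2e0_0_0" "G1_1e7_1e2_0_1"
```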
```{r report_join, message=FALSE}
in_rows = c("1e7","1e8")
k_na_sort = c("NA_0_0")
data_name = paste("J1", paste(rep(in_rows, each=length(k_na_sort)), k_na_sort, sep="_"), sep="_")
dt_join = lld[task=="join"]
loop_benchplot(dt_join, report_name="join", code=join.code, exceptions=join.exceptions, colors=solution.colors, data_namev=data_name, q_groupv=c("basic"))
```
## Task {.tabset .tabset-fade .tabset-pills}
### groupby {.tabset .tabset-fade .tabset-pills}
Timings below are presented for a single dataset case: random row order, no NAs (missing values), and a particular cardinality factor (group size of question 1, `k=100`). To see timings for other cases, click the links below. If a solution is missing from the timings table for a particular data size, refer to the benchplot for the reason and check its speed on a smaller data size tab.
#### 0.5 GB {.tabset .tabset-fade .tabset-pills}
All data cases can be found at `r dt_groupby[in_rows=="1e7", .(q_grp_links=paste(link(unique(data), q_group=question_group, report_name="groupby"), collapse=", ")), by=question_group][, paste(q_grp_links, collapse=", ")]`.
##### basic {.active}
![](public/groupby/plots/G1_1e7_1e2_0_0.basic.png)
##### advanced
![](public/groupby/plots/G1_1e7_1e2_0_0.advanced.png)
#### 5 GB {.tabset .tabset-fade .tabset-pills}
All data cases can be found at `r dt_groupby[in_rows=="1e8", .(q_grp_links=paste(link(unique(data), q_group=question_group, report_name="groupby"), collapse=", ")), by=question_group][, paste(q_grp_links, collapse=", ")]`.
##### basic {.active}
![](public/groupby/plots/G1_1e8_1e2_0_0.basic.png)
##### advanced
![](public/groupby/plots/G1_1e8_1e2_0_0.advanced.png)
#### 50 GB {.active .tabset .tabset-fade .tabset-pills}
All data cases can be found at `r dt_groupby[in_rows=="1e9", .(q_grp_links=paste(link(unique(data), q_group=question_group, report_name="groupby"), collapse=", ")), by=question_group][, paste(q_grp_links, collapse=", ")]`.
##### basic {.active}
![](public/groupby/plots/G1_1e9_1e2_0_0.basic.png)
##### advanced
![](public/groupby/plots/G1_1e9_1e2_0_0.advanced.png)
### join {.tabset .tabset-fade .tabset-pills}
Timings below are presented for datasets having random row order and no NAs (missing values). The data size on each tab corresponds to the LHS dataset of the join, while the RHS datasets come in three sizes: _small_ (LHS/1e6 rows), _medium_ (LHS/1e3 rows), _big_ (same size as LHS).
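
A small sketch (not part of the report machinery) of the RHS row counts implied by those ratios, for a given LHS row count:

```r
# Sketch of the RHS table sizes implied by the text: small = LHS/1e6,
# medium = LHS/1e3, big = same row count as LHS.
rhs_rows = function(lhs_rows) c(small=lhs_rows/1e6, medium=lhs_rows/1e3, big=lhs_rows)
rhs_rows(1e8)  # the 1e8-row (6 GB) case: 1e2, 1e5, 1e8 rows
```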
#### 0.6 GB {.tabset .tabset-fade .tabset-pills}
##### basic {.active}
![](public/join/plots/J1_1e7_NA_0_0.basic.png)
<!--
##### advanced
![](public/join/plots/J1_1e7_NA_0_0.advanced.png)
-->
#### 6 GB {.active .tabset .tabset-fade .tabset-pills}
##### basic {.active}
![](public/join/plots/J1_1e8_NA_0_0.basic.png)
<!--
##### advanced
![](public/join/plots/J1_1e8_NA_0_0.advanced.png)
-->
---
## Notes
- You are welcome to run this benchmark yourself! All scripts for setting up the environment, generating the data, and running the benchmark are in the [repository](https://github.com/h2oai/db-benchmark).
- Data used to generate the plots on this website can be obtained from [time.csv](./time.csv) (together with [logs.csv](./logs.csv)). See [report.R](https://github.com/h2oai/db-benchmark/blob/master/report.R) for a quick introduction to working with them.
- We ensure that calculations are not deferred by a solution.
- We also verified that answers produced by different solutions match each other; for details see [answers-validation.R](https://github.com/h2oai/db-benchmark/blob/master/answers-validation.R).
- ClickHouse queries were made against `mergetree` table engine, see [#91](https://github.com/h2oai/db-benchmark/issues/91) for details.
## Environment configuration
- R 3.6.0
- Python 3.6
- Julia 1.0.2
```{r environment_hardware}
as.data.table(na.omit(fread("nodenames.csv")[lld_nodename, on="nodename", t(.SD)]), keep.rownames=TRUE)[rn!="nodename", .(Component=rn, Value=V1)][, kk(.SD)]
```
------
## Scope
We limit the scope to what can be achieved on a single machine. Both laptop-size memory (8GB) and server-size memory (250GB) are in scope. Out-of-memory computation using local disk such as NVMe is in scope. Multi-node systems such as Spark running in single-machine mode are in scope, too. Machines keep getting bigger: an EC2 X1 instance has 2TB of RAM, and a 1TB NVMe disk costs under $300. If you can perform a task on a single machine, then perhaps you should. To our knowledge, nobody has yet compared this software in this way and published the results.
## Why db-benchmark?
Because we have been asked many times to do so. The first task, and the initial motivation for this page, was to update the benchmark designed and run by [Matt Dowle](https://twitter.com/MattDowle) (creator of [data.table](https://github.com/Rdatatable/data.table)) in 2014, published [here](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping); the methodology and reproducible code can be obtained there. The exact code of this report and the benchmark scripts can be found at [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark), created by [Jan Gorecki](https://github.com/jangorecki) and funded by [H2O.ai](https://www.h2o.ai). In case of questions or feedback, feel free to file an issue there.
------
Benchmark run took around `r hours_took(lld)` hours.
Report was generated on: `r format(Sys.time(), usetz=TRUE)`.
```{r status_set_success}
cat("index\n", file=get_report_status_file(), append=TRUE)
```