r2c - Fast Iterated Statistic Computation in R

Proof of Concept. Experimental, incomplete, with an interface subject to change.

"Compiles" a subset of R into machine code so that expressions composed with that subset can be applied repeatedly on varying data without interpreter overhead. {r2c} provides speed ups of up to 100x for iterated statistics, with R semantics, and without the challenges of directly compilable languages.

"Compiling" R

{r2c} "compiles" R expressions or functions composed of basic binary operators and statistics. {r2c} also supports multi-line statements and assignment. "Compile" is in quotes because {r2c} generates an equivalent C program, and compiles that. To compute the slope of a single variable regression we might use:

library(r2c)

slope <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
r2c_slope <- r2cf(slope)

with(iris, r2c_slope(Sepal.Width, Sepal.Length))
## [1] -0.2233611
with(iris, slope(Sepal.Width, Sepal.Length))
## [1] -0.2233611
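
As a cross-check on the formula, the closed-form slope should agree with the coefficient from an ordinary least squares fit. The sketch below uses only base R and lm(); it does not exercise {r2c} itself, and the variable names are chosen here for illustration.

```r
slope <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

closed_form <- with(iris, slope(Sepal.Width, Sepal.Length))
# Slope coefficient from an ordinary least squares fit of Sepal.Length on Sepal.Width
lm_slope <- unname(coef(lm(Sepal.Length ~ Sepal.Width, data = iris))[2])

all.equal(closed_form, lm_slope)
## [1] TRUE
```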

While "r2c_fun" functions can be called in the same way as normal R functions as shown above, there is limited value in doing so. The primary use case of {r2c} functions is iteration.

Iterating {r2c} Functions

{r2c} is fast because it avoids the R interpreter overhead otherwise required for each iteration. There are currently two iteration mechanisms available:

group_exec: compute on disjoint groups in data (a.k.a. split-apply-combine).
roll*_exec: compute on (possibly) overlapping sequential windows in data.

For example, to iterate the slope function by groups, we could use:

with(iris, group_exec(r2c_slope, list(Sepal.Width, Sepal.Length), Species))
##    setosa versicolor  virginica
## 0.6904897  0.8650777  0.9015345
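
For comparison, the same split-apply-combine computation can be sketched in base R alone. This is the kind of per-group interpreted call that group_exec is meant to avoid; it is a baseline for illustration, not how {r2c} works internally.

```r
slope <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Split the two columns by species, then call the R function once per group
by_species <- split(iris[c("Sepal.Width", "Sepal.Length")], iris$Species)
base_slopes <- vapply(
  by_species,
  function(d) slope(d$Sepal.Width, d$Sepal.Length),
  numeric(1)
)
base_slopes
```

The values should match the group_exec output above; the difference is that here the R interpreter evaluates slope() once for every group.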

I have not found good alternatives for the general¹ use case of {r2c}, as can be seen from the timings of computing group and window slopes on larger data sets²:

{r2c} is substantially faster, primarily because it does not require calling an R function for each iteration. {collapse} does well with group statistics if you can translate a regular R expression to one that will be fast with it. {FastR} is also interesting, but has other drawbacks including the need for its own runtime application, and multiple warm up runs before reaching fast timings (the {FastR} times shown are after 4-5 runs).
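
To make the per-iteration overhead concrete, here is a minimal base R sketch of a rolling window slope. The helper roll_apply2 is hypothetical (it is not part of {r2c} or any package mentioned here) and only handles complete windows; the point is that the interpreter evaluates one R function call per window.

```r
slope <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Hypothetical helper: apply `fun` to every complete window of width `n`
roll_apply2 <- function(fun, x, y, n) {
  starts <- seq_len(length(x) - n + 1L)
  vapply(starts, function(i) {
    idx <- i:(i + n - 1L)
    fun(x[idx], y[idx])   # one interpreted R call per window
  }, numeric(1))
}

set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.1)
win_slopes <- roll_apply2(slope, x, y, n = 20)
length(win_slopes)
## [1] 81
```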

For the special case of a simple statistic many packages provide dedicated pre-compiled alternatives, some of which are faster than {r2c}:

Even for the simple statistic case {r2c} is competitive with dedicated compiled alternatives like those provided by {RcppRoll}, and {data.table}'s frollsum in "exact" mode. Implementations that re-use overlapping window sections such as {data.table}'s frollsum in "fast" mode, {roll}, and {slider}, will outperform {r2c}, particularly for larger windows. {data.table} and {roll} use "on-line" algorithms, and {slider} uses a "segment tree" algorithm, each with varying speed and precision trade-offs³.
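
The difference between recomputing each window and re-using overlapping sections can be sketched in base R. Both functions below are illustrative only and are not the implementations used by {data.table}, {roll}, or {r2c}: the "exact" version sums every window from scratch, while the "on-line" version updates the previous sum by adding the entering element and subtracting the leaving one.

```r
# Exact: recompute each window sum from scratch, work grows with window size
roll_sum_exact <- function(x, n) {
  vapply(seq_len(length(x) - n + 1L),
         function(i) sum(x[i:(i + n - 1L)]), numeric(1))
}

# On-line: update the previous sum, constant work per window
roll_sum_online <- function(x, n) {
  k <- length(x) - n + 1L
  out <- numeric(k)
  out[1L] <- sum(x[seq_len(n)])
  for (i in seq_len(k - 1L)) {
    out[i + 1L] <- out[i] + x[i + n] - x[i]
  }
  out
}

x <- c(1, 2, 3, 4, 5, 6)
roll_sum_exact(x, 3)
## [1]  6  9 12 15
roll_sum_online(x, 3)
## [1]  6  9 12 15
```

The on-line update carries rounding error forward from window to window, which is the source of the precision trade-off noted above; on small integer-valued data like this the two agree exactly.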

See Related Work and benchmark details.

To summarize:

For iterated calculations on numeric data, {r2c} is fastest at complex expressions, and competitive with specialized pre-compiled alternatives for simple expressions. Additionally, {r2c} observes base R semantics for the expressions it evaluates; if you know R you can easily use {r2c}.

Caveats

First is that {r2c} requires compilation. I have not included that step in timings⁴ under the view that the compilation time will be amortized over many calculations. The facilities for this don't exist yet, but the plan is to have {r2c} maintain a local library of pre-compiled user-defined functions, and for packages to compile {r2c} functions at install-time.

More importantly, we cannot compile and execute arbitrary R expressions:

Only {r2c} implemented counterpart functions may be used (currently: basic arithmetic/relational/comparison operators, statistics, {, and <-).
Primary numeric inputs must be attribute-less (e.g. to avoid expectations of S3 method dispatch or attribute manipulation), and any .numeric methods defined will be ignored⁵.
Future {r2c} counterparts will be limited to functions that return attribute-less numeric vectors of constant size (e.g. mean), or of the size of one of their inputs (e.g. +, or even quantile).
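
The constraints above can be illustrated with base R. The helper is_plain_numeric is hypothetical and reflects a strict reading of "attribute-less"; it is not the check {r2c} performs. The second half shows the result-size rules using the base R versions of the functions mentioned.

```r
# Hypothetical helper (not part of {r2c}): numeric vector carrying no attributes
is_plain_numeric <- function(x) is.numeric(x) && is.null(attributes(x))

is_plain_numeric(c(1.5, 2, 3))          # TRUE
is_plain_numeric(c(a = 1, b = 2))       # FALSE: carries names
is_plain_numeric(matrix(1:4, 2))        # FALSE: carries dim
is_plain_numeric(factor(c("a", "b")))   # FALSE: not numeric, has class and levels

# Result sizes in base R for the functions named above
x <- c(3, 1, 2)
length(mean(x))                         # 1: constant size
length(x + 1)                           # 3: size of the input
length(quantile(x, c(0.25, 0.75)))      # 2: size of the `probs` input
```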

Within these constraints r2c is flexible. For example, it is possible to have arbitrary R objects for secondary parameters, as well as to reference iteration-invariant data:

w <- c(1, NA, 2, 3)
u <- c(-1, 1, 0)
h <- rep(1:2, each=2)

r2c_fun <- r2cq(sum(x, na.rm=TRUE) * y)
group_exec(r2c_fun, data=list(x=w), groups=h, MoreArgs=list(y=u))
##  1  1  1  2  2  2
## -1  1  0 -5  5  0

Notice the na.rm, and that the u in list(y=u) is re-used in full for each group, setting the output size for each group to 3.
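
For reference, a base R sketch of the same computation, assuming the semantics described above (x is split by group, y is passed whole to every group). It is meant only to show where the six output values come from.

```r
w <- c(1, NA, 2, 3)
u <- c(-1, 1, 0)
h <- rep(1:2, each=2)

# `w` is split by group; `u` is not split, so each group yields length(u) values
res <- lapply(split(w, h), function(x) sum(x, na.rm=TRUE) * u)
base_out <- unlist(unname(res))
base_out
## [1] -1  1  0 -5  5  0
```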

With the exception of ifelse, the C counterparts to the R functions are intended to produce identical outputs, but have different implementations. As such, it is possible that for a particular set of inputs on a particular platform the results might diverge.
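
One general way such divergence can arise is that two implementations of the same arithmetic accumulate floating point values in a different order. The base R illustration below is not tied to any particular {r2c} function; it only shows why comparing results with a tolerance is safer than testing exact equality.

```r
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

a == b                    # typically FALSE with IEEE 754 doubles
isTRUE(all.equal(a, b))   # TRUE: equal within the default tolerance
```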

Future - Maybe?

In addition to cleaning up the existing code, there are many extensions that can be built on this proof of concept. Some are listed below. How many I end up working on will depend on some interaction of external interest and my own.

Expand the set of R functions that can be translated.
Nested "r2c_fun" functions.
Multi/character/factor grouping variables.
Additional runners (e.g. an apply analogue).
Library for previously "compiled" functions.
Basic loop support, and maybe logicals and branches.
Get on CRAN (there is currently at least one questionable thing we do).
API to allow other native code to invoke {r2c} functions.

Installation

This package is not available on CRAN yet. To install:

f.dl <- tempfile()
f.uz <- tempfile()
github.url <- 'https://github.com/brodieG/r2c/archive/main.zip'
download.file(github.url, f.dl)
unzip(f.dl, exdir=f.uz)
install.packages(file.path(f.uz, 'r2c-main'), repos=NULL, type='source')
unlink(c(f.dl, f.uz))

Or if you have {remotes}:

remotes::install_github("brodieg/r2c")

Related Work

"Compiling" R

FastR is an implementation of R that can JIT compile R code to run on the Graal VM. It requires a different runtime (i.e. you can't just run your normal R installation) and has other trade-offs, including warm-up cycles and compatibility limitations⁶. But otherwise you type in what you would have in normal R and see some impressive speed-ups.

The Ř virtual machine is an academic project that is superficially similar to FastR (its thesis explains differences). Additionally, renjin appears to offer similar capabilities and trade-offs to FastR. I have tried neither Ř nor renjin.

Closer to {r2c}, there are at least four packages that operate on the principle of translating R code into C (or C++), compiling that, and providing access to the resulting native code from R:

{Odin}, specialized for differential equation solving problems.
{ast2ast}, also targeting ODE solving and optimization.
{armacmp}, a DSL for formulating linear algebra code in R that is translated into C++.
{nCompiler}, a tool for generating C++ and interfacing it with R.

Most of these seem capable of computing iterated statistics in some form, and experienced users can likely achieve it with some work, but it will likely be difficult for someone familiar only with R.

Finally, {inline} and {Rcpp} allow you to write code in C/C++ and easily interface it with R.

Fast Group and Rolling Statistics

I am unaware of any packages that compile R expressions to avoid interpreter overhead in applying them over groups or windows of data. The closest are packages that recognize expressions for which they have equivalent pre-compiled code they run instead. This is limited to simple statistics:

{data.table}'s Gforce (see ?data.table::datatable.optimize).
In theory {dplyr}'s Hybrid Eval is similar to Gforce, but AFAICT it was quietly dropped and despite suggestions it might return for v1.1 I see no trace of it in the most recent 1.1 candidate development versions (as of 2022-07-03).

Additionally, there is {collapse} which provides specialized group statistic functions. These are quite fast, particularly for simple statistics, but you have to be familiar with {collapse} semantics to compose complex statistics from simple ones.

Several packages provide fast dedicated functions for a small set of simple rolling window statistics:

stats::filter for weighted rolling sums / means.
{data.table}'s froll* functions.
{slider} slide_<stat> and slide_index_<stat>.
{roll}.
{zoo} roll<stat>.
{RcppRoll}.
{runner}.
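
As a small illustration of the first item in the list above, stats::filter computes a rolling mean when given equal weights. This sketch uses only base R; the variable names are chosen here for illustration.

```r
x <- c(1, 2, 3, 4, 5)
wts <- rep(1 / 3, 3)   # equal weights give a rolling mean of width 3

# sides = 1 uses the current and previous values; sides = 2 centers the window
trailing <- as.numeric(stats::filter(x, wts, sides = 1))
centered <- as.numeric(stats::filter(x, wts, sides = 2))

trailing
## [1] NA NA  2  3  4
centered
## [1] NA  2  3  4 NA
```

Positions without a complete window are returned as NA.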

Acknowledgments

R Core for developing and maintaining such a wonderful language.
Matt Dowle and Arun Srinivasan for contributing {data.table}'s radix sort to R.
Sebastian Krantz for the idea of pre-computing group meta data for possible re-use (taken from collapse::GRP).
Achim Zeileis et al. for rollapply in {zoo} from the design of which roll*_exec borrows elements.
David Vaughan for ideas on window functions, including the index concept (position in the roll*_exec functions, borrowed from {slider}).
Byron Ellis and Peter Danenberg for the inspiration behind lcurry (see functional::CurryL), used in tests.
Hadley Wickham and Peter Danenberg for roxygen2.
Tomas Kalibera for rchk and the accompanying vagrant image, and rcnst to help detect errors in compiled code. Tomas also worked on the precursor to the Oracle FastR.
Winston Chang for the r-debug docker container, in particular because of the valgrind level 2 instrumented version of R.
Hadley Wickham et al. for ggplot2.

Footnotes

¹ It turns out there is roll::roll_lm that can compute slopes, but it cannot handle the general case of composing arbitrary statistics from the ones it implements.
² These timings do not include the reuse_calls optimization added in 0.2.0.
³ The "segment tree" algorithm will have better precision than the "on-line" algorithm, and while it is slower than the "on-line" algorithm (see the {roll} README for an explanation), it will begin to outperform {r2c} at window sizes larger than 100 as its performance scales with the logarithm of window size. The "on-line" algorithm is most susceptible to precision issues, but at least on systems with 80-bit long double accumulators, it seems likely that the "on-line" algorithm will be sufficiently precise for most applications.
⁴ The first compilation can be quite slow as it requires loading the compiler, etc. Subsequent compilations run in tenths of seconds.
⁵ E.g. don't expect S3 dispatch to work if you define mean.numeric, although why one would do that for functions covered by {r2c} is unclear.
⁶ My limited experience with {FastR} is that it is astonishing, but also frustrating. What it does is amazing, but the compatibility limitations are real (e.g. with the current (ca. Summer 2022) version neither {data.table} nor {ggplot2} install out of the box, and more), and performance is volatile (e.g. package installation and some other tasks are painfully slow, some expressions will hiccup after the initial warm-up). At this point it does not seem like a viable drop-in replacement for R. It likely excels at running scalar operations in loops and similar, something that R itself struggles at.
