
geom_raster() of a matrix: Performance analysis and improvements #4989


Summary

  • We will see how a plot of a 3000 × 4000 matrix can be rendered roughly 8 times
    faster than with geom_raster() (45 seconds down to about 6 seconds).

  • The differences in performance tell us that geom_raster() may not
    be the best choice to rasterize a matrix.

  • We can bring the timing further down to 1.5 seconds if we omit
    handling of missing values and reduce the palette of colours, but
    these are shortcuts ggplot2 can’t make.

  • In this issue we will see an efficient way of rasterizing a matrix, identify the main ggplot2 bottlenecks that limit the performance of that approach, and arrive at some pull requests to address them.

  • This issue is structured in several messages:

    • This message explains the problem, shows that ggplot2 is slower than expected, and establishes the performance we can aim for.
    • The second message shows how performance is improved with a custom geom, available as a ggplot extension, going down from 45 seconds to 18 seconds without any ggplot2 changes.
    • The third message profiles the custom geom, and points to two pull requests.
    • Afterwards, we'll discuss the scale-mapping system and propose improvements to it.
  • All code is included for reproducibility, but you are not expected to linger on the details.

  • If this work catches your attention: I'm looking for job opportunities, ideally starting around summer 2023. I'm happy to work remotely from Barcelona (Spain) or Mexico, and open to relocation if needed.


Introduction: A small example

In my field of work it is common to have a matrix that we need to plot
with something similar to filled.contour() or geom_raster(). The matrix
has two associated axes, one for its rows and one for its columns, and we
often need a scale transformation.

Here is a small example of the data:

library(bench)
library(ggplot2)
library(cowplot)
library(reshape2) # for melt
library(forcats)
x_small <- c(3,5,7,9)
y_small <- c(100,200,300)
# The matrix values are usually "random numbers" acquired from an instrument
# Here I choose them so they will have a nice scale when they are log2 transformed:
mat_small <- matrix(
  2^seq(from = 1, to = 64, length.out = length(x_small)*length(y_small)),
  nrow = length(x_small),
  ncol = length(y_small)
)
dimnames(mat_small) <- list(x = x_small, y = y_small)
mat_small
##    y
## x           100          200          300
##   3      2.0000 1.575265e+07 1.240729e+14
##   5    105.9524 8.345155e+08 6.572914e+15
##   7   5612.9576 4.420947e+10 3.482081e+17
##   9 297353.2218 2.342050e+12 1.844674e+19

We can use geom_raster() to plot it. Let’s call this the naïve
strategy, because we just use ggplot in a naïve way. This approach is
great and it works, but we’ll see that it does not scale very well…

naive_strategy <- function(mat) {
    df <- melt(mat, value.name = "intensity")
    gplt <- ggplot(df) +
      geom_raster(aes(x = x, y = y, fill = intensity)) +
      scale_fill_gradient(trans = "log2")
    gplt
}
gplt_naive_small <- naive_strategy(mat_small)
gplt_naive_small

[Plot: naive strategy applied to the small matrix]
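
It is worth noting what the naive strategy feeds into ggplot2: reshape2::melt() expands the matrix into one row per cell, so the larger matrix in the next section becomes a 12-million-row data frame before any plotting starts. A quick look at the small case (my own illustration, not part of the benchmark):

# melt() produces one row per matrix cell, using the dimnames as x/y columns.
df_small <- melt(mat_small, value.name = "intensity")
dim(df_small)      # 12 rows (4 x-values times 3 y-values), 3 columns: x, y, intensity
head(df_small, 3)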

Scaling with the naïve strategy:

The same problem, with a 3000 × 4000 matrix (12 million cells):

x_big <- seq(from = 3, by = 2, length.out = 3000)
y_big <- seq(from = 100, by = 100, length.out = 4000)
# The matrix values are usually "random numbers" acquired from an instrument
mat_big <- matrix(
  2^seq(from = 1, to = 64, length.out = length(x_big)*length(y_big)),
  nrow = length(x_big),
  ncol = length(y_big)
)
dimnames(mat_big) <- list(x = x_big, y = y_big)
mat_big[1:5,1:5]
##     y
## x         100      200      300      400      500
##   3  2.000000 2.021954 2.044148 2.066587 2.089272
##   5  2.000007 2.021961 2.044156 2.066594 2.089279
##   7  2.000015 2.021968 2.044163 2.066602 2.089287
##   9  2.000022 2.021976 2.044171 2.066609 2.089294
##   11 2.000029 2.021983 2.044178 2.066617 2.089302

This is how it looks:

naive_strategy(mat_big)

[Plot: naive strategy applied to the big matrix]

bm_baseline <- bench::mark(
  naive_strategy = {
    gplt <- naive_strategy(mat_big)
    benchplot(gplt)
  },
  iterations = 1L
)
## Warning: Some expressions had a GC in every iteration; so filtering is disabled.

bm_baseline
## # A tibble: 1 × 6
##   expression          min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 naive_strategy    45.4s    45.4s    0.0220    28.6GB    0.683
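
To see where the naive strategy spends its time, ggplot2::benchplot() (already wrapped inside bench::mark() above) reports a per-stage breakdown; a minimal sketch:

# benchplot() returns a data frame with one row per stage
# (construct, build, render, draw, plus a TOTAL row) and the usual
# user.self / sys.self / elapsed columns from system.time().
timings <- benchplot(naive_strategy(mat_big))
timings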

Cutting corners strategy:

This is “as fast as I can make it”. It is useful as a reference for what
can be done, but it is not realistic to expect ggplot2 to be this fast,
because this strategy takes the following shortcuts:

  • Ignore handling missing values
  • Limit to 256 colours in the image
cut_corners_strategy <- function(mat, low = "#132B43", high = "#56B1F7", num_levels = 256L) {
  # Helper function:
  mat_to_nativeRaster <- function(x, colormap, rangex)  {
    cols_to_ints <- farver::encode_native(colormap)
    breaks <- seq(from = rangex[1], to = rangex[2], length.out = length(colormap))
    xdim <- dim(x)
    rev_cols <- seq.int(ncol(x), 1L, by = -1L)
    x <- x[, rev_cols]
    x <- findInterval(x, breaks, rightmost.closed = TRUE)
    x <- cols_to_ints[x]
    x <- matrix(x, nrow = xdim[1], ncol = xdim[2], byrow = FALSE)
    
    structure(
      x,
      dim = c(xdim[2], xdim[1]),
      class = "nativeRaster",
      channels = 4L
    )
  }
  
  minmax <- range(mat)
  mat_trans <- log2(mat)
  palette <- scales::seq_gradient_pal(low = low, high = high)
  colormap <- palette(seq(from=0, to = 1, length.out = num_levels))
  nr <- mat_to_nativeRaster(mat_trans, colormap, log2(minmax))
  xmin <- as.numeric(rownames(mat)[1L])
  xmax <- as.numeric(rownames(mat)[nrow(mat)])
  ymin <- as.numeric(colnames(mat)[1L])
  ymax <- as.numeric(colnames(mat)[ncol(mat)])

  # The geom_rect is fake; it is only there to force the fill legend to appear.
  # Its limits also help set the plot limits.
  gplt <- ggplot() +
    geom_rect(
      data = data.frame(
        x = NA_real_
      ),
      mapping = aes(fill = x),
      xmin = xmin, xmax = xmax,
      ymin = ymin, ymax = ymax
    ) +
    annotation_raster(
      nr,
      xmin = xmin, xmax = xmax,
      ymin = ymin, ymax = ymax
    ) +
    scale_fill_gradient( # This has to match with the colormap above
      low = low, high = high,
      limits = minmax,
      na.value = "#00000000",
      trans = "log2"
    ) +
    labs(fill="Intensity") +
    lims(
      x = c(xmin, xmax),
      y = c(ymin, ymax)
    )
  gplt
}
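
The key data structure above is the nativeRaster: a plain integer matrix where each element is one packed RGBA pixel, which the graphics device can draw directly without ggplot2 mapping 12 million data rows through its scale machinery. A minimal toy illustration (my own, assuming a 2 × 2 image, not part of the benchmark):

# Pack four colours into a 2 x 2 nativeRaster. Its dim is c(height, width)
# and the values are farver's native integer encoding of the colours.
tiny <- structure(
  farver::encode_native(c("#132B43", "#56B1F7", "#56B1F7", "#132B43")),
  dim = c(2L, 2L),
  class = "nativeRaster",
  channels = 4L
)
# annotation_raster() places it in data coordinates, exactly as
# cut_corners_strategy() does with the full-size matrix.
ggplot() +
  annotation_raster(tiny, xmin = 0, xmax = 2, ymin = 0, ymax = 2) +
  lims(x = c(0, 2), y = c(0, 2))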

Let’s apply this to the small matrix:

cowplot::plot_grid(
  naive_strategy(mat_small) + labs(title = "naive"),
  cut_corners_strategy(mat_small) + labs(title = "cut corners"),
  ncol = 2
)

[Plot: naive vs. cut-corners strategies on the small matrix]

bm_cut_corners <- bench::mark(
  cut_corners = {
    gplt <- cut_corners_strategy(mat_big)
    benchplot(gplt)
  },
  iterations = 1L
)

bm_cut_corners
## # A tibble: 1 × 6
##   expression       min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 cut_corners    1.43s    1.43s     0.700     687MB        0

Fast and fair strategy

If we avoid cutting corners, we can still get quite decent performance.
Here we take care of missing values (if there are any) and we don’t
limit the palette to 256 colours.

fast_fair_strategy <- function(mat, low = "#132B43", high = "#56B1F7", na_colour = "grey30") {
  # Helper function:
  mat_to_nativeRaster <- function(x, palette, rangex)  {
    rev_cols <- seq.int(ncol(x), 1L, by = -1L)
    x <- scales::rescale(x, to = c(0, 1), from = rangex)
    colours <- palette(x)
    # Replace missing values with na_colour *before* encoding, so the result
    # stays an integer vector as nativeRaster requires
    colours[is.na(colours)] <- na_colour
    colours <- farver::encode_native(colours)
    xdim <- dim(x)
    x <- matrix(colours, nrow = xdim[1], ncol = xdim[2], byrow = FALSE)
    x <- x[, rev_cols]
    
    structure(
      x,
      dim = c(xdim[2], xdim[1]),
      class = "nativeRaster",
      channels = 4L
    )
  }
  
  minmax <- range(mat)
  mat_trans <- log2(mat)
  if (any(is.na(mat_trans) & !is.na(mat))) {
    warning("NA introduced by log transform")
  }
  palette <- scales::seq_gradient_pal(low = low, high = high)
  nr <- mat_to_nativeRaster(mat_trans, palette, log2(minmax))
  xmin <- as.numeric(rownames(mat)[1L])
  xmax <- as.numeric(rownames(mat)[nrow(mat)])
  ymin <- as.numeric(colnames(mat)[1L])
  ymax <- as.numeric(colnames(mat)[ncol(mat)])

  # The geom_rect is fake; it is only there to force the fill legend to appear.
  # Its limits also help set the plot limits.
  gplt <- ggplot() +
    geom_rect(
      data = data.frame(
        x = NA_real_
      ),
      mapping = aes(fill = x),
      xmin = xmin, xmax = xmax,
      ymin = ymin, ymax = ymax
    ) +
    annotation_raster(
      nr,
      xmin = xmin, xmax = xmax,
      ymin = ymin, ymax = ymax
    ) +
    scale_fill_gradient( # This has to match with the colormap above
      low = low, high = high,
      limits = minmax,
      na.value = "#00000000",
      trans = "log2"
    ) +
    labs(fill="Intensity") +
    lims(
      x = c(xmin, xmax),
      y = c(ymin, ymax)
    )
  gplt
}

Let’s apply this to the small matrix to assess correctness visually:

cowplot::plot_grid(
  naive_strategy(mat_small) + labs(title = "naive"),
  fast_fair_strategy(mat_small) + labs(title = "fast_fair"),
  ncol = 2
)

[Plot: naive vs. fast_fair strategies on the small matrix]
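
Beyond the visual comparison, a rough numerical check is possible (a sketch added here for illustration, assuming the small matrix with no missing values): the fill colours ggplot2 computes for the naive plot should essentially match the colours obtained by applying the same gradient palette to the rescaled log2 values.

# Colours as computed by ggplot2 for the naive plot:
built <- ggplot_build(naive_strategy(mat_small))
ggplot_fills <- built$data[[1]]$fill

# Colours computed "by hand", mirroring what fast_fair_strategy() does:
pal <- scales::seq_gradient_pal(low = "#132B43", high = "#56B1F7")
manual_fills <- pal(
  scales::rescale(log2(as.vector(mat_small)), from = log2(range(mat_small)))
)

# Up to the row ordering of the melted data frame, the two sets of hex
# colours should agree.
sort(unique(ggplot_fills))
sort(unique(manual_fills))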

bm_fast_fair <- bench::mark(
  fast_fair = {
    gplt <- fast_fair_strategy(mat_big)
    benchplot(gplt)
  },
  iterations = 1L
)
## Warning: Some expressions had a GC in every iteration; so filtering is disabled.

bm_fast_fair
## # A tibble: 1 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 fast_fair      5.9s     5.9s     0.169    1.79GB    0.678

Comparison of all strategies:

We can see that ggplot2 makes a lot of extra copies: it allocates (and the garbage collector has to reclaim) about 28.6 GB of RAM to produce a single plot. That hints at plenty of room for improvement.
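
A back-of-the-envelope calculation (mine, not part of the benchmark) puts that 28.6 GB in perspective:

# The melted data frame alone takes roughly 275 MB (three numeric columns of
# 12 million doubles), so 28.6 GB of allocations corresponds to on the order
# of 100 copies of the full input data over the lifetime of the plot.
n_cells <- 3000 * 4000                  # 12 million cells
column_mb <- n_cells * 8 / 1024^2       # ~91.6 MB per double column
melted_mb <- 3 * column_mb              # x, y and intensity columns: ~275 MB
(28.6 * 1024) / melted_mb               # ~106 melted-data-frame-sized allocations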

strategies <- rbind(
  bm_baseline,
  bm_cut_corners,
  bm_fast_fair
)
strategies
## # A tibble: 3 × 6
##   expression          min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 naive_strategy   45.41s   45.41s    0.0220   28.59GB    0.683
## 2 cut_corners       1.43s    1.43s    0.700   687.35MB    0    
## 3 fast_fair          5.9s     5.9s    0.169     1.79GB    0.678
ggplot(strategies) +
  geom_col(aes(
    x = forcats::fct_reorder(as.character(expression), as.numeric(min)),
    y = as.numeric(min), 
    fill=as.character(expression))
  ) +
  coord_flip() +
  labs(x = "Strategy", y = "CPU time (s)") + 
  guides(fill = "none")

[Plot: minimum CPU time per strategy]
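
For reference, the relative speed-ups with respect to the naive strategy can be read off the table above (a small sketch using the bench results already computed):

# min times are bench_time objects; as.numeric() converts them to seconds.
speedup <- as.numeric(strategies$min[1]) / as.numeric(strategies$min)
setNames(round(speedup, 1), as.character(strategies$expression))
# cut_corners is roughly 32x faster and fast_fair roughly 8x faster than naive_strategy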

ggplot2 takes more than 40 seconds, while the alternative approaches need
between roughly 1.5 and 6 seconds.

There is clearly room for improvement, and addressing it is my goal in this issue.
