Are tweedie models more computationally heavy than alternatives? #301
Replies: 2 comments
-
Fascinating... I think it's understandable that the Tweedie is slower and more memory hungry given the computational challenges in approximating the likelihood, but in testing this I noticed that glmmTMB scales fitting speed (and, I think, memory) better with large datasets and the Tweedie. E.g.:

```r
library(sdmTMB)
library(tictoc)

# match glmmTMB:
ctl <- sdmTMBcontrol(newton_loops = 0, multiphase = FALSE)

N <- 3e4
set.seed(1)
dat <- data.frame(
  y = fishMod::rTweedie(N, 1, 1, 1.5)
)

tic()
m2 <- sdmTMB(
  y ~ 1,
  data = dat,
  spatial = "off",
  family = tweedie("log"),
  control = ctl
)
toc()
#> 6.429 sec elapsed

tic()
m1 <- glmmTMB::glmmTMB(
  y ~ 1,
  data = dat,
  family = glmmTMB::tweedie("log")
)
toc()
#> 1.433 sec elapsed

# Gaussian ----------------------------
N <- 3e4
set.seed(1)
dat <- data.frame(
  y = rnorm(N, 0, 1)
)

tic()
m2 <- sdmTMB(
  y ~ 1,
  data = dat,
  spatial = "off",
  control = sdmTMBcontrol(newton_loops = 1, multiphase = FALSE)
)
toc()
#> 0.116 sec elapsed

tic()
m1 <- glmmTMB::glmmTMB(
  y ~ 1,
  data = dat
)
toc()
#> 0.219 sec elapsed
```

Created on 2024-02-15 with reprex v2.1.0

I'm not sure what's going on. My first thought was that maybe a single set of epsilon/omega random effects that are left mapped off at zero might be responsible, but when I try shrinking those to have dimensions of 0 it doesn't change much.

I don't see a similar scaling problem with the Gaussian. Actually, maybe it's there, but not until much larger data sizes and not as badly:

```r
library(sdmTMB)
library(tictoc)

N <- 3e6
set.seed(2)
dat <- data.frame(
  y = rnorm(N, 0, 1)
)
ctl <- sdmTMBcontrol(newton_loops = 0, multiphase = FALSE)

tic()
m2 <- sdmTMB(
  y ~ 1,
  data = dat,
  spatial = "off",
  control = ctl
)
#> Warning: The model may not have converged. Maximum final gradient:
#> 0.0594763641440051.
toc()
#> 13.274 sec elapsed

tic()
m1 <- glmmTMB::glmmTMB(
  y ~ 1,
  data = dat
)
toc()
#> 7.489 sec elapsed
```

Created on 2024-02-15 with reprex v2.1.0

There are gradient issues with the Gaussian big-data example there. I wonder if the optimizer settings are different? Or the starting values are better? Or if an internal parameter transformation is different? Or if it's because of extra parameters that are mapped off? I'll move this over to an issue.
-
This is now fixed. The problem was that I was ADREPORTing the Tweedie power parameter within a loop over the data; that results in one ADREPORT per observation and blows up the memory. Thanks for reporting this! 616e8cd
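For anyone curious, here's a minimal RTMB sketch of that anti-pattern (this is an illustration, not sdmTMB's actual C++ source; the objective functions and parameter names are hypothetical). Each call to `ADREPORT()` appends to the set of quantities whose delta-method standard errors are tracked, so calling it inside a loop over the data makes that set grow with `N`:

```r
library(RTMB)

set.seed(1)
dat <- list(y = rnorm(100))

# Anti-pattern: ADREPORT() inside the data loop, one report per observation
nll_bad <- function(parms) {
  getAll(parms, dat)
  nll <- 0
  for (i in seq_along(y)) {
    nll <- nll - dnorm(y[i], mu, exp(log_sd), log = TRUE)
    ADREPORT(exp(log_sd)) # recorded once per observation: scales with N
  }
  nll
}

# Fix: ADREPORT() once, outside the loop
nll_good <- function(parms) {
  getAll(parms, dat)
  ADREPORT(exp(log_sd)) # recorded once: constant overhead
  -sum(dnorm(y, mu, exp(log_sd), log = TRUE))
}

pars <- list(mu = 0, log_sd = 0)
obj_bad <- MakeADFun(nll_bad, pars, silent = TRUE)
obj_good <- MakeADFun(nll_good, pars, silent = TRUE)
# The ADREPORTed vector in sdreport(obj_bad) has length N,
# versus length 1 for obj_good.
```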
-
Hi!
@VThunell and I have played around with Tweedie models fitted to stomach content data (a continuous response with about 20% zeros). The dataset is quite large: approximately 100,000 observations spanning 50 years or so. In doing this, we found that a Tweedie model would quite often lead to

```
Error: vector memory exhausted (limit reached?)
```

or other types of memory issues causing R to crash, even on smaller subsets of the data. First we explored general memory settings in R, then packages (e.g., CRAN/dev versions of TMB, Matrix, INLA, etc.), and even R versions, but didn't find anything there. We have also tried this across three macOS Sonoma 14.3 laptops: Intel, M2, and M3 chips, with 8, 24, and 8 GB of RAM. On my laptop (the one with 24 GB of RAM), I can get away with the biggest subset of the data.
Here are some examples that hopefully reproduce for you.
The data can be found here:
But we can also reproduce it by modifying the pcod example so that it is of similar size.
The code below illustrates the issue for me, but on different laptops you'll find different thresholds where the model crashes. The Tweedie model fitted to the big data crashes, the delta_lognormal works fine and fast, and the last model is the Tweedie fitted to half the data, which works.
I also note that when I run `saveRDS()` on the delta_lognormal and the last Tweedie model, the difference in size is huge: 1.1 MB vs. 76 MB! Even though the delta_lognormal has two models and double the data. Is this working as intended? Is it something about the Tweedie that makes the model very big/slow?
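One way to narrow down where that size difference comes from is to measure the components of the fitted objects directly with base R's `object.size()` (a quick diagnostic sketch; `fit_tweedie` and `fit_delta` are placeholder names for the fits described above):

```r
# Report each top-level component of a fitted model object in MB,
# largest first, to see which component dominates the saved size.
inspect_size <- function(fit) {
  sizes <- vapply(fit, function(x) as.numeric(object.size(x)), numeric(1))
  sort(sizes / 1024^2, decreasing = TRUE)
}

# Hypothetical usage on the fits discussed above:
# inspect_size(fit_tweedie)
# inspect_size(fit_delta)
```

Note that `object.size()` doesn't account for shared environments, so the per-component sums may not exactly match the `saveRDS()` file size, but it usually points at the offending component.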
Here's the session I ran this in (though, as mentioned, this was reproduced with other versions as well):
*edited for spelling