NA/NaN gradient evaluation error encountered when running sdmTMB function with spatial on #288

Open
davjfish opened this issue Jan 12, 2024 · 22 comments

@davjfish

davjfish commented Jan 12, 2024

When working through this demo on a new computer and a fresh install of R (4.3.2), we are running into the following issue:

library(ggplot2)
library(dplyr)
library(sdmTMB)

glimpse(pcod)
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)
plot(mesh)

m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = mesh, # can be omitted for a non-spatial model
  family = binomial(link = "logit"),
  spatial = "on"
)

Produces this error:

Error in stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  : 
  NA/NaN gradient evaluation
In addition: Warning message:
In stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  :
  NA/NaN function evaluation

When spatial is set to "off", we do not get this error. Originally, we suspected this was a problem with running the library on Linux, but we have since reproduced it on Windows. The error has also been reproduced on R version 4.2.2. The error message is the same on Linux, but there we receive a few extra warnings:

Error in stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  : 
  NA/NaN gradient evaluation
In addition: Warning messages:
1: In Cholesky(h.pattern, super = super) :
  Cholmod warning 'matrix not positive definite' at file ../Supernodal/t_cholmod_super_numeric.c, line 911
2: In stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  :
  NA/NaN function evaluation
3: In Cholesky(h.pattern, super = super) :
  Cholmod warning 'matrix not positive definite' at file ../Supernodal/t_cholmod_super_numeric.c, line 911
@seananderson
Member

This is likely due to a mismatch between your installed version of Matrix and the version of Matrix used to build the sdmTMB binary on CRAN. It affects all TMB packages on CRAN. Install from source for now. We'll push a minor update to trigger a rebuild of the binary shortly.
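
To see what is actually installed before rebuilding anything, it can help to print the version and build metadata of the three packages involved. The following is a minimal sketch that uses only base R utilities; `report_builds` is a hypothetical helper name, not part of sdmTMB or TMB. It cannot prove a binary incompatibility, but a TMB or sdmTMB build that is older than the installed Matrix hints that a rebuild from source may be needed.

```r
# Hypothetical helper: summarise version and build metadata for the packages
# involved, without loading any of them.
report_builds <- function(pkgs = c("Matrix", "TMB", "sdmTMB")) {
  rows <- lapply(pkgs, function(pkg) {
    # system.file() returns "" when the package is not installed
    installed <- nzchar(system.file(package = pkg))
    if (!installed) {
      return(data.frame(package = pkg, version = NA_character_,
                        built = NA_character_, stringsAsFactors = FALSE))
    }
    data.frame(
      package = pkg,
      version = as.character(utils::packageVersion(pkg)),
      # "Built" records the R version, platform, and build timestamp
      built = utils::packageDescription(pkg, fields = "Built"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

print(report_builds())
```

Packages that are not installed show up as `NA` rows, so the helper is safe to run on a half-configured machine.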

@davjfish
Author

davjfish commented Jan 12, 2024

I wiped out all the installed packages and then ran this script on the Linux box:

install.packages("Matrix", type = "source")
install.packages("TMB", type = "source")
install.packages("sdmTMB", type = "source")
install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)
library(sdmTMB)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = mesh, # can be omitted for a non-spatial model
  family = binomial(link = "logit"),
  spatial = "on"
)

Same error, but different warnings:

image

@seananderson
Member

Have you restarted your R session to ensure the latest package installs are the ones loaded?

If that doesn't fix it, does a basic example with glmmTMB that has random effects run?

And if that works but sdmTMB doesn't, does the GitHub version work?
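
A rough way to check whether a restart is still needed is to compare the version of each package loaded in the current session with the version a fresh session would find on disk. This is only a sketch under assumptions: `needs_restart` is a hypothetical helper name, and it cannot detect a rebuild of the same version from source, so restarting remains the safe choice.

```r
# Hypothetical helper: TRUE means the version loaded in this session differs
# from the version a fresh session would load from the library paths.
needs_restart <- function(pkgs = c("Matrix", "TMB", "sdmTMB")) {
  # Only packages whose namespace is already loaded can be stale
  loaded <- intersect(pkgs, loadedNamespaces())
  vapply(loaded, function(pkg) {
    in_memory <- package_version(as.character(getNamespaceVersion(pkg)))
    # Passing lib.loc forces a search of the libraries in their usual order
    on_disk <- utils::packageVersion(pkg, lib.loc = .libPaths())
    isTRUE(in_memory != on_disk)
  }, logical(1))
}

needs_restart()
```

An empty result means none of the three packages is loaded yet, which is the state you want immediately after a restart.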

@davjfish
Author

davjfish commented Jan 15, 2024

I confirm that we have tried restarting the R session.

Here is the basic glmmTMB example we ran without any issue:

library(glmmTMB)
library(gamlss.dist)
dat <- data.frame(y =c(rZINBI(100, mu = 10, sigma = .6, nu=0.1),
                       rZINBI(100, mu = 5, sigma = .3, nu=.5)),
                  sites =c(rep("a", 100), rep("b", 100)),
                  year = rep(1:4, each = 10, times = 5),
                  trans = rep(1:40, each = 5, times = 1), 
                  area=rNO(200,20))

m1 <- glmmTMB(y ~ sites + (1|trans),
              zi=~0,
              family=nbinom1, data=dat)

Finally, we are getting the same result when installing the package directly from GitHub (R session was also restarted):

install.packages("Matrix", type = "source")
install.packages("TMB", type = "source")
install.packages("remotes")
remotes::install_github("pbs-assess/sdmTMB")
install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)
library(sdmTMB)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = mesh, # can be omitted for a non-spatial model
  family = binomial(link = "logit"),
  spatial = "on"
)

image

@seananderson
Member

I'm running out of ideas. Until now, I've always seen rebuilding from source with the latest Matrix version fix this.

Other information on the Matrix issue:
glmmTMB/glmmTMB#965
https://stat.ethz.ch/pipermail/r-package-devel/2023q4/010054.html
https://stackoverflow.com/a/77504843

One other option would be to install an archived version of Matrix, such as version Matrix_1.6-1.1.tar.gz:
https://cran.r-project.org/src/contrib/Archive/Matrix/
from before the ABI change.

install.packages("/path/to/downloads/Matrix_1.6-1.1.tar.gz", type = "source", repos = NULL)

Restart the R session, then try the binary version of sdmTMB:

install.packages("sdmTMB")

I'll get a new version of sdmTMB on CRAN shortly, which should let the binary version work.

Otherwise, maybe it's something about your R linear algebra (BLAS/LAPACK) setup or the C++ compiler settings in your Makevars? I don't see why glmmTMB would work and sdmTMB wouldn't, though, if both were built from source. The only thing I've seen cause this for models that should otherwise fit is this Matrix issue.

Everything seems to be working across all tested systems with continuous integration, including that basic example.

If you post the output of sessionInfo() (after relevant packages are loaded), it's possible I can recreate it in Docker.

@davjfish
Author

Yeah, this is strange. It is surprising that the error was reproduced on our end across two separate installs (Windows and Ubuntu) while the unit tests are running fine.

I tried the above suggestion (i.e., installation of Matrix 1.6-1.1 from zipped tarball) and this did not work either.

Here is the output from sessionInfo:

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8   
 [6] LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sdmTMB_0.4.1  dplyr_1.1.4   ggplot2_3.4.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12        TMB_1.9.10         nloptr_2.0.3       pillar_1.9.0       compiler_4.2.2     class_7.3-20       tools_4.2.2       
 [8] boot_1.3-28        lme4_1.1-35.1      lifecycle_1.0.4    tibble_3.2.1       nlme_3.1-160       gtable_0.3.4       lattice_0.20-45   
[15] mgcv_1.8-41        pkgconfig_2.0.3    rlang_1.1.3        Matrix_1.6-1.1     cli_3.6.2          DBI_1.2.1          rstudioapi_0.15.0 
[22] mvtnorm_1.2-4      e1071_1.7-14       withr_2.5.2        fmesher_0.1.5      generics_0.1.3     vctrs_0.6.5        classInt_0.4-10   
[29] grid_4.2.2         tidyselect_1.2.0   glue_1.7.0         sf_1.0-15          R6_2.5.1           fansi_1.0.6        sp_2.1-2          
[36] minqa_1.2.6        magrittr_2.0.3     MASS_7.3-58.1      units_0.8-5        scales_1.3.0       emmeans_1.9.0      splines_4.2.2     
[43] assertthat_0.2.1   colorspace_2.1-0   xtable_1.8-4       KernSmooth_2.23-20 utf8_1.2.4         proxy_0.4-27       estimability_1.4.1
[50] munsell_0.5.0     

I'll also see if I can get some of my more R-savvy colleagues here at GFC to try and reproduce the issue.

@seananderson
Member

It's possible that the Linux failure is related to the libopenblasp setup shown here, and that the Windows machine has the more usual Matrix version issue. I believe I would get the same error on continuous integration without this line:

install.packages("TMB", type = "source")

Regardless, the best path forward is for me to bump the version on CRAN to build a new binary, which I will prioritize doing in the next day or so.

If that doesn't solve things, I'll fire up a Docker image and see if I can debug with that BLAS/LAPACK setup.

@davjfish
Author

Ok great. Thanks for your help with troubleshooting this.

@seananderson
Member

OK, version 0.4.2 is now on CRAN. The Mac binaries are built. The Windows binaries will probably be built in the next day or so. It occurs to me now that I don't know how Linux and CRAN interact. Maybe they don't build binaries for you?

@davjfish
Author

Sorry, still not working.

I tried it on a clean install and installed the packages as follows:

install.packages("ggplot2")
install.packages("dplyr")
install.packages("sdmTMB")

All of the packages are installed from source. I think you are correct that binaries are not built for Linux users; at least not with the way our machine is set up.
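
Whether a machine gets source or binary packages by default can be checked directly from R. The sketch below only reads standard options; on Linux the default package type is "source", whereas Windows and macOS default to binaries. Note that some repositories other than CRAN can serve prebuilt Linux packages, so the repository URL is printed as well.

```r
# What kind of package does install.packages() fetch by default here?
pkg_type <- getOption("pkgType")      # session default, e.g. "source" on Linux
platform_type <- .Platform$pkgType    # platform's native package type
repos <- getOption("repos")           # repositories install.packages() will use

cat("Default pkgType:  ", pkg_type, "\n")
cat("Platform pkgType: ", platform_type, "\n")
cat("Repositories:     ", paste(repos, collapse = ", "), "\n")
```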

Here is the session info:

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8   
 [6] LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sdmTMB_0.4.2  dplyr_1.1.4   ggplot2_3.4.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12        TMB_1.9.10         nloptr_2.0.3       pillar_1.9.0       compiler_4.2.2     class_7.3-20       tools_4.2.2       
 [8] boot_1.3-28        lme4_1.1-35.1      lifecycle_1.0.4    tibble_3.2.1       nlme_3.1-160       gtable_0.3.4       lattice_0.20-45   
[15] mgcv_1.8-41        pkgconfig_2.0.3    rlang_1.1.3        Matrix_1.5-1       cli_3.6.2          DBI_1.2.1          e1071_1.7-14      
[22] withr_3.0.0        fmesher_0.1.5      generics_0.1.3     vctrs_0.6.5        classInt_0.4-10    grid_4.2.2         tidyselect_1.2.0  
[29] glue_1.7.0         sf_1.0-15          R6_2.5.1           fansi_1.0.6        sp_2.1-2           minqa_1.2.6        magrittr_2.0.3    
[36] MASS_7.3-58.1      scales_1.3.0       splines_4.2.2      units_0.8-5        assertthat_0.2.1   colorspace_2.1-0   utf8_1.2.4        
[43] KernSmooth_2.23-20 proxy_0.4-27       munsell_0.5.0     

I will try on my Windows computer once the binaries are available.

@davjfish
Author

Fresh install on Windows and I ran into the same error. I also had a colleague do this on their Windows PC and they got the same error. We are both running R 4.2.2.

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sdmTMB_0.4.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12        TMB_1.9.10         nloptr_2.0.3       pillar_1.9.0       compiler_4.2.2     class_7.3-20      
 [7] tools_4.2.2        boot_1.3-28        lme4_1.1-35.1      lifecycle_1.0.4    tibble_3.2.1       nlme_3.1-160      
[13] lattice_0.20-45    mgcv_1.8-41        pkgconfig_2.0.3    rlang_1.1.3        Matrix_1.5-1       DBI_1.2.1         
[19] cli_3.6.2          e1071_1.7-14       fmesher_0.1.5      dplyr_1.1.4        generics_0.1.3     vctrs_0.6.5       
[25] classInt_0.4-10    grid_4.2.2         tidyselect_1.2.0   glue_1.7.0         sf_1.0-15          R6_2.5.1          
[31] fansi_1.0.6        sp_2.1-2           minqa_1.2.6        magrittr_2.0.3     units_0.8-5        splines_4.2.2     
[37] MASS_7.3-58.1      assertthat_0.2.1   KernSmooth_2.23-20 utf8_1.2.4         proxy_0.4-27 

@seananderson
Member

seananderson commented Jan 22, 2024

I just confirmed that the following works on my DFO Windows laptop with several recent Matrix and TMB versions:

library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on"
)

However, the Matrix version above is very old (Matrix_1.5-1, 2022-09-13) and may not be compatible with TMB 1.9.10 (depending on whether it was built from source?). This breaking Matrix ABI change has been a big pain.

Can you confirm the following still does not work for you given current Matrix and TMB packages?

install.packages("Matrix")
install.packages("TMB")
install.packages("sdmTMB")

# restart R / RStudio to be safe... then

library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on"
)

CRAN checks seem fine and all binaries (except 'patched' Linux) are built. Hopefully it's an issue with old Matrix...

@davjfish
Author

davjfish commented Jan 22, 2024

When I do the above, it works on the Windows computer! Unfortunately, still no luck on the Linux computer.

When my colleague first tried this on the DFO computer:

install.packages("sdmTMB")
library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on"
)

it worked because he had several dependencies already installed. However, only after wiping out the C:\Users\USER\AppData\Local\R\win-library\4.2 folder did he get the famous error message.

@JoleneSutton

Hi, just chiming in to add my support for finding a resolution to using sdmTMB on a Linux computer.

@seananderson
Member

@JoleneSutton can you provide more details? Installed from CRAN? Installed from source or binary? GitHub? Matrix and TMB up to date? Can you post the output of sessionInfo()? Anything in your R Makevars file?

There's nothing inherent to Linux systems about why this should happen. I regularly use the package on Linux systems, it's tested on 3 Linux systems with every push to GitHub, and the CRAN servers test it on many Linux systems.

I'd like to get to the bottom of this! It's likely something about a specific setup and maybe with multiple data points we can track this down.
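
For the Makevars question, R can report which user-level and site-level Makevars files it would use when compiling packages. This is a small sketch built on `tools::makevars_user()` and `tools::makevars_site()`; `show_file` is a hypothetical helper name used only for printing.

```r
# Locate the Makevars files R consults when building packages from source.
user_makevars <- tools::makevars_user()   # character(0) if none exists
site_makevars <- tools::makevars_site()   # character(0) if none exists

# Hypothetical helper: print the path and contents of a Makevars file, if any
show_file <- function(label, path) {
  if (length(path) == 0L) {
    cat(label, ": none found\n", sep = "")
  } else {
    cat(label, ": ", path, "\n", sep = "")
    writeLines(readLines(path, warn = FALSE))
  }
  invisible(path)
}

show_file("User Makevars", user_makevars)
show_file("Site Makevars", site_makevars)
```

Custom optimization flags or a non-default compiler in either file would be worth including alongside the sessionInfo() output.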

@JoleneSutton

Hi @seananderson, yes, sorry, I should have been clearer. It is the same machine, and thus the same error messages, as described by @davjfish. I'm just hoping to be able to switch my scripts to that machine in order to free up my laptop. We still seem to be having issues with Linux, per the post from Jan. 22. Really appreciate all your help with this!

@seananderson
Member

seananderson commented Feb 27, 2024

I just spent a while debugging this with someone (with raw TMB/RTMB code, nothing to do with sdmTMB) who also had R version 4.2.2 installed. Even installing Matrix and TMB from source, in that order, did not fix it (edit: it did fix it, but TMB had to be built from source and R had to be restarted).

@seananderson
Member

It is still highly likely that the issue is an old Matrix package install. I see above that the installed version of Matrix is old. Current version is 1.6-5. Even for that person with R 4.2.2 I mentioned earlier today, once they installed the latest Matrix, then installed TMB from CRAN from source, the problem fixed itself. In this case (with an older R), you likely then also have to install sdmTMB from source. I can post some RTMB code that could be run to simplify testing a bit by eliminating the sdmTMB layer.
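
The order described above (latest Matrix first, then TMB from source, then sdmTMB from source, then a restart) can be sketched as a small wrapper. `rebuild_tmb_stack` is a hypothetical name, not a function from any package; by default it only prints the calls it would make, and building from source assumes a working compiler toolchain.

```r
# Hypothetical wrapper around the install order described in this thread.
# With dry_run = TRUE (the default) nothing is installed.
rebuild_tmb_stack <- function(dry_run = TRUE) {
  steps <- list(
    list(pkgs = "Matrix"),                    # latest Matrix first
    list(pkgs = "TMB", type = "source"),      # rebuild TMB against that Matrix
    list(pkgs = "sdmTMB", type = "source")    # rebuild sdmTMB against that TMB
  )
  for (args in steps) {
    # Render the call as text so the user can see exactly what would run
    call_text <- deparse(as.call(c(list(quote(install.packages)), args)))
    if (dry_run) {
      message("[dry run] ", call_text)
    } else {
      message(call_text)
      do.call(utils::install.packages, args)
    }
  }
  message("Now restart R before calling library(sdmTMB).")
  invisible(vapply(steps, function(a) a$pkgs, character(1)))
}

rebuild_tmb_stack()
```

Calling `rebuild_tmb_stack(dry_run = FALSE)` would perform the installs; the restart still has to be done by hand.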

@JoleneSutton

We upgraded to R 4.3.2 on the Linux machine and installed the updated packages, but unfortunately we are still having the same issues.

Here's the code:

install.packages("Matrix")
install.packages("TMB")
install.packages("sdmTMB")
# restart R / RStudio to be safe... then
library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on")

Here's the error message:
Error in stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr, :
NA/NaN gradient evaluation
In addition: Warning messages:
1: In .local(A, ...) :
CHOLMOD warning 'matrix not positive definite' at file '../Supernodal/t_cholmod_super_numeric.c', line 911
2: In stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr, :
NA/NaN function evaluation
3: In .local(A, ...) :
CHOLMOD warning 'matrix not positive definite' at file '../Supernodal/t_cholmod_super_numeric.c', line 911

And the session info:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

time zone: America/Halifax
tzcode source: system (glibc)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] sdmTMB_0.4.2

loaded via a namespace (and not attached):
[1] Matrix_1.6-5 dplyr_1.1.4 compiler_4.3.2 tidyselect_1.2.0
[5] Rcpp_1.0.12 assertthat_0.2.1 splines_4.3.2 boot_1.3-28.1
[9] lattice_0.21-9 R6_2.5.1 generics_0.1.3 classInt_0.4-10
[13] sf_1.0-15 MASS_7.3-60 tibble_3.2.1 nloptr_2.0.3
[17] fmesher_0.1.5 units_0.8-5 minqa_1.2.6 DBI_1.2.2
[21] TMB_1.9.10 pillar_1.9.0 rlang_1.1.3 utf8_1.2.4
[25] sp_2.1-3 cli_3.6.2 magrittr_2.0.3 mgcv_1.9-0
[29] class_7.3-22 grid_4.3.2 lme4_1.1-35.1 lifecycle_1.0.4
[33] nlme_3.1-163 vctrs_0.6.5 KernSmooth_2.23-22 proxy_0.4-27
[37] glue_1.7.0 fansi_1.0.6 e1071_1.7-14 tools_4.3.2
[41] pkgconfig_2.0.3

@seananderson
Member

I'm running out of ideas. Can you confirm that these were built from source and not installed from binaries?

install.packages("Matrix")
install.packages("TMB")
install.packages("sdmTMB")

I wondered if it could be the BLAS/LAPACK setup, but I just found someone with the same versions as you and it works for them. Again, you're sure the above installed from source?

As a troubleshooting exercise, does the following code run for you on this server down to the sdmTMB part? i.e., down to line 93 or so.
https://github.com/seananderson/RTMB-TESA-spatial/blob/main/exercises/05-spatiotemporal-spde.R

Then we can isolate if this is an sdmTMB install issue or a more fundamental TMB issue.

@stoyelq

stoyelq commented Jun 3, 2024

This issue still persists on a fresh install in Ubuntu 22. I tried installing everything from source and ran into the same NA/NaN gradient / matrix not positive definite errors. I also tried a clean install duplicating the steps in the passing GitHub Actions workflow, without any luck.

I tried the troubleshooting exercise and it crashes out on line 90 with the same type of error:

> opt <- nlminb(obj$par, obj$fn, obj$gr)
Error in .local(A, ...) :
  leading principal minor of order 405 is not positive
In addition: Warning message:
In .local(A, ...) :
  CHOLMOD warning 'matrix not positive definite' at file 'Supernodal/t_cholmod_super_numeric_worker.c', line 1114
Error in .local(A, ...) :
  leading principal minor of order 405 is not positive
In addition: Warning messages:
1: In nlminb(obj$par, obj$fn, obj$gr) : NA/NaN function evaluation
2: In .local(A, ...) :
  CHOLMOD warning 'matrix not positive definite' at file 'Supernodal/t_cholmod_super_numeric_worker.c', line 1114
Error in ff(x, order = 1) :
  inner newton optimization failed during gradient calculation
outer mgc:  NaN
Error in nlminb(obj$par, obj$fn, obj$gr) : NA/NaN gradient evaluation
>
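
For readers unfamiliar with the "leading principal minor ... is not positive" message: it is what a Cholesky factorization reports when the matrix it is given is not positive definite. The base R sketch below reproduces the same kind of failure on a tiny hand-made matrix. It only illustrates what the message means; in this thread the suspicion is a build mismatch rather than a genuinely ill-conditioned model.

```r
# A symmetric matrix with eigenvalues 3 and -1, so it is not positive definite
A <- matrix(c(1, 2,
              2, 1), nrow = 2, byrow = TRUE)

# chol() signals an error for such a matrix; capture it instead of stopping
result <- tryCatch(chol(A), error = function(e) e)
cat("Factorization failed:", inherits(result, "error"), "\n")
if (inherits(result, "error")) cat("Message:", conditionMessage(result), "\n")

# For comparison, a positive definite matrix factorizes without complaint
B <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
R <- chol(B)
cat("Reconstruction matches:", isTRUE(all.equal(crossprod(R), B)), "\n")
```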

@seananderson
Member

@stoyelq is this on the same server as above or a different Ubuntu setup? If it's different then maybe we can figure out what's in common?

This shouldn't be a general problem with Ubuntu 22 + sdmTMB or Ubuntu + OpenBLAS + sdmTMB. Both are regularly tested and used without issue (here on GitHub Actions, on CRAN, by me personally, and by many others). There must be something about this specific system setup. Probably the best hope of solving this is with Docker. If someone can reproduce the problem in Docker and point me to the Dockerfile, then I can build it and troubleshoot.

It's also worth confirming if this is something unique to sdmTMB or if this happens with other TMB random effects models built locally. E.g., starting with a basic random effects model such as 'thetalog.R', and if that works, also trying an SPDE spatial model as in 'spde.R'. Both are in this examples folder: https://github.com/kaskr/adcomp/blob/master/tmb_examples/


4 participants