Recovery from sporadic errors in updaters #123

jarioksa · 2021-10-14T15:02:51Z

Several people have run into failing updaters that stop the run (e.g. issue #45, discussion in #96). Typically the failure is in function chol (Cholesky decomposition), and the typical error message is "the leading minor of order N is not positive definite". Often these errors are rare and occur only after long runs, but they still stop the run and the whole analysis is lost.

There is now experimental code to recover from these errors. Errors are no longer fatal and the whole run is not lost. When an updater fails, the previous parameter values are retained for that updater. Hopefully these parameters will be updated in the next iterations, and the analysis proceeds normally.
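
The recovery idea described above can be sketched in a few lines of base R. This is only an illustration of the principle, not the code in the updater-errors branch: the names try_updater and toy_updater are hypothetical, and the real updaters have different arguments.

```r
# Hypothetical wrapper (not an Hmsc function): run one updater and, if it
# throws an error, keep the previous parameter value instead of stopping.
try_updater <- function(updater, current, ...) {
  tryCatch(
    updater(current, ...),
    error = function(e) {
      message("updater failed, keeping previous value: ", conditionMessage(e))
      current
    }
  )
}

# Toy updater: the "new value" is the Cholesky factor of S, so it fails
# whenever S is not numerically positive definite.
toy_updater <- function(current, S) chol(S)

prev <- diag(2)

# Well-behaved matrix: the update succeeds and the new value is returned.
updated <- try_updater(toy_updater, prev, S = 4 * diag(2))

# Indefinite matrix: chol() fails, the error is reported as a message,
# and the previous value is retained.
kept <- try_updater(toy_updater, prev, S = matrix(c(1, 2, 2, 1), 2))
```

In this sketch a failed update simply leaves the parameter where it was, so a sampler calling try_updater in a loop would carry on to the next iteration.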

The experimental code can be installed directly from GitHub with

devtools::install_github("hmsc-r/HMSC", ref="updater-errors")

At the time of writing, the code is still very crude, but it should work and be usable for testing. We are going to develop the code further if it looks useful and seems to fix the problems of sporadically failing updaters. This first version of the fix is loud: the original error message is printed¹, but it is not fatal and the analysis continues. However, the message is uninformative: it does not yet tell which updater failed, nor does the code collect statistics on failure counts.² In general, if the error messages are not very frequent, the results (with some thinning) are probably trustworthy. A crude first guess is that a 1% error rate could be tolerated, but this guess can be improved with experience.

A few words about the errors users have experienced. In principle the errors should not occur: the mathematics are such that all Cholesky decompositions should succeed in theory. However, in finite-precision numerical computation some of the matrices may be on the margin of the parameter space, or so poorly conditioned that the decomposition fails. These are typically errors that occur rarely and after long runs. This also makes the errors difficult to track and fix, and therefore we have this experimental code.
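
A minimal base R illustration of the gap between theory and floating-point arithmetic, using a made-up 2 × 2 matrix rather than anything from Hmsc: in exact arithmetic the matrix below is positive definite (its determinant is 1e-17), but 1 + 1e-17 is rounded to exactly 1 in double precision, so the stored matrix is singular and chol fails.

```r
# Positive definite in exact arithmetic, singular once stored as doubles.
A <- matrix(c(1, 1,
              1, 1 + 1e-17), nrow = 2, byrow = TRUE)

# The tiny increment is lost to rounding.
lost <- (1 + 1e-17) == 1

# Capture the error message instead of stopping the script.
res <- tryCatch(chol(A), error = function(e) conditionMessage(e))
print(res)
```

The exact wording of the message depends on the R version, but it names the leading minor of order 2, the same kind of error that stops the sampler.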

A typical problematic case is a very large and possibly overparametrized model, which easily leads to internal matrices that are numerically "not positive definite" (which is the typical error message). A large number of correlated X-variates or other variables is particularly problematic. In these cases it is often best to simplify the model, at least at first (you can increase complexity later if it seems necessary).
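
One way to spot strongly correlated X-variates before fitting is to look at the correlations and the condition number of the design matrix. The sketch below uses simulated data and base R only; the variable names and the idea of treating a large condition number as a warning sign are illustrative assumptions, not an Hmsc diagnostic.

```r
set.seed(1)

# Simulated covariates: x2 is almost a copy of x1.
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)
X <- cbind(intercept = 1, x1 = x1, x2 = x2)

# Pairwise correlation of the two covariates (close to 1 here).
r <- cor(x1, x2)

# Condition number of the design matrix; large values indicate that
# matrices built from X are close to singular.
cond_X <- kappa(X, exact = TRUE)

print(c(correlation = r, condition_number = cond_X))
```

Dropping or combining near-duplicate covariates such as x2 is one way to simplify the model in the sense described above.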

Spatial models also frequently run into errors. Often the cause is a wide range of spatial scales, that is, a high ratio of largest to smallest distances. We try to model spatial structure with a single spatial scale parameter (alpha, α), and this can run into trouble with a wide range of distances. Particularly challenging are spatial clusters of closely packed points with the clusters far away from each other. In these cases it is often best to use a hierarchical model where spatial clusters are a random level and the spatial distances (if needed) are only analysed within these clusters. We are working on developing more robust spatial analysis, but you should still consider spatial scaling in your models.
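
The ratio of largest to smallest distances is easy to check before fitting. The sketch below uses made-up coordinates for two tight clusters that are far apart. The exponential covariance exp(-d/alpha) and the value of alpha are assumptions made for illustration only.

```r
# Two clusters of three points each: points within a cluster are 0.01
# units apart, the clusters are about 141 units apart.
xy <- rbind(
  c(0, 0),     c(0.01, 0),     c(0, 0.01),
  c(100, 100), c(100.01, 100), c(100, 100.01)
)

d <- dist(xy)

# Ratio of the largest to the smallest pairwise distance.
dist_ratio <- max(d) / min(d)

# Assumed exponential spatial covariance with a single scale parameter.
alpha <- 50
W <- exp(-as.matrix(d) / alpha)

# Condition number of the covariance matrix: with a single alpha the
# within-cluster points are almost perfectly correlated.
cond_W <- kappa(W, exact = TRUE)

print(c(distance_ratio = dist_ratio, condition_number = cond_W))
```

With data like these, no single value of alpha describes both the within-cluster and the between-cluster distances well, which is the situation where treating clusters as a random level may be the safer choice.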

Please try the new experimental code. We do not run into these problems ourselves, so we cannot test the code. It is still best to try to define models so that they do not run into Cholesky errors. However, if these problems occur rarely, the experimental code can save your day (or save your week in long runs), and if the error messages are rare you can trust the results as much as in any other Bayesian MCMC (that is, you need to check the convergence anyway). All feedback will be used to improve the code, and we will only proceed if this looks useful.

Updates:

¹ The messages will not be printed in parallel processing, and never after commit 2620f99. Updated 21/10/21, 14/12/21.
² Failure counts are collected and printed after sampleMcmc, and saved as attribute failedUpdates of each postList. After commit 011193e, print.Hmsc will permanently show info on failures. Updated 21/10/21, 14/12/21.
Information: We have had access to one model with frequent failures. These failures only occurred in one of four chains. Apparently that chain drifted to a corner of the parameter space and could not find its way out. The best action in this situation (one failed chain, several good chains) is to remove the faulty chain. For example, you can remove chain 3 in model m with m$postList[[3]] <- NULL. Updated 14/12/21.
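
The failure counts and the chain removal can be combined in a short script. The sketch below runs on a mock object, not on a real Hmsc model. The shape of the failedUpdates attribute (a named vector of counts per updater), the updater name, the number of iterations, and the use of the crude 1% guess as a threshold are all assumptions made for illustration; check the attribute in your own model before relying on any of this.

```r
# Mock stand-in for a fitted model: one postList element per chain.
m <- list(postList = list(list(), list(), list(), list()))

# Assumed shape of the attribute: failure counts per updater.
# "someUpdater" is a hypothetical name.
attr(m$postList[[3]], "failedUpdates") <- c(someUpdater = 250L)

n_iter <- 1000  # hypothetical number of iterations per chain

# Collect the attribute from every chain (NULL where nothing failed).
failed <- lapply(m$postList, attr, "failedUpdates")

# Total failures and failure rate per chain.
total <- vapply(failed, function(f) if (is.null(f)) 0 else sum(f), numeric(1))
rate <- total / n_iter

# Chains above the crude 1% guess.
bad_chains <- which(rate > 0.01)

# Remove the faulty chains, as with m$postList[[3]] <- NULL.
m$postList[bad_chains] <- NULL
length(m$postList)
```

After removing a chain, the remaining chains still need the usual convergence checks.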

jarioksa mentioned this issue Oct 14, 2021

Error in chol.default(iU) : the leading minor of order 2 is not positive definite #45

Open

jarioksa mentioned this issue Dec 2, 2021

error for predict along a gradient with NNGP #96

Open

jarioksa added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Dec 14, 2021

wwhbeach mentioned this issue Dec 27, 2023

Hmsc Error: 8 nodes produced errors; first error: the leading minor of order 4 is not positive definite #174

Open

amybauer mentioned this issue Jan 16, 2024

sampleMcmc Error in checkForRemoteErrors(val) : one node produced an error: NA/NaN/Inf in foreign function call #176

Open
