Spatial modelling of both estimate and dispersion? #355
-
I've come across …

Context
I plan to model spatial activity of carnivores based on latrine use (counts …).

Goal
In addition to estimating …

Questions
…

What I tried
I supposed that a simple intercept model with a spatial random field would capture the spatial variation of …
Replies: 2 comments 2 replies
-
Yes
This is possible. The fastest way to compute this is with draws from the joint parameter covariance matrix (actually the inverse: the precision matrix). Functionally, this is with the nsim argument in ?predict.sdmTMB. There's currently an example in the package readme. Theoretically you can get this from predict.sdmTMB() with se_fit = TRUE, but in practice this can be very slow when including the random field values.

On the other hand, if what you mean is that you want covariates that model the degree of observation error, that's not currently in the main branch. You're right that there's a branch with that functionality. Adding it to model fitting is simple, but I never finished making sure things like the print function worked with the output. If there's interest, I could prioritize bringing that over to the main branch (or at least updating that branch with the main branch).

I have never tried putting a GMRF into the dispersion linear predictor. Theoretically that should be possible. It would be a lot more bookkeeping, though, because of all the added parameters.

I guess my first question would be whether there's strong evidence that you need an explicit dispersion formula, let alone a latent random field in the dispersion formula. Counts are often modelled with a negative binomial (usually NB2) or a Poisson with additional lognormal dispersion (functionally, an observation-level random intercept). An NB2 is quite flexible, and given the zeros and large counts that can be allowed with a sufficiently large dispersion, I imagine modelling covariates on that dispersion would be challenging, especially with "only" ~100 locations.

To summarize, just in case this isn't clear: …
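For concreteness, here is a minimal sketch of the kind of model and the nsim route described above. It assumes hypothetical objects not in the original post: a data frame dat with latrine counts in count and UTM coordinates (km) in X/Y, and a prediction grid grid with the same coordinate columns; the mesh cutoff is a placeholder.

```r
library(sdmTMB)

# Hypothetical data: `dat` has latrine counts in `count` and UTM coordinates
# (km) in `X`/`Y`; `grid` is a prediction grid with the same columns.
mesh <- make_mesh(dat, xy_cols = c("X", "Y"), cutoff = 5)

# Intercept-only NB2 model with a spatial random field on the mean:
fit <- sdmTMB(
  count ~ 1,
  data = dat,
  mesh = mesh,
  family = nbinom2(link = "log"),
  spatial = "on"
)

# Fast route to spatial uncertainty: `nsim` draws from the joint precision
# matrix. Each column of `sims` is one draw of the link-scale prediction
# surface; summarize across draws for point estimates and SEs.
sims <- predict(fit, newdata = grid, nsim = 500)
grid$est <- apply(sims, 1, median)
grid$est_se <- apply(sims, 1, sd)

# Slower alternative (analytical SEs), which can be very slow when the
# random field values are included:
# p <- predict(fit, newdata = grid, se_fit = TRUE)
```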
-
I think you get this from looking at the code, but just to be clear (because it's a frequent point of confusion): predictions (i.e., estimates or expected values) will not be distributed like the original 'data' (where data usually means observations), but simulated values (i.e., predictions + observation error) should be. Predictions should be distributed similarly to some underlying true mean if you're simulating data, though (as you are). There are some subtleties, like needing a single draw from the random effects as if they were observed vs. the "empirical Bayes" estimates that are typically used in prediction (what is done in …).

Looking at that example (with some very pretty visuals!), I just want to be sure you're not conflating the standard error on the prediction (the SD of the samples) with the observation-level variance. While the observation variance is constant with the mean in a GLM with an identity link and Gaussian error, it grows with the mean according to the given mean-variance relationship for other families. E.g., with a log link and the NB2, the variance grows quadratically with the mean (Var = mu + mu^2/phi). Usually, this is sufficient to represent the data.

Checking this spatially would be tricky. One route would be to calculate randomized quantile residuals and look at their variance after binning them by some appropriately coarse grid. Just looking at the raw residuals or looking at the data wouldn't be enough because you have to account for the mean.
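A rough sketch of that kind of check, continuing with the hypothetical fit and dat from the sketch above and assuming an sdmTMB version where residuals() supports the "mle-mvn" type:

```r
# Simulated observations (predictions + observation error) should be
# distributed like the raw counts; the predictions themselves will not be.
sim_obs <- simulate(fit, nsim = 100)

# Randomized quantile residuals; the "mle-mvn" type uses a single draw from
# the random effects rather than the empirical Bayes estimates.
dat$resid <- residuals(fit, type = "mle-mvn")

# Bin the residuals on an appropriately coarse grid and check that their
# variance is roughly constant (~1 for standard-normal quantile residuals)
# across space.
dat$xbin <- cut(dat$X, breaks = 5)
dat$ybin <- cut(dat$Y, breaks = 5)
aggregate(resid ~ xbin + ybin, data = dat, FUN = var)
```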