Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cross-validation vignette needs updating #259

Open
Lewis-Barnett-NOAA opened this issue Sep 29, 2023 · 3 comments
Open

cross-validation vignette needs updating #259

Lewis-Barnett-NOAA opened this issue Sep 29, 2023 · 3 comments

Comments

@Lewis-Barnett-NOAA
Copy link
Collaborator

There is no elpd object created in most recent version, but this is a metric that the vignette focuses on and expects to be in the model object

@Lewis-Barnett-NOAA
Copy link
Collaborator Author

also, this model won't converge, with a non-positive-definite Hessian:

test_years <- c(2011, 2013, 2015, 2017)

models <- list()
log_lik <- list()
for (i in 1:length(test_years)) {
clust <- rep(1, nrow(pcod))
clust[which(pcod$year < test_years[i])] <- 2

models[[i]] <- sdmTMB_cv(
density ~ 0 + s(depth_scaled),
data = pcod,
mesh = mesh,
spatiotemporal = "off",
fold_ids = clust,
family = tweedie(link = "log"),
k_folds = length(unique(clust))
)

log_lik[[i]] <- sum(m_cv$data$cv_loglik[which(m_cv$data$year == test_years[i])])
}

@seananderson
Copy link
Member

I made some quick fixes and also switched eval=TRUE so these things will be detected in the future. It's probably worth restructuring so that model comparison comes first since that is the main point of cross validation. I removed the old LFOCV section. I started adding in a new example, but then realized that I am totally confused by how these arguments work:

library(sdmTMB)
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)
m_lfocv <- sdmTMB_cv(
  present ~ s(year, k = 3),
  data = pcod,
  mesh = mesh,
  lfo = TRUE, # do LFOCV
  lfo_forecast = 2, # number of time steps to forecast
  lfo_validations = 3, # number of times to validate
  family = binomial(),
  spatiotemporal = "off",
  time = "year" # must be specified
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

# See how the LFOCV folds were assigned:
example_data <- m_lfocv$models[[2]]$data
table(example_data$cv_fold, example_data$year)
#>    
#>     2003 2004 2005 2007 2009 2011 2013 2015 2017
#>   1  232  230  224  255  233    0    0    0    0
#>   2    0    0    0    0    0  251    0    0    0
#>   3    0    0    0    0    0    0  240    0    0
#>   4    0    0    0    0    0    0    0  238    0
#>   5    0    0    0    0    0    0    0    0  240

Created on 2023-09-29 with reprex v2.0.2

Here is what I would have expected:

An example of LFOCV with 9 time steps, lfo_forecast = 2, and lfo_validations = 3:
Fit data to time steps 1 to 5, predict and validate step 7.
Fit data to time steps 1 to 6, predict and validate step 8.
Fit data to time steps 1 to 7, predict and validate step 9.

One thing I'm not sure is if the time steps are 2007, 2008, 2009 etc. or 2007, 2009, 2011 (the steps provided). But I'm also just not sure how lfo_forecast and lfo_validations correspond to the table() above.

@ericward-noaa
Copy link
Collaborator

ericward-noaa commented Oct 11, 2023

Sorry I was slow here, see #262

For your example, of LFOCV with 9 time steps, lfo_forecast = 2, and lfo_validations = 3, the model is doing:
Fit data to time steps 1 to 5, predict and validate step 7.
Fit data to time steps 1 to 6, predict and validate step 8.
Fit data to time steps 1 to 7, predict and validate step 9.

This can be seen in a couple places in the code, e.g.

dat_fit <- data[data$cv_fold <= k, , drop = FALSE]

The time steps used are those provided (2007, 2009, 2011).

Here's an example of how the table is made, using a snippet from sdmTMB_cv:

data <- pcod
time <- "year"
data$cv_fold <- 1
t_validate <- sort(unique(data[[time]]), decreasing = TRUE)
lfo_validations <- 5
lfo_forecast <- 1
for (t in seq(1, lfo_validations)) {
  # fold id increasing order + forecast
  data$cv_fold[data[[time]] == t_validate[t]] <- lfo_validations - t + 1 + lfo_forecast
}
data$cv_fold <- as.numeric(as.factor(data$cv_fold))

table(data$cv_fold, data$year)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants