
get_index potential computation issues for very large prediction grids #334

Open
Lewis-Barnett-NOAA opened this issue Apr 9, 2024 · 8 comments

Comments

@Lewis-Barnett-NOAA
Collaborator

Unfortunately, this is another one that's hard to offer an MWE for because of the model complexity, but I've found that get_index() produces NAs for the index and its SD when the prediction grid/prediction object includes too many points. In the example I was working with, if I split the same prediction grid into two subareas, the subarea indices compute as expected. If I can pin the issue down more definitively, I'll update this issue.

Possible solutions:

  1. Leave it up to the user to coarsen their prediction grid as needed, and perhaps make a note about this somewhere in the docs
  2. Include built-in functionality to coarsen the grid, to an extent based on a user-specified argument (see the rough sketch below)
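
A rough sketch of what option 2 could look like, purely as an illustration; coarsen_grid() and res are hypothetical names, and this assumes a regular prediction grid with X and Y columns:

library(dplyr)

# Snap X/Y to a coarser resolution and keep one row per coarse cell
coarsen_grid <- function(grid, res) {
  grid %>%
    mutate(X = round(X / res) * res, Y = round(Y / res) * res) %>%
    distinct(X, Y, .keep_all = TRUE)
}

Note that if a coarsened grid were passed to get_index(), the area argument would need to scale up accordingly, since each remaining row represents a larger cell.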
@sebpardo
Contributor

I often encounter memory allocation errors due to very large prediction grids, and cutting the grid into chunks works well for me; a function that automates this could be a third solution to this issue:

library(dplyr)
library(purrr)

# Split the prediction grid into chunks of at most max_chunk_length rows
max_chunk_length <- 500000
grid_chunks <- prediction_grid %>%
  group_split(group_id = row_number() %/% max_chunk_length)

# Predict on each chunk and back-transform to the response scale
predict_list <- map(grid_chunks,
                    ~ predict(fit, newdata = .x)$est %>%
                        fit$family$linkinv(.))
unlist(predict_list)

@seananderson
Member

Nice idea, @sebpardo. For grids that are still small enough to predict on for a single year at a time, the simplest approach would be to iterate over each year (or set of years). E.g., #314 (reply in thread)
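
A minimal sketch of that idea, assuming a fitted model fit and a multi-year prediction grid nd with a year column:

# Predict one year at a time, then bind the results once at the end
years <- sort(unique(nd$year))
pred_list <- lapply(years, function(y) {
  predict(fit, newdata = nd[nd$year == y, , drop = FALSE])
})
pred_all <- do.call(rbind, pred_list)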

I'm in the midst of overhauling how extra time slices are handled and as part of that, get_index() should also become a bit more memory efficient with this kind of slicing.

I'm not sure how to catch this, other than perhaps checking for NAs in the index data frame and issuing a message suggesting this may be the issue and how to work around it.
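
Something along these lines, as a hypothetical sketch rather than the actual implementation (ind being a get_index() result):

if (anyNA(ind$est) || anyNA(ind$se)) {
  message(
    "NAs detected in the index. The prediction grid may be too large; ",
    "consider coarsening it or splitting the calculation by year."
  )
}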

We could also do some benchmarking, and perhaps it makes sense to do the bias correction year by year by default (I have a script somewhere where I was already doing this).

Also, make sure you're on the latest version, because I recently made this more memory efficient overall.

@Lewis-Barnett-NOAA
Collaborator Author

Lewis-Barnett-NOAA commented Apr 18, 2024 via email

@seananderson
Member

On second thought, the approach by @sebpardo above will work for prediction on large grids, but it won't work if your goal is to do spatial integration with uncertainty, as with get_index().

I've been doing some memory profiling of various approaches:
[Figure: memory profiling of the various approaches]

It appears that using lapply() and do.call() on year-by-year predictions is a huge win on memory for large grids, at the expense of time (mostly because of all the separate predictions, I imagine).
It might be worth having an argument in get_index() that integrates over a specific year or set of years, so the prediction could still be done just once.

Interestingly, purrr::map_dfr() is very memory hungry compared to map() or lapply() followed by converting the list to a data frame afterwards.
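
For reference, a sketch of the two patterns being compared (assuming fit and grid_chunks from above):

# purrr::map_dfr(): map and bind into one data frame in a single call
pred_df <- purrr::map_dfr(grid_chunks, ~ predict(fit, newdata = .x))

# lapply() then a single do.call(rbind, ...): build the list first, bind once
pred_list <- lapply(grid_chunks, function(chunk) predict(fit, newdata = chunk))
pred_df <- do.call(rbind, pred_list)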

I'll post code later when I've worked this out more.

@MikkoVihtakari tagging you in this issue too.

@KristaDBaker

I have been running into this issue a lot lately too, even when making the prediction grid cells relatively large (i.e., coarsening the grid). Has there been any movement on a more permanent fix (e.g., an argument within get_index())? Thanks.

@seananderson
Member

I wrote an experimental function in a new branch, 'index-split': 13a55d4

Try it out and let me know if you have any issues or interface suggestions before I merge into main.

Install with:

pak::pak("pbs-assess/sdmTMB@index-split")

See ?get_index_split for documentation.

This can't just be an argument in get_index() because of the way sdmTMB is set up: the prediction and index calculation are two separate steps, so we need something that wraps both.
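
For context, the two steps it wraps, shown for the full grid in one go (using m and nd as defined in the example below):

# Step 1: predict, keeping the TMB object needed for index calculation
p <- predict(m, newdata = nd, return_tmb_object = TRUE)
# Step 2: integrate the predictions over the grid
ind <- get_index(p, bias_correct = TRUE)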

Note that this will be slower if you don't need it; it's a speed/memory tradeoff. See the rough figure above.

library(sdmTMB)
mesh <- make_mesh(pcod, c("X", "Y"), n_knots = 60)
m <- sdmTMB(
  data = pcod,
  formula = density ~ 0 + as.factor(year),
  time = "year", mesh = mesh, family = tweedie(link = "log")
)
nd <- replicate_df(qcs_grid, "year", unique(pcod$year))
ind <- get_index_split(m, newdata = nd, nsplit = 2, bias_correct = TRUE)
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■                  50% | ETA:  0s
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
ind
#>   year      est       lwr      upr  log_est        se  type
#> 1 2003 277556.6 198537.77 388025.4 12.53378 0.1709448 index
#> 2 2004 399682.6 311373.58 513036.9 12.89843 0.1273887 index
#> 3 2005 430225.5 327502.38 565168.4 12.97206 0.1391935 index
#> 4 2007 119509.7  86514.07 165089.6 11.69115 0.1648452 index
#> 5 2009 210258.6 151807.71 291215.1 12.25609 0.1661887 index
#> 6 2011 339420.7 262199.67 439384.3 12.73500 0.1317035 index
#> 7 2013 351455.2 258755.05 477365.7 12.76984 0.1562276 index
#> 8 2015 383241.4 285819.07 513870.4 12.85642 0.1496487 index
#> 9 2017 192367.9 136764.93 270576.6 12.16716 0.1740572 index

Created on 2024-06-26 with reprex v2.1.0

@Lewis-Barnett-NOAA
Collaborator Author

Lewis-Barnett-NOAA commented Jun 27, 2024 via email

@Lewis-Barnett-NOAA
Collaborator Author

Finally got around to testing this. I'm having some issues with how get_index_split() parses the area argument. If I give it a vector of areas, one for each row of the prediction grid, I get this error from get_generic() indicating that the area is not the right length:

[Screenshot: error from get_generic() indicating the area vector is not the expected length]

Does this have to do with the transition away from the "fake newdata" padding used in earlier versions, a change that didn't get reflected in this branch?

If I give it a single value for area, I get a different error message indicating a mismatch in expected object dimensions (the same message as in this issue).
