
get_index potential computation issues for very large prediction grids #334

Open
Lewis-Barnett-NOAA opened this issue Apr 9, 2024 · 8 comments

Comments

@Lewis-Barnett-NOAA
Collaborator

Unfortunately, this is another one that's hard to offer an MWE for because of the model complexity, but I've found that get_index() produces NAs for the index and its SD when the prediction grid/prediction object includes too many points. In the example I was working with, if I split the same prediction grid into two subareas, the subarea indices compute as expected. If I can pin the issue down more definitively, I'll update this issue.

Possible solutions:

  1. Leave it up to the user to coarsen their prediction grid as needed, and perhaps make a note about this somewhere in the docs
  2. Include built-in functionality to coarsen the grid, to an extent based on a user-specified argument (see the rough sketch below)
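
A rough sketch of what option 2 could look like, purely as an illustration; coarsen_grid() and res are hypothetical names, and this assumes a regular prediction grid with X and Y columns:

library(dplyr)

# Snap X/Y to a coarser resolution and keep one row per coarse cell
coarsen_grid <- function(grid, res) {
  grid %>%
    mutate(X = round(X / res) * res, Y = round(Y / res) * res) %>%
    distinct(X, Y, .keep_all = TRUE)
}

Note that if a coarsened grid were passed to get_index(), the area argument would need to scale up accordingly, since each remaining row represents a larger cell.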
@sebpardo
Contributor

I often encounter memory allocation errors due to very large prediction grids, and cutting the grid into chunks works well for me; a function that automates this could be a third solution to this issue:

library(dplyr)
library(purrr)

# Split the prediction grid into chunks of at most max_chunk_length rows
max_chunk_length <- 500000
grid_chunks <- prediction_grid %>%
  group_split(group_id = row_number() %/% max_chunk_length)

# Predict on each chunk and back-transform to the response scale
predict_list <- map(grid_chunks,
                    ~ predict(fit, newdata = .x)$est %>%
                        fit$family$linkinv(.))
unlist(predict_list)

@seananderson
Member

Nice idea, @sebpardo. For grids that are still small enough to predict on for a single year at a time, the simplest approach would be to iterate over each year (or set of years). E.g., #314 (reply in thread)
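
A minimal sketch of that idea, assuming a fitted model fit and a multi-year prediction grid nd with a year column:

# Predict one year at a time, then bind the results once at the end
years <- sort(unique(nd$year))
pred_list <- lapply(years, function(y) {
  predict(fit, newdata = nd[nd$year == y, , drop = FALSE])
})
pred_all <- do.call(rbind, pred_list)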

I'm in the midst of overhauling how extra time slices are handled and as part of that, get_index() should also become a bit more memory efficient with this kind of slicing.

I'm not sure how to catch this, other than perhaps checking for NAs in the index data frame and issuing a message suggesting this may be the issue and how to work around it.
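
Something along these lines, as a hypothetical sketch rather than the actual implementation (ind being a get_index() result):

if (anyNA(ind$est) || anyNA(ind$se)) {
  message(
    "NAs detected in the index. The prediction grid may be too large; ",
    "consider coarsening it or splitting the calculation by year."
  )
}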

We could also do some benchmarking, and perhaps it makes sense to do the bias correction year by year by default (I have a script somewhere where I was already doing this).

Also, make sure you're on the latest version, because I recently made this more memory efficient overall.

@Lewis-Barnett-NOAA
Collaborator Author

Lewis-Barnett-NOAA commented Apr 18, 2024 via email

@seananderson
Member

On second thought, the approach by @sebpardo above will work for prediction on large grids, but it won't work if your goal is to do spatial integration with uncertainty, as with get_index().

I've been doing some memory profiling of various approaches:
[Figure: memory profiling of the various approaches]

It appears that using lapply() and do.call() on year-by-year predictions is a huge win on memory for large grids, at the expense of time (mostly because of all the separate predictions, I imagine).
It might be worth having an argument in get_index() that integrates over a specific year or set of years, so the prediction could still be done just once.

Interestingly, purrr::map_dfr() is very memory hungry compared to map() or lapply() followed by converting the list to a data frame afterwards.
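
For reference, a sketch of the two patterns being compared (assuming fit and grid_chunks from above):

# purrr::map_dfr(): map and bind into one data frame in a single call
pred_df <- purrr::map_dfr(grid_chunks, ~ predict(fit, newdata = .x))

# lapply() then a single do.call(rbind, ...): build the list first, bind once
pred_list <- lapply(grid_chunks, function(chunk) predict(fit, newdata = chunk))
pred_df <- do.call(rbind, pred_list)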

I'll post code later when I've worked this out more.

@MikkoVihtakari tagging you in this issue too.

@KristaDBaker

I have been running into this issue a lot lately too, even when making the prediction grid cells relatively large (i.e., coarsening the grid). Has there been any movement on a more permanent fix (e.g., an argument within get_index())? Thanks.

@seananderson
Member

I wrote an experimental function in a new branch, 'index-split': 13a55d4

Try it out and let me know if you have any issues or interface suggestions before I merge into main.

Install with:

pak::pak("pbs-assess/sdmTMB@index-split")

See ?get_index_split for documentation.

This can't just be an argument in get_index() because of the way sdmTMB is set up: the prediction and index calculation are two separate steps, so we need something that wraps both.
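
For context, the two steps it wraps, shown for the full grid in one go (using m and nd as defined in the example below):

# Step 1: predict, keeping the TMB object needed for index calculation
p <- predict(m, newdata = nd, return_tmb_object = TRUE)
# Step 2: integrate the predictions over the grid
ind <- get_index(p, bias_correct = TRUE)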

Note that this will be slower if you don't need it; it's a speed/memory tradeoff. See the rough figure above.

library(sdmTMB)
mesh <- make_mesh(pcod, c("X", "Y"), n_knots = 60)
m <- sdmTMB(
  data = pcod,
  formula = density ~ 0 + as.factor(year),
  time = "year", mesh = mesh, family = tweedie(link = "log")
)
nd <- replicate_df(qcs_grid, "year", unique(pcod$year))
ind <- get_index_split(m, newdata = nd, nsplit = 2, bias_correct = TRUE)
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■                  50% | ETA:  0s
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100% | ETA:  0s
ind
#>   year      est       lwr      upr  log_est        se  type
#> 1 2003 277556.6 198537.77 388025.4 12.53378 0.1709448 index
#> 2 2004 399682.6 311373.58 513036.9 12.89843 0.1273887 index
#> 3 2005 430225.5 327502.38 565168.4 12.97206 0.1391935 index
#> 4 2007 119509.7  86514.07 165089.6 11.69115 0.1648452 index
#> 5 2009 210258.6 151807.71 291215.1 12.25609 0.1661887 index
#> 6 2011 339420.7 262199.67 439384.3 12.73500 0.1317035 index
#> 7 2013 351455.2 258755.05 477365.7 12.76984 0.1562276 index
#> 8 2015 383241.4 285819.07 513870.4 12.85642 0.1496487 index
#> 9 2017 192367.9 136764.93 270576.6 12.16716 0.1740572 index

Created on 2024-06-26 with reprex v2.1.0

@Lewis-Barnett-NOAA
Collaborator Author

Lewis-Barnett-NOAA commented Jun 27, 2024 via email

@Lewis-Barnett-NOAA
Collaborator Author

Finally got around to testing this. I'm having some issues with how get_index_split() parses the area argument. If I give it a vector of areas, one for each row of the prediction grid, I get this error from get_generic() indicating that the area is not the right length:

[Screenshot: error from get_generic() indicating the area vector is not the expected length]

Does this have to do with the transition away from the "fake newdata" padding used in earlier versions, a change that didn't get reflected in this branch?

If I give it a single value for area, I get a different error message indicating a mismatch in expected object dimensions (the same message as in this issue).
