get_index potential computation issues for very large prediction grids #334
I often encounter memory allocation errors due to very large prediction grids, and cutting the grid into chunks works well for me; a function that automates this could be a third solution to this issue:

library(dplyr)
library(purrr)

# Split the prediction grid into chunks of up to 500,000 rows, predict on
# each chunk, then back-transform the estimates to the response scale.
max_chunk_length <- 500000
grid_chunks <- prediction_grid %>%
  group_split(group_id = row_number() %/% max_chunk_length)
predict_list <- map(grid_chunks,
  ~ predict(fit, newdata = .x)$est %>%
    fit$family$linkinv(.))
predicted <- unlist(predict_list)
Nice idea, @sebpardo. For grids that are still small enough to predict on for a single year at a time, the simplest approach would be to iterate over each year (or set of years). E.g. #314 (reply in thread)

I'm in the midst of overhauling how extra time slices are handled, and as part of that, get_index() should also become a bit more memory efficient with this kind of slicing.

I'm not sure how to catch this, other than maybe we can check for NAs in the index data frame and issue a message suggesting this may be the issue and how to work around it.

We could also do some benchmarking; perhaps it makes sense to do the bias correction by year by default (somewhere I have a script where I was already doing this).

And be sure you're on the latest version, because I made this overall more memory efficient recently.
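For concreteness, a minimal sketch of that year-by-year pattern (illustration only, not code from this thread; fit and prediction_grid are placeholders for a fitted sdmTMB model and its full prediction grid):

library(sdmTMB)

# Sketch: predict and compute a bias-corrected index one year at a time so
# each prediction object stays small. (Assumes the installed sdmTMB version
# supports newdata containing only a subset of the fitted years.)
yrs <- sort(unique(prediction_grid$year))
index_by_year <- lapply(yrs, function(y) {
  nd <- prediction_grid[prediction_grid$year == y, , drop = FALSE]
  p <- predict(fit, newdata = nd, return_tmb_object = TRUE)
  get_index(p, bias_correct = TRUE)
})
index <- do.call(rbind, index_by_year)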
Good ideas, thanks fellas. The chunking of the prediction grid has worked better for me in practice than the aggregating, but seems slower. I'll keep experimenting with the latest updates.
On second thought, that approach by @sebpardo above will work for prediction on large grids, but it won't work if your goal is to do spatial integration with uncertainty, as in with get_index().

I've been doing some memory profiling of various approaches: it appears using lapply and do.call on a year-by-year prediction is a huge win on memory for large grids at the expense of time (mostly because of all the separate predictions, I imagine). Interestingly, purrr::map_dfr is very memory hungry compared to map or lapply followed by converting the list to a data frame afterwards. I'll post code later when I've worked this out more.

@MikkoVihtakari tagging you in this issue too.
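To illustrate the comparison, a rough sketch of the two patterns (placeholders fit and prediction_grid again; this is not the profiling code itself):

library(purrr)

# lapply + do.call(rbind, ...): predict each year into a list of smaller
# data frames and combine them once at the end.
yrs <- sort(unique(prediction_grid$year))
pred_list <- lapply(yrs, function(y) {
  predict(fit, newdata = prediction_grid[prediction_grid$year == y, , drop = FALSE])
})
pred <- do.call(rbind, pred_list)

# purrr::map_dfr() does the same predict-and-bind in a single call, which is
# the variant that appeared much more memory hungry in the profiling above.
pred_dfr <- map_dfr(yrs, function(y) {
  predict(fit, newdata = prediction_grid[prediction_grid$year == y, , drop = FALSE])
})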
I have been running into this issue a lot lately too, even when making the prediction grids relatively large. Has there been any movement on a more permanent fix (e.g., an argument within get_index())? Thanks.
I wrote an experimental function in a new branch 'index-split'. 13a55d4

Try it out and let me know if you have any issues or interface suggestions before I merge into main.

Install with:

pak::pak("pbs-assess/sdmTMB@index-split")

See the function ?get_index_split.

There can't just be an argument in get_index() because, the way sdmTMB is set up, the prediction and index calculation are two steps. So, we need something that wraps those.

Note this will be slower if you don't need it. It's a speed/memory tradeoff. See the rough figure above.

library(sdmTMB)
mesh <- make_mesh(pcod, c("X", "Y"), n_knots = 60)
m <- sdmTMB(
data = pcod,
formula = density ~ 0 + as.factor(year),
time = "year", mesh = mesh, family = tweedie(link = "log")
)
nd <- replicate_df(qcs_grid, "year", unique(pcod$year))
ind <- get_index_split(m, newdata = nd, nsplit = 2, bias_correct = TRUE)
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■ 50% | ETA: 0s
#> Calculating index in 2 chunks ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100% | ETA: 0s
ind
#> year est lwr upr log_est se type
#> 1 2003 277556.6 198537.77 388025.4 12.53378 0.1709448 index
#> 2 2004 399682.6 311373.58 513036.9 12.89843 0.1273887 index
#> 3 2005 430225.5 327502.38 565168.4 12.97206 0.1391935 index
#> 4 2007 119509.7 86514.07 165089.6 11.69115 0.1648452 index
#> 5 2009 210258.6 151807.71 291215.1 12.25609 0.1661887 index
#> 6 2011 339420.7 262199.67 439384.3 12.73500 0.1317035 index
#> 7 2013 351455.2 258755.05 477365.7 12.76984 0.1562276 index
#> 8 2015 383241.4 285819.07 513870.4 12.85642 0.1496487 index
#> 9 2017 192367.9 136764.93 270576.6 12.16716 0.1740572 index

Created on 2024-06-26 with reprex v2.1.0
Thanks Sean, I'll do some testing with this for the Bering groundfish biomass indices and let you know if I notice anything about how the speed/memory tradeoff works for our servers. We are also exploring some cloud computing options that could of course help solve these issues as well.
Finally got around to testing this. I'm having some issues with how get_index_split() is parsing the area object. If I give it a vector of areas, one for each row of the prediction grid, I get an error from get_generic() indicating that the area is not the right length. Does this have to do with the transition from how we handled the "fake newdata" padding in earlier versions, which didn't get reflected in this branch? If I give it a single value for area, I get a different error message indicating some mismatch in expected object dimensions, the same message as in this issue.
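For reference, the two call patterns being described, written against the earlier example (this assumes get_index_split() takes area the same way get_index() does, i.e. a single value or one value per row of newdata; cell_area is a hypothetical column of per-cell areas):

# Hypothetical per-row area vector (cell_area is an assumed column name):
ind_vec <- get_index_split(
  m, newdata = nd,
  area = nd$cell_area, # one value per row of newdata
  nsplit = 2, bias_correct = TRUE
)

# A single constant area applied to every prediction cell:
ind_scalar <- get_index_split(
  m, newdata = nd,
  area = 4,
  nsplit = 2, bias_correct = TRUE
)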
Unfortunately, this is another one that is hard to offer an MWE for due to the model complexity, but I've found that get_index is producing NAs for the index and its SD when the prediction grid/prediction object includes too many points. In the example I was working with, if I split the same prediction grid into 2 subareas, the subarea indices compute as expected. If I can pin the issue down more definitively I'll update this issue.
Possible solutions: