
Faster prior variance calculation #306

Merged
merged 7 commits into from
Nov 14, 2023
Conversation

nspope
Contributor

@nspope nspope commented Aug 3, 2023

Use a recursive approach to calculating prior variances that is a couple of orders of magnitude faster: it's now possible to get exact variances for 100,000 tips in a few minutes (by comparison, it's infeasible to get exact variances for more than about 2,000 tips with the original algorithm).

Some indication that the two algorithms are equivalent, comparing sqrt(variance) for 1280 tips.

Still trying to get my head around the interpolation scheme used for the approximation, and if this algorithm will be useful there (it recurses backwards from the maximum number of tips).
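For intuition only (this is not the PR's implementation, which works with the conditional coalescent prior and mixes over node tip counts; it's a minimal sketch under standard-coalescent assumptions): waiting times between coalescent events are independent exponentials, so the mean and variance of each event time can be accumulated in a single backwards pass from n tips instead of being re-summed from scratch for every node.

```python
import numpy as np

def coalescent_time_variances(n):
    """Mean and variance of the time at which k lineages remain, for
    k = 1..n, in a standard coalescent with n tips.

    While j lineages are extant, the waiting time to the next coalescence
    is exponential with rate j*(j-1)/2, so means and variances simply add;
    the backwards recursion makes the whole table O(n).
    """
    mean = np.zeros(n + 1)  # mean[k]: E[time when k lineages remain]
    var = np.zeros(n + 1)   # var[k]: Var[time when k lineages remain]
    for j in range(n, 1, -1):
        rate = j * (j - 1) / 2
        mean[j - 1] = mean[j] + 1.0 / rate
        var[j - 1] = var[j] + 1.0 / rate**2
    # index 0 of the returned arrays corresponds to k = 1 (the TMRCA)
    return mean[1:], var[1:]
```

The same additivity is what makes an exact recursion cheap compared with recomputing nested sums for every node, which is where the orders-of-magnitude speedup comes from.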

@nspope
Contributor Author

nspope commented Aug 3, 2023

Some interesting (but possibly useless) applications of this algorithm:

  • Get logarithmic moments -- this would allow matching lognormal/gamma via sufficient statistics rather than mean/variance
  • Adjust for variable population size at the level of the hypoexponential components (e.g. do the time rescaling before marginalizing over number of extant ancestors, and mixing over node spans) -- this should give more accurate moments for the gamma/lognormal approximations, but maybe not enough to prioritize?

@hyanwong
Member

hyanwong commented Aug 3, 2023

Really neat. Thanks for this Nate.

@nspope
Contributor Author

nspope commented Aug 25, 2023

Darn, this is working but fried some unit tests.

@nspope nspope marked this pull request as ready for review November 13, 2023 06:48
@nspope
Contributor Author

nspope commented Nov 13, 2023

This is good to go. I moved the original, slow variance calculation into utility_functions, and test it against the fast implementation.

@nspope nspope requested a review from hyanwong November 13, 2023 06:51
Member

@hyanwong hyanwong left a comment

LGTM. Maybe it's worth keeping a few tests for the simpler version? Also, using the approximation only when N>20,000 seems quite a high threshold. I assume it's so fast that this is fine? And I assume that's when you removed the "Calculating Prior Variances" tqdm feedback?

tests/test_functions.py (review comments resolved)
@hyanwong
Member

Also, for very large inferences, if e.g. N=1e6 and the user has set approx_priors=False, is it worth outputting some sort of progress bar? Perhaps we only display a progress bar for this stage when progress=True and approx_priors=False?

@nspope
Contributor Author

nspope commented Nov 13, 2023

Hmm, I removed the progress bar because the variance calculation is done recursively in numba -- so, no way to output progress without slowing things down dramatically.

@nspope
Contributor Author

nspope commented Nov 13, 2023

using the approximation only when N>20,000 seems quite a high threshold

I think it takes around half a minute, according to the benchmark above? That's really when the cost starts kicking in.

@hyanwong
Member

Hmm, I removed the progress bar because the variance calculation is done recursively in numba -- so, no way to output progress without slowing things down dramatically.

Oh, well, if there's no obvious trick there then just drop it.

We could output a logging warning that this might take a while if approx_priors=False and N is so large that it will take on the order of an hour or more?
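A sketch of what such a warning might look like (the function name, threshold constant, and message are hypothetical, not tsdate's actual API):

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical cutoff: the tip count above which an exact prior variance
# calculation is expected to take an hour or more.
SLOW_PRIOR_WARN_THRESHOLD = 1_000_000

def maybe_warn_slow_prior(num_tips, approx_priors):
    """Warn when an exact prior variance calculation may be very slow."""
    if not approx_priors and num_tips >= SLOW_PRIOR_WARN_THRESHOLD:
        logger.warning(
            "Calculating exact prior variances for %d tips; this may take "
            "an hour or more. Consider setting approx_priors=True.",
            num_tips,
        )
```

A one-off warning like this avoids the overhead of per-iteration progress reporting inside the numba-compiled recursion.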

@nspope
Contributor Author

nspope commented Nov 13, 2023

Yes, that's a good idea

@nspope
Contributor Author

nspope commented Nov 14, 2023

Ok, I think this is done, except I'm not sure about the default for using the approximate prior. I suppose that if there's missing data, the calculation has to be repeated for every realized number of tips across the tree sequence? So, I set the default to 10,000, which should take on the order of a second according to the benchmark above. Seem OK?

@hyanwong
Member

hyanwong commented Nov 14, 2023

A second sounds OK to me, as does a default of 10,000, so I'll just merge this. Thanks a lot Nate!

@hyanwong hyanwong merged commit 697d80a into tskit-dev:main Nov 14, 2023
3 checks passed