Skip to content

Two locus branch stats python prototype #2934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 30, 2024

Conversation

lkirk
Copy link
Contributor

@lkirk lkirk commented Apr 23, 2024

I've re-opened a PR to avoid some issues with build caching. Please see the original discussion on #2912.

Currently, this algorithm creates a matrix of LD, performing a pairwise comparison of all trees in the tree sequence.

This implementation lacks windows/positions, sample sets and polarisation. The outputs of the code produce results in units of branch length, needing to be multiplied by mu^2 or divided by product of the total branch length of the two trees.

This algorithm works by keeping a running sum of the statistic between two trees, updating each time we encounter a branch addition or removal. The tricky part is that we have to remove or add LD contributed by samples that already existed or that will remain under a given node after the addition/removal of branches.

We include a validation against the original formulation of this problem, by including an implementation that was described in McVean
2002. The original formulation computing the covariance of tMRCAs of 2, 3, and 4 samples of individuals on the trees in question. This implementation has several limitations 1) it is very slow and 2) it does not work for trees that are decapitated, as certain samples do not have MRCAs.

@lkirk
Copy link
Contributor Author

lkirk commented Apr 23, 2024

@jeromekelleher I'm still seeing issues with the build cache. Is there anything I can do to help here? I'm not in a huge rush for what it's worth, just wanted to close this out.

Copy link

codecov bot commented Apr 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.62%. Comparing base (3349fdd) to head (343a89b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2934      +/-   ##
==========================================
- Coverage   89.63%   89.62%   -0.01%     
==========================================
  Files          29       29              
  Lines       30174    30174              
  Branches     5873     5873              
==========================================
- Hits        27046    27043       -3     
- Misses       1789     1791       +2     
- Partials     1339     1340       +1     
Flag Coverage Δ
c-tests 86.20% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 88.72% <ø> (ø)
python-tests 99.00% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 2 files with indirect coverage changes

@jeromekelleher
Copy link
Member

@benjeffery any advice here?

@lkirk lkirk changed the title Adds a py prototype for two-locus branch stats Two locus branch stats python prototype Apr 23, 2024
@benjeffery
Copy link
Member

@benjeffery any advice here?

I'm in progress with this at #2935

@jeromekelleher
Copy link
Member

We've pushed through a few changes in the CI setup, can you rebase now please @lkirk? Hopefully things will go a bit more smoothly now.

@lkirk lkirk force-pushed the two-locus-branch-stats branch 5 times, most recently from 445707c to 25a8ddc Compare April 25, 2024 18:33
@lkirk
Copy link
Contributor Author

lkirk commented Apr 25, 2024

Thank you @benjeffery for fixing that!

@jeromekelleher This is ready now, I chopped many more slow tests from this. I realized that there are only 2 cores being used on the test servers and I hadn't anticipated how the slow tests would compound in runtime. As written, things were taking ~1.5 hours and I've adjusted things so now only important tests are considered, adding about 2 minutes to the test runtime.

Does this seem reasonable?

@jeromekelleher
Copy link
Member

2 minutes is still quite a bit. You could just use the naive version on a handful of small cases to verify things are lining up as expected, and defer doing tests on the full range of examples until the C version is in place?

@benjeffery
Copy link
Member

Any slow tests should be marked with @pytest.mark.slow.

@lkirk lkirk force-pushed the two-locus-branch-stats branch from 25a8ddc to acd0d47 Compare April 26, 2024 15:00
@lkirk
Copy link
Contributor Author

lkirk commented Apr 26, 2024

@jeromekelleher Agreed, I've cut a couple more tests to reach a minimal subset and now the runtime is the same -- at least in the number of minutes, I did a quick check of the runtimes on main and compared to these and everything seems to be running in the same amount of time, within a minute at least. Slowest added test runs in ~.2 seconds on my relatively fast computer.

Here's the times from the latest run on main:
image

So, it appears that the difference in runtime is sub-minute now. Is that reasonable?

If not, I can just mark as slow.

@benjeffery
Copy link
Member

@Mergifyio rebase

Currently, this algorithm creates a matrix of LD, performing a pairwise
comparison of all trees in the tree sequence.

This implementation lacks windows/positions, sample sets and
polarisation. The outputs of the code produce results in units of branch
length, needing to be multiplied by mu^2 or divided by product of the
total branch length of the two trees.

This algorithm works by keeping a running sum of the statistic between
two trees, updating each time we encounter a branch addition or removal.
The tricky part is that we have to remove or add LD contributed by
samples that already existed or that will remain under a given node
after the addition/removal of branches.

We include a validation against the original formulation of this
problem, by including an implementation that was described in McVean
2002. The original formulation computing the covariance of tMRCAs of
2, 3, and 4 samples of individuals on the trees in question. This
implementation has several limitations 1) it is very slow and 2) it does
not work for trees that are decapitated, as certain samples do not have
MRCAs.
Copy link
Contributor

mergify bot commented Apr 30, 2024

rebase

✅ Branch has been successfully rebased

@benjeffery benjeffery force-pushed the two-locus-branch-stats branch from acd0d47 to 343a89b Compare April 30, 2024 09:53
@mergify mergify bot merged commit 1b84b63 into tskit-dev:main Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants