Two locus branch stats python prototype #2934

lkirk · 2024-04-23T16:14:39Z

I've re-opened a PR to avoid some issues with build caching. Please see the original discussion on #2912.

Currently, this algorithm creates a matrix of LD, performing a pairwise comparison of all trees in the tree sequence.

This implementation lacks windows/positions, sample sets and polarisation. The outputs of the code produce results in units of branch length, needing to be multiplied by mu^2 or divided by product of the total branch length of the two trees.

This algorithm works by keeping a running sum of the statistic between two trees, updating each time we encounter a branch addition or removal. The tricky part is that we have to remove or add LD contributed by samples that already existed or that will remain under a given node after the addition/removal of branches.

We include a validation against the original formulation of this problem, by including an implementation that was described in McVean
2002. The original formulation computing the covariance of tMRCAs of 2, 3, and 4 samples of individuals on the trees in question. This implementation has several limitations 1) it is very slow and 2) it does not work for trees that are decapitated, as certain samples do not have MRCAs.

lkirk · 2024-04-23T16:17:04Z

@jeromekelleher I'm still seeing issues with the build cache. Is there anything I can do to help here? I'm not in a huge rush for what it's worth, just wanted to close this out.

codecov · 2024-04-23T16:52:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.62%. Comparing base (3349fdd) to head (343a89b).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2934      +/-   ##
==========================================
- Coverage   89.63%   89.62%   -0.01%     
==========================================
  Files          29       29              
  Lines       30174    30174              
  Branches     5873     5873              
==========================================
- Hits        27046    27043       -3     
- Misses       1789     1791       +2     
- Partials     1339     1340       +1

Flag	Coverage Δ
c-tests	`86.20% <ø> (ø)`
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`88.72% <ø> (ø)`
python-tests	`99.00% <ø> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 2 files with indirect coverage changes

jeromekelleher · 2024-04-23T20:55:48Z

@benjeffery any advice here?

benjeffery · 2024-04-24T21:26:47Z

@benjeffery any advice here?

I'm in progress with this at #2935

jeromekelleher · 2024-04-25T08:11:23Z

We've pushed through a few changes in the CI setup, can you rebase now please @lkirk? Hopefully things will go a bit more smoothly now.

lkirk · 2024-04-25T18:54:49Z

Thank you @benjeffery for fixing that!

@jeromekelleher This is ready now, I chopped many more slow tests from this. I realized that there are only 2 cores being used on the test servers and I hadn't anticipated how the slow tests would compound in runtime. As written, things were taking ~1.5 hours and I've adjusted things so now only important tests are considered, adding about 2 minutes to the test runtime.

Does this seem reasonable?

jeromekelleher · 2024-04-26T10:06:34Z

2 minutes is still quite a bit. You could just use the naive version on a handful of small cases to verify things are lining up as expected, and defer doing tests on the full range of examples until the C version is in place?

benjeffery · 2024-04-26T10:50:09Z

Any slow tests should be marked with @pytest.mark.slow.

lkirk · 2024-04-26T15:45:28Z

@jeromekelleher Agreed, I've cut a couple more tests to reach a minimal subset and now the runtime is the same -- at least in the number of minutes, I did a quick check of the runtimes on main and compared to these and everything seems to be running in the same amount of time, within a minute at least. Slowest added test runs in ~.2 seconds on my relatively fast computer.

Here's the times from the latest run on main:

So, it appears that the difference in runtime is sub-minute now. Is that reasonable?

If not, I can just mark as slow.

benjeffery · 2024-04-30T09:53:36Z

@Mergifyio rebase

Currently, this algorithm creates a matrix of LD, performing a pairwise comparison of all trees in the tree sequence. This implementation lacks windows/positions, sample sets and polarisation. The outputs of the code produce results in units of branch length, needing to be multiplied by mu^2 or divided by product of the total branch length of the two trees. This algorithm works by keeping a running sum of the statistic between two trees, updating each time we encounter a branch addition or removal. The tricky part is that we have to remove or add LD contributed by samples that already existed or that will remain under a given node after the addition/removal of branches. We include a validation against the original formulation of this problem, by including an implementation that was described in McVean 2002. The original formulation computing the covariance of tMRCAs of 2, 3, and 4 samples of individuals on the trees in question. This implementation has several limitations 1) it is very slow and 2) it does not work for trees that are decapitated, as certain samples do not have MRCAs.

mergify · 2024-04-30T09:53:42Z

rebase

✅ Branch has been successfully rebased

lkirk changed the title ~~Adds a py prototype for two-locus branch stats~~ Two locus branch stats python prototype Apr 23, 2024

lkirk force-pushed the two-locus-branch-stats branch 5 times, most recently from 445707c to 25a8ddc Compare April 25, 2024 18:33

lkirk force-pushed the two-locus-branch-stats branch from 25a8ddc to acd0d47 Compare April 26, 2024 15:00

jeromekelleher added the AUTOMERGE-REQUESTED label Apr 29, 2024

jeromekelleher approved these changes Apr 29, 2024

View reviewed changes

benjeffery force-pushed the two-locus-branch-stats branch from acd0d47 to 343a89b Compare April 30, 2024 09:53

mergify bot merged commit 1b84b63 into tskit-dev:main Apr 30, 2024

mergify bot removed the AUTOMERGE-REQUESTED label Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Two locus branch stats python prototype #2934

Two locus branch stats python prototype #2934

Uh oh!

lkirk commented Apr 23, 2024 •

edited

Loading

Uh oh!

lkirk commented Apr 23, 2024

Uh oh!

codecov bot commented Apr 23, 2024 •

edited

Loading

Uh oh!

jeromekelleher commented Apr 23, 2024

Uh oh!

benjeffery commented Apr 24, 2024

Uh oh!

jeromekelleher commented Apr 25, 2024

Uh oh!

lkirk commented Apr 25, 2024

Uh oh!

jeromekelleher commented Apr 26, 2024

Uh oh!

benjeffery commented Apr 26, 2024

Uh oh!

lkirk commented Apr 26, 2024 •

edited

Loading

Uh oh!

benjeffery commented Apr 30, 2024

Uh oh!

mergify bot commented Apr 30, 2024

Uh oh!

Uh oh!

Two locus branch stats python prototype #2934

Two locus branch stats python prototype #2934

Uh oh!

Conversation

lkirk commented Apr 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lkirk commented Apr 23, 2024

Uh oh!

codecov bot commented Apr 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Apr 23, 2024

Uh oh!

benjeffery commented Apr 24, 2024

Uh oh!

jeromekelleher commented Apr 25, 2024

Uh oh!

lkirk commented Apr 25, 2024

Uh oh!

jeromekelleher commented Apr 26, 2024

Uh oh!

benjeffery commented Apr 26, 2024

Uh oh!

lkirk commented Apr 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benjeffery commented Apr 30, 2024

Uh oh!

mergify bot commented Apr 30, 2024

✅ Branch has been successfully rebased

Uh oh!

Uh oh!

lkirk commented Apr 23, 2024 •

edited

Loading

codecov bot commented Apr 23, 2024 •

edited

Loading

lkirk commented Apr 26, 2024 •

edited

Loading