Skip to content

Two-way two-locus statistics (python proto) #3179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lkirk
Copy link
Contributor

@lkirk lkirk commented May 28, 2025

This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code.

The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the index argument. This helps us avoid creating another ld_matrix method for the TreeSequence object.

In other words, for a one-way statistic, a user would specify:

ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2])

Which would output a 3D ndarray containing one LD matrix per sample set.

ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1)])

Which would output a 2D ndarray containing one LD matrix for the index pair. This would use our D2_ij_summary_func, instead of the D2_summary_func. Finally, if a user provided

ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)])

We would output a 3D ndarray containing one LD matrix per index pair provided.

Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2.

I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.

I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.

Fixes #2832

@lkirk lkirk requested a review from jeromekelleher May 28, 2025 20:54
@lkirk
Copy link
Contributor Author

lkirk commented May 28, 2025

Since @apragsdale isn't in the tskit organization, pinging him here to add as reviewer.

Copy link

codecov bot commented May 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.62%. Comparing base (9d2ff8e) to head (dc86340).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3179      +/-   ##
==========================================
- Coverage   98.91%   89.62%   -9.29%     
==========================================
  Files          17       28      +11     
  Lines        7542    31969   +24427     
  Branches     1258     5873    +4615     
==========================================
+ Hits         7460    28653   +21193     
- Misses         33     1886    +1853     
- Partials       49     1430    +1381     
Flag Coverage Δ
c-tests 86.66% <ø> (?)
lwt-tests 80.38% <ø> (?)
python-c-tests 88.18% <ø> (?)
python-tests 98.85% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
c/tskit/trees.c 90.61% <ø> (ø)

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lkirk lkirk force-pushed the two-locus-multipopulation-py branch from b77bc1b to 9f8222e Compare May 29, 2025 05:24
Copy link

@apragsdale apragsdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lkirk.

If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.

This is ok (even expected) for the unbiased statistics, since the formulation assumes sampling without replacement, but having overlapping sample sets would essentially copy those repeated sample nodes. I don't think we should worry about it.

@lkirk lkirk force-pushed the two-locus-multipopulation-py branch 4 times, most recently from 9af9695 to 834f0c2 Compare June 2, 2025 07:51
This PR implements two-way LD statistics, specified between sample sets.
During the development of this functionality, a number of issues with
the designation of state_dims/result_dims were discovered. These have
been resolved, testing clean for existing code and providing the proper
behavior for this new code.

The mechanism by which users will specify a multi-population (or
two-way) statistic is by providing the `index` argument. This helps us
avoid creating another `ld_matrix` method for the TreeSequence object.

In other words, for a one-way statistic, a user would specify:
```python
ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2])
```
Which would output a 3D ndarray containing one LD matrix per sample set.

```python
ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1)])
```
Which would output a 2D ndarray containing one LD matrix for the index
pair. This would use our `D2_ij_summary_func`, instead of the
`D2_summary_func`. Finally, if a user provided
```python
ts.ld_matrix(stat="D2", sample_sets=[ss1, ss2], indexes=[(0, 1), (1, 1)])
```
We would output a 3D ndarray containing one LD matrix _per_ index pair
provided.

Since these are two-way statistics, the indexes must be length 2. We
plan on enabling users to implement k-way via a "general_stat" api. We did
not implement anything more than two-way statistics here because of the
combinatoric explosion of logic required for indexes > 2.

I added some basic tests to demonstrate that things were working
properly. If we compute two-way statistics on identical sample sets,
they should be equal to the one-way statistics. Unfortunately, this does
not apply to unbiased statistics, which I've validated manually.

I've also cleaned up the docstrings a bit and fixed a bug with the
D_prime statistic, which should not be weighted by haplotype frequency.
@lkirk lkirk force-pushed the two-locus-multipopulation-py branch from 834f0c2 to dc86340 Compare June 2, 2025 07:52
Copy link

@apragsdale apragsdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making that change. This looks good to me then.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a look through and the code all LGTM. The changes seem sensible, but I'd need to spend a lot longer on it to really digest the details and understand the finer points.

I'm going to ping @petrelharp here for a review as he's much better at this sort of stuff

@jeromekelleher jeromekelleher requested a review from petrelharp June 5, 2025 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multipopulation two-site statistics
3 participants