-
Notifications
You must be signed in to change notification settings - Fork 76
divmat tree by tree #2736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
divmat tree by tree #2736
Conversation
f274999
to
8d7a42f
Compare
This is ready for a look @petrelharp - most of the stuff is implemented I think, except for the Questions:
|
8d7a42f
to
e8a3e0d
Compare
I think this is more-or-less done now @petrelharp - the only thing I'm not sure about is the I guess that would cause problems for tree sequences that contain missing data though (say |
Codecov Report
@@ Coverage Diff @@
## main #2736 +/- ##
==========================================
+ Coverage 89.83% 89.84% +0.01%
==========================================
Files 30 30
Lines 28624 29026 +402
Branches 5590 5669 +79
==========================================
+ Hits 25713 26079 +366
- Misses 1655 1676 +21
- Partials 1256 1271 +15
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
|
Tests failing because of #2745 |
Aha, turns out there isn't a problem at all, it's just a subtle numerical precision problem around 0. Ready for a look now I think! |
Notes from the meeting now:
|
I think having sample_sets, samples and individuals as mutually exclusive parameters that do what we want would be fine? Specifying a list of n singleton sets is quite annoying and ineffecient, and having another parameter to me is better than overloading sample_sets. So, keep the API as is for now, but ultimately implement following the standard sample sets approach in C, and then add support in python for individuals and sample_sets? |
e323e05
to
cf5269f
Compare
How does this look as a basic definition of the site-mode version @petrelharp? If we agree on this as what we're trying to compute, I'll have a think about how to do things a bit more efficiently (maintaining cumulative mutation counts per node should be possible so that we can use the efficient MRCA calculation, but will involve writing a full-fat incremental algorithm). def rootward_path(tree, u, v):
while u != v:
yield u
u = tree.parent(u)
def site_divergence_matrix_naive(ts, windows=None, samples=None):
windows_specified = windows is not None
windows = [0, ts.sequence_length] if windows is None else windows
num_windows = len(windows) - 1
samples = ts.samples() if samples is None else samples
n = len(samples)
D = np.zeros((num_windows, n, n))
tree = tskit.Tree(ts)
for i in range(num_windows):
left = windows[i]
right = windows[i + 1]
tree.seek(left)
# Iterate over the trees in this window
while tree.interval.left < right and tree.index != -1:
span_left = max(tree.interval.left, left)
span_right = min(tree.interval.right, right)
mutations_per_node = collections.Counter()
for site in tree.sites():
if span_left <= site.position < span_right:
for mutation in site.mutations:
mutations_per_node[mutation.node] += 1
for j in range(n):
u = samples[j]
for k in range(j + 1, n):
v = samples[k]
w = tree.mrca(u, v)
if w != tskit.NULL:
wu = w
wv = w
else:
wu = local_root(tree, u)
wv = local_root(tree, v)
du = sum(mutations_per_node[x] for x in rootward_path(tree, u, wu))
dv = sum(mutations_per_node[x] for x in rootward_path(tree, v, wv))
# NOTE: we're just accumulating the raw mutation counts, not
# multiplying by span
D[i, j, k] += du + dv
tree.next()
# Fill out symmetric triangle in the matrix
for j in range(n):
for k in range(j + 1, n):
D[i, k, j] = D[i, j, k]
if not windows_specified:
D = D[0]
return D |
cf5269f
to
a5f9756
Compare
I pushed through the changes here for the site mode divmat @petrelharp, and I think we're almost API complete enough to go for the initial merge. Annoyingly, the basic tree-by-tree site mode version that I have here is much slower than I thought it would be, and we need a different algorithm. In some quick testing, we currently have the branch mode divmat is at least 10X faster than the general-stat based method, whereas the site mode is at least 10X slower. I don't really know why this is exactly - it might just be that doing the MRCAs by traversal really is that much slower than using Schieber-Vishkin. But it's clear we'll need to do some more work here. What I'd suggest is that we fill out the testing here on this much, to make sure we're all happy that this really is what we want to compute, and then revisit after this PR is merged, with a better algorithm for site-mode. Are you happy with this? Is the current API a good forwards-compatible starting point (pending the change of default mode to "stat")? I'll go ahead and finalise the testing if so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no issues - LGTM!
OK great, I'll ping you when it's ready for final review. |
da0ea6e
to
7b11fae
Compare
I think this is ready for a final look and merge. Would you mind having a quick scan please @benjeffery? |
7b11fae
to
741164e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
I note that numba
is mentioned but not used - was the intent to use it in the test suite?
c/tskit/trees.c
Outdated
} | ||
|
||
if ((options & TSK_STAT_POLARISED) || (options & TSK_STAT_SPAN_NORMALISE)) { | ||
/* TODO better error */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a TODO for this PR or future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the specific errors.
Well spotted. It's handy to have a version of this code lying around that you can run numba on directly, but I don't think we need to bother with adding numba support here for functions in the test suite. I've added a comment to clarify |
741164e
to
21be777
Compare
I've tied up a few loose ends, this is ready to go now. Also opened #2791 to track the missing support for string-window specs. |
Implement the basic version of the divergence matrix operation using tree-by-tree algorithms, and provide interface for parallelising along the genome.
21be777
to
f84f5e3
Compare
This is a simple implementation of the divergence matrix, based on the infrastructure developed in
#2710.
The idea is that we do the very straighforward and naive thing of going pairwise through the samples,
but compute the MRCAs using a constant time algorithm. It's slightly faster than the current AVL
tree based approach in #2710, while using far less memory. Another advantage is that this implementation
is simple to parallelise by adding left and right coordinate arguments
and taking advantage of the new quick seeking in #2661. I think this is probably
more important in practise now than good algorithms, as we'll need to throw lots of threads
at these calculations whatever happens.
I think we'll probably want to keep this implementation around as an option in the long run anyway,
since it's likely to be faster than more complex algorithms for small n, and it uses very little
memory (which may also be important in practise).
So, I vote we push ahead with fleshing out the API for this?