-
Notifications
You must be signed in to change notification settings - Fork 76
filter_nodes for simplify #2619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2619 +/- ##
=======================================
Coverage ? 93.92%
=======================================
Files ? 28
Lines ? 27731
Branches ? 1271
=======================================
Hits ? 26046
Misses ? 1648
Partials ? 37
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
I just realised that the easiest way to allow any nodes to be marked as samples is to switch the NODE_IS_SAMPLE flags before simplifying. Since we require the node order to remain the same, I think this is all we need to do. |
Actually, it's even easier than that. Recent commit included: I think this works fine:
|
9da35ab
to
ac20541
Compare
I think this now needs to be implemented in the C version of |
0cb7cbc
to
b32c3bc
Compare
I have coded up the C equivalent, but there are 2 remaining issues:
|
b32c3bc
to
89d38fe
Compare
I am trying to figure out what this does as I think it collides with something that one may need for (one approach to) parallel simplification. The comments in #2605 are concerning to me re: marking all nodes as samples. That would completely break the stuff that I am presenting in December by rendering it not possible. (Or not possible w/o another flag to simplify that suppresses the behavior described in #2605, but I'd consider that a serious design failure.) Could someone provide a simple description of what this is doing -- I can't find one by by tracking the linked issues, etc.. |
I don't think we mark all nodes as samples. The current implementation of
Well, I think the key is here (code lightly edited for clarity)
So normally ( Does that make sense? Is my explanation too simplistic @molpopgen ? I feel you probably understand this much better than I do anyway. |
Yes, I think that this makes sense -- the work "all" in the comment elsewhere was ambiguous to me. So, if filter nodes is False:
For this to make sense, all nodes that get recorded by have no referents after simplification should be marked as NULL in the output idmap? Is that the case? |
Yes, with the proviso that the sample nodes are defined as the ones passed in to simplify (which by default is all the nodes flagged with tskit.NODE_IS_SAMPLE in the original tree sequence)
The idmap doesn't tell you whether the node is used or not, only if the node Id has changed. Even in the case of nodes that have no referents after simplification, the nodes are retained in the table (although not referenced anywhere), and the idmap is simply the identify map. In fact, by definition, the output idmap when |
This isn't quite right -- a node that "simplifies out" of the tables (is not longer part of the ancestry anywhere) has -1/null in the idmap. For what I envision, I require this to be true for this feature, too. |
This is not what I read in the code that you wrote above -- what is "passed in node list"? |
This is the sticking point, and breaks what I need this feature to do. I require a way to: keep the input and output node tables the same length but also to know which nodes are no longer used. The implementation of this that I have done takes this into account by entering NULL into idmap for the latter case. |
My implementtion is here: https://github.com/molpopgen/tskit-rust/tree/parallel_experiment. I cannot make it easy to see diffs b/c that would trigger a PR against tskit-rust, but you can look at the commit history for the branch and see the changes to simplification (in the C code) |
The user calls
I'm sure we could find a way to report this. We could revise the meaning of the idmap in this case, or (I guess) we could search through the edge table (but that's not very performant). @jeromekelleher might have a sensible suggestion here. My previous hack would work for this, as a matter of fact: do a normal |
I think it could be reasonable to return an id_map with NULL for unused nodes in the |
All this sounds great to me. Two suggestions:
|
I think it's better to keep this as NO_FILTER following the population etc tables because the KEEP flags (in retrospect) are more about keeping edges than nodes. We have:
We keep the edges, and therefore the corresponding nodes are not filtered. The trouble with |
OK, this should implement the "don't touch the table pointers" semantics when we don't filter unreferenced nodes, sites, populations or individuals. I had to jiggle things around a bit to do it, but I think it was worth the refactoring. Hopefully it's a bit clearer how the copies are used now and it opens the door for future memory reductions as well. I still need to document and add the NO_UPDATE_SAMPLE_FLAGS. I'm also hoping to add an example with multiple table collections using the same nodes and individuals, to see convince myself that this all works in practise but we'll see how that goes time-wise. |
4f49c76
to
62d72ef
Compare
I think this is ready for review and merge - I'll implement TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS separately as this PR has ended up getting quite big. Turns out making this memory safe required quite a bit of fiddling! I think the code it much better now though, so hopefully worth the effort. |
1688323
to
3f8e970
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I only found a couple of typos
3f8e970
to
649b271
Compare
Added a few minor things from review from @petrelharp and @benjeffery - ready to go I think! |
Opened #2676 for the CircleCI build failure (which looks non-trivial) |
649b271
to
cdb9fa6
Compare
|
@Mergifyio rebase |
Revert to old implementation Simple tset Update changelog and add C tests Done? Low level tests More tests Fix error-handling error in simplifier_init We were clearing the input tables before checking sample errors Fixup broken tests Add test for mutations Don't clear the node table when not filtering nodes Refactor Modernise simplify test Fix provenance bug Removed unused samples member Implement filter-populations with no-touch semantics updates Refactor finalise references path Finished no-touch semantics on the non-filter casese Remove unused simplify_t struct member Make simplify thread-safe in no-filter case Update changelog
To get the future behavior now, you can configure Or you can create a dedicated github account for squash and rebase operations, and use it in different |
✅ Branch has been successfully rebased |
cdb9fa6
to
7ea2b59
Compare
Description
A follow up to #2606 with tests added. Currently intentionally failing until we decide whether to allow the passed-in
samples
to be anything other thants.samples()
. Will also need to implement this in the C version of simplify, and switchcompare_lib=True
Fixes #2605.
PR Checklist: