filter_nodes for simplify #2619

hyanwong · 2022-10-30T22:00:33Z

Description

A follow up to #2606 with tests added. Currently intentionally failing until we decide whether to allow the passed-in samples to be anything other than ts.samples(). Will also need to implement this in the C version of simplify, and switch compare_lib=True

Fixes #2605.

PR Checklist:

Tests that fully cover new/changed functionality.
Documentation including tutorial content if appropriate.
Changelogs, if there are API changes.

codecov · 2022-10-30T22:13:35Z

Codecov Report

❗ No coverage uploaded for pull request base (main@d4efac9). Click here to learn what that means.
The diff coverage is 90.81%.

❗ Current head 4f49c76 differs from pull request most recent head 7ea2b59. Consider uploading reports for the commit 7ea2b59 to get more accurate results

@@           Coverage Diff           @@
##             main    #2619   +/-   ##
=======================================
  Coverage        ?   93.92%           
=======================================
  Files           ?       28           
  Lines           ?    27731           
  Branches        ?     1271           
=======================================
  Hits            ?    26046           
  Misses          ?     1648           
  Partials        ?       37

Flag	Coverage Δ
c-tests	`92.26% <90.44%> (?)`
lwt-tests	`89.05% <ø> (?)`
python-c-tests	`72.29% <85.71%> (?)`
python-tests	`98.97% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.75% <ø> (ø)`
c/tskit/tables.c	`90.24% <90.44%> (ø)`
python/_tskitmodule.c	`92.73% <100.00%> (ø)`
python/tskit/tables.py	`98.96% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4efac9...7ea2b59. Read the comment docs.

hyanwong · 2022-10-31T07:59:07Z

I just realised that the easiest way to allow any nodes to be marked as samples is to switch the NODE_IS_SAMPLE flags before simplifying. Since we require the node order to remain the same, I think this is all we need to do.

hyanwong · 2022-10-31T08:45:14Z

I just realised that the easiest way to allow any nodes to be marked as samples is to switch the NODE_IS_SAMPLE flags before simplifying. Since we require the node order to remain the same, I think this is all we need to do.

Actually, it's even easier than that. Recent commit included: I think this works fine:

sample = set(sample)
for node in ts.nodes():
    is_sample = node.id in sample
    self.record_node(node.id, is_sample=is_sample)
    if is_sample:
        self.add_ancestry(node.id, 0, self.sequence_length, node.id)

hyanwong · 2022-10-31T12:08:41Z

I think this now needs to be implemented in the C version of simplify(), right?

hyanwong · 2022-10-31T21:37:06Z

I have coded up the C equivalent, but there are 2 remaining issues:

In merge_labeled_ancestors a new line that @jeromekelleher introduced says is_sample = input_id in self.samples # HACK (and I have coded up a C equivalent). The HACK comment worries me slightly.
The default C call will have TSK_SIMPLIFY_FILTER_NODES set to 0, so the defaults will change to not filter nodes, which I think is causing the C tests to fail. This seems like a rather large change in default behaviour. I'm not sure what the best options are here.

molpopgen · 2022-10-31T22:07:31Z

I am trying to figure out what this does as I think it collides with something that one may need for (one approach to) parallel simplification.

The comments in #2605 are concerning to me re: marking all nodes as samples. That would completely break the stuff that I am presenting in December by rendering it not possible. (Or not possible w/o another flag to simplify that suppresses the behavior described in #2605, but I'd consider that a serious design failure.)

Could someone provide a simple description of what this is doing -- I can't find one by by tracking the linked issues, etc..

hyanwong · 2022-10-31T22:25:45Z

The comments in #2605 are concerning to me re: marking all nodes as samples.

I don't think we mark all nodes as samples. The current implementation of simplify() takes a list of nodes, and turns all those nodes into samples, keeping their ancestry WRT the other nodes, which are marked as non-samples. By default, the list of passed-in nodes is simply all the existing samples in the TS, but you could provide a different list of nodes, in which case these become the samples instead, and any nodes in their genetic ancestry are kept but are flagged as non-sample nodes. As far as I know, simplify() has always worked like this.

Could someone provide a simple description of what this is doing -- I can't find one by by tracking the linked issues, etc..

Well, I think the key is here (code lightly edited for clarity)

if filter_nodes:
    for sample_id in passed_in_node_list:
        output_id = self.record_node(sample_id, is_sample=True)
        self.add_ancestry(sample_id, 0, self.sequence_length, output_id)
else:
    for node in ts.nodes():
        is_sample = node.id in passed_in_node_list
        self.record_node(node.id, is_sample=is_sample)
        if is_sample:
           self.add_ancestry(node.id, 0, self.sequence_length, node.id)

So normally (filter_nodes==True) we record the passed-in node list, treating all of these nodes as samples, and add the ancestry of those nodes into the TS. If we don't filter nodes, we record all the nodes in the normal order (meaning that the node IDs don't change), but only the ones passed in are marked as samples, and we retain the ancestry of those. That means a number of the non-sample nodes in the ts may end up not being referenced in the node table.

Does that make sense? Is my explanation too simplistic @molpopgen ? I feel you probably understand this much better than I do anyway.

molpopgen · 2022-10-31T22:37:22Z

Yes, I think that this makes sense -- the work "all" in the comment elsewhere was ambiguous to me.

So, if filter nodes is False:

all nodes get recorded
only sample nodes have their ancestry preserved.

For this to make sense, all nodes that get recorded by have no referents after simplification should be marked as NULL in the output idmap? Is that the case?

hyanwong · 2022-10-31T22:43:58Z

So, if filter nodes is False:

all nodes get recorded

only sample nodes have their ancestry preserved.

Yes, with the proviso that the sample nodes are defined as the ones passed in to simplify (which by default is all the nodes flagged with tskit.NODE_IS_SAMPLE in the original tree sequence)

For this to make sense, all nodes that get recorded by have no referents after simplification should be marked as NULL in the output idmap? Is that the case?

The idmap doesn't tell you whether the node is used or not, only if the node Id has changed. Even in the case of nodes that have no referents after simplification, the nodes are retained in the table (although not referenced anywhere), and the idmap is simply the identify map. In fact, by definition, the output idmap when filter_nodes=False should be 0..num_nodes-1 and not contain any NULLs (because no nodes are actually removed).

molpopgen · 2022-10-31T22:49:19Z

The idmap doesn't tell you whether the node is used or not, only if the node Id has changed.

This isn't quite right -- a node that "simplifies out" of the tables (is not longer part of the ancestry anywhere) has -1/null in the idmap. For what I envision, I require this to be true for this feature, too.

molpopgen · 2022-10-31T22:50:23Z

Yes, with the proviso that the sample nodes are defined as the ones passed in to simplify (which by default is all the nodes flagged with tskit.NODE_IS_SAMPLE in the original tree sequence)

This is not what I read in the code that you wrote above -- what is "passed in node list"?

molpopgen · 2022-10-31T22:53:47Z

In fact, by definition, the output idmap when filter_nodes=False should be 0..num_nodes-1 and not contain any NULLs (because no nodes are actually removed).

This is the sticking point, and breaks what I need this feature to do. I require a way to: keep the input and output node tables the same length but also to know which nodes are no longer used. The implementation of this that I have done takes this into account by entering NULL into idmap for the latter case.

molpopgen · 2022-10-31T22:58:59Z

My implementtion is here: https://github.com/molpopgen/tskit-rust/tree/parallel_experiment.

I cannot make it easy to see diffs b/c that would trigger a PR against tskit-rust, but you can look at the commit history for the branch and see the changes to simplification (in the C code)

hyanwong · 2022-10-31T23:13:26Z

This is not what I read in the code that you wrote above -- what is "passed in node list"?

The user calls ts.simplify(samples=[10,11,12]). The nodes 10, 11, and 12 become the samples in the output tree sequence. If not give, the samples parameter default to ts.samples(), so that the samples in the input tree sequence remain samples in the output tree sequence (albeit with changed node IDs).

I require a way to: keep the input and output node tables the same length but also to know which nodes are no longer used.

I'm sure we could find a way to report this. We could revise the meaning of the idmap in this case, or (I guess) we could search through the edge table (but that's not very performant). @jeromekelleher might have a sensible suggestion here.

My previous hack would work for this, as a matter of fact: do a normal simplify (removing unreferenced nodes) to create a new tree sequence, look at the nodes with NULL in the map, then edit the original tree sequence by porting the edges over from the new tree sequence back into the old one, with a suitable remapping of parent and child IDs. Maybe that's too hack for you, though?

hyanwong · 2022-10-31T23:42:22Z

I think it could be reasonable to return an id_map with NULL for unused nodes in the filter_nodes=False case, but I don't know if it's actually very easy to implement. I will ask @jeromekelleher .

petrelharp · 2022-12-10T04:21:26Z

All this sounds great to me. Two suggestions:

how about TSK_SIMPLIFY_KEEP_NODES or TSK_SIMPLIFY_KEEP_ALL_NODES to match other flags
how about TSK_SIMPLIFY_NO_MARK_SAMPLES as more concise than NO_UPDATE_SAMPLE_FLAGS (and maybe more clear, because why is it "update"?)

jeromekelleher · 2022-12-12T16:04:59Z

I think it's better to keep this as NO_FILTER following the population etc tables because the KEEP flags (in retrospect) are more about keeping edges than nodes. We have:

TSK_SIMPLIFY_FILTER_SITES
TSK_SIMPLIFY_FILTER_POPULATIONS
TSK_SIMPLIFY_FILTER_INDIVIDUALS
TSK_SIMPLIFY_REDUCE_TO_SITE_TOPOLOGY
TSK_SIMPLIFY_KEEP_UNARY
TSK_SIMPLIFY_KEEP_INPUT_ROOTS
TSK_SIMPLIFY_KEEP_UNARY_IN_INDIVIDUALS

We keep the edges, and therefore the corresponding nodes are not filtered.

The trouble with TSK_SIMPLIFY_NO_MARK_SAMPLES is that we also unmark things that are not samples, so UPDATE is clearer I think.

jeromekelleher · 2022-12-13T20:40:16Z

OK, this should implement the "don't touch the table pointers" semantics when we don't filter unreferenced nodes, sites, populations or individuals. I had to jiggle things around a bit to do it, but I think it was worth the refactoring. Hopefully it's a bit clearer how the copies are used now and it opens the door for future memory reductions as well.

I still need to document and add the NO_UPDATE_SAMPLE_FLAGS. I'm also hoping to add an example with multiple table collections using the same nodes and individuals, to see convince myself that this all works in practise but we'll see how that goes time-wise.

jeromekelleher · 2022-12-14T16:17:17Z

I think this is ready for review and merge - I'll implement TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS separately as this PR has ended up getting quite big.

Turns out making this memory safe required quite a bit of fiddling! I think the code it much better now though, so hopefully worth the effort.

c/tskit/tables.c

benjeffery

LGTM - I only found a couple of typos

c/tskit/tables.h

jeromekelleher · 2023-01-09T19:30:37Z

Added a few minor things from review from @petrelharp and @benjeffery - ready to go I think!

jeromekelleher · 2023-01-09T19:39:29Z

Opened #2676 for the CircleCI build failure (which looks non-trivial)

benjeffery · 2023-01-10T10:47:58Z

git pushing here to trigger CI as for some reason circle CI won't let me click the button.

jeromekelleher · 2023-01-10T11:41:13Z

@Mergifyio rebase

Revert to old implementation Simple tset Update changelog and add C tests Done? Low level tests More tests Fix error-handling error in simplifier_init We were clearing the input tables before checking sample errors Fixup broken tests Add test for mutations Don't clear the node table when not filtering nodes Refactor Modernise simplify test Fix provenance bug Removed unused samples member Implement filter-populations with no-touch semantics updates Refactor finalise references path Finished no-touch semantics on the non-filter casese Remove unused simplify_t struct member Make simplify thread-safe in no-filter case Update changelog

mergify · 2023-01-10T11:41:51Z

⚠️ This pull request got rebased on behalf of a random user of the organization.
This behavior will change on the 1st February 2023, Mergify will pick the author of the pull request instead.

To get the future behavior now, you can configure bot_account options (e.g.: bot_account: { author } or update_bot_account: { author }.

Or you can create a dedicated github account for squash and rebase operations, and use it in different bot_account options.

mergify · 2023-01-10T11:41:52Z

rebase

✅ Branch has been successfully rebased

hyanwong marked this pull request as draft October 30, 2022 22:03

hyanwong changed the title ~~Simplify filter nodes~~ filter_nodes for simplify Oct 30, 2022

hyanwong mentioned this pull request Oct 30, 2022

filter_nodes for simplify #2606

Closed

hyanwong force-pushed the simplify-filter-nodes branch 2 times, most recently from 9da35ab to ac20541 Compare October 31, 2022 11:23

hyanwong force-pushed the simplify-filter-nodes branch 6 times, most recently from 0cb7cbc to b32c3bc Compare October 31, 2022 21:25

hyanwong force-pushed the simplify-filter-nodes branch from b32c3bc to 89d38fe Compare October 31, 2022 21:51

jeromekelleher force-pushed the simplify-filter-nodes branch from 4f49c76 to 62d72ef Compare December 14, 2022 16:15

jeromekelleher marked this pull request as ready for review December 14, 2022 16:15

jeromekelleher requested review from benjeffery and petrelharp December 14, 2022 16:17

jeromekelleher reviewed Dec 14, 2022

View reviewed changes

c/tskit/tables.c Show resolved Hide resolved

jeromekelleher force-pushed the simplify-filter-nodes branch 2 times, most recently from 1688323 to 3f8e970 Compare December 14, 2022 16:34

This was referenced Dec 14, 2022

Add TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS #2662

Closed

Don't update sample flags in simplify #2663

Merged

Multichrom example #2665

Merged

benjeffery approved these changes Dec 21, 2022

View reviewed changes

c/tskit/tables.h Outdated Show resolved Hide resolved

c/tskit/tables.h Outdated Show resolved Hide resolved

benjeffery added this to the Python 0.5.4 milestone Jan 4, 2023

jeromekelleher force-pushed the simplify-filter-nodes branch from 3f8e970 to 649b271 Compare January 9, 2023 19:30

benjeffery force-pushed the simplify-filter-nodes branch from 649b271 to cdb9fa6 Compare January 10, 2023 10:47

hyanwong and others added 2 commits January 10, 2023 11:41

Add TSK_SIMPLIFY_NO_FILTER_NODES

e5d2376

AdminBot-tskit force-pushed the simplify-filter-nodes branch from cdb9fa6 to 7ea2b59 Compare January 10, 2023 11:41

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 10, 2023

mergify bot merged commit 2b61dfd into tskit-dev:main Jan 10, 2023

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filter_nodes for simplify #2619

filter_nodes for simplify #2619

hyanwong commented Oct 30, 2022

codecov bot commented Oct 30, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022 •

edited

Loading

molpopgen commented Oct 31, 2022

hyanwong commented Oct 31, 2022 •

edited

Loading

molpopgen commented Oct 31, 2022

hyanwong commented Oct 31, 2022 •

edited

Loading

molpopgen commented Oct 31, 2022

molpopgen commented Oct 31, 2022

molpopgen commented Oct 31, 2022

molpopgen commented Oct 31, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022

petrelharp commented Dec 10, 2022

jeromekelleher commented Dec 12, 2022 •

edited

Loading

jeromekelleher commented Dec 13, 2022

jeromekelleher commented Dec 14, 2022

benjeffery left a comment

jeromekelleher commented Jan 9, 2023

jeromekelleher commented Jan 9, 2023

benjeffery commented Jan 10, 2023

jeromekelleher commented Jan 10, 2023

mergify bot commented Jan 10, 2023

mergify bot commented Jan 10, 2023

filter_nodes for simplify #2619

filter_nodes for simplify #2619

Conversation

hyanwong commented Oct 30, 2022

Description

PR Checklist:

codecov bot commented Oct 30, 2022 • edited Loading

Codecov Report

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022 • edited Loading

molpopgen commented Oct 31, 2022

hyanwong commented Oct 31, 2022 • edited Loading

molpopgen commented Oct 31, 2022

hyanwong commented Oct 31, 2022 • edited Loading

molpopgen commented Oct 31, 2022

molpopgen commented Oct 31, 2022

molpopgen commented Oct 31, 2022

molpopgen commented Oct 31, 2022 • edited Loading

hyanwong commented Oct 31, 2022

hyanwong commented Oct 31, 2022

petrelharp commented Dec 10, 2022

jeromekelleher commented Dec 12, 2022 • edited Loading

jeromekelleher commented Dec 13, 2022

jeromekelleher commented Dec 14, 2022

benjeffery left a comment

Choose a reason for hiding this comment

jeromekelleher commented Jan 9, 2023

jeromekelleher commented Jan 9, 2023

benjeffery commented Jan 10, 2023

jeromekelleher commented Jan 10, 2023

mergify bot commented Jan 10, 2023

mergify bot commented Jan 10, 2023

✅ Branch has been successfully rebased

codecov bot commented Oct 30, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022 •

edited

Loading

hyanwong commented Oct 31, 2022 •

edited

Loading

molpopgen commented Oct 31, 2022 •

edited

Loading

jeromekelleher commented Dec 12, 2022 •

edited

Loading