Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tutorial: explain how to represent ARGs #43

Open
jeromekelleher opened this issue Jan 29, 2021 · 20 comments
Open

Tutorial: explain how to represent ARGs #43

jeromekelleher opened this issue Jan 29, 2021 · 20 comments

Comments

@jeromekelleher
Copy link
Member

We can simulate some ARGs in msprime, explain how we can losslessly represent ARGs and get all the information you could want.

@jeromekelleher
Copy link
Member Author

Reminder to myself: I wrote a presentation about this for the PopSim workshop which could be adapted for a first pass on this. Some good examples in there which would be a good starting point.

@hyanwong
Copy link
Member

Could you possibly post the presentation, or a link to it, in this issue @jeromekelleher ?

@a-ignatieva
Copy link

Hello, just adding a suggestion as discussed with @hyanwong . It would be handy to see how to extract the number of recombination events from the ARG representation, and if possible split these into those that are undetectable / change only the branch lengths / are detectable with the right mutations. This would be useful for assessing the performance of ARG inference methods.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Feb 18, 2021

Here it is: arg-ts-presentation.pdf

This was the presentation I gave at the PopSim consortium meeting in Aussois, Oct 2019, so some of the information will be a little out of date.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Feb 18, 2021

Hello, just adding a suggestion as discussed with @hyanwong . It would be handy to see how to extract the number of recombination events from the ARG representation, and if possible split these into those that are undetectable / change only the branch lengths / are detectable with the right mutations. This would be useful for assessing the performance of ARG inference methods.

Hi @a-ignatieva, welcome! 👋 I'm sure we can figure out how to compute the things you're interested in. Here's an example:

ts = msprime.sim_ancestry(
    2, recombination_rate=0.1, sequence_length=8,
    record_full_arg=True, random_seed=1234)
print(ts.draw_text())

num_re_nodes = 0
for node in ts.nodes():
    if node.flags == msprime.NODE_IS_RE_EVENT:
        num_re_nodes += 1
# Divide by 2 because each RE event has *two* nodes
print("num RE events = ", num_re_nodes / 2)
print(ts.tables.nodes)

giving

3.45┊         ┊         ┊    18   ┊         ┊  
    ┊         ┊         ┊   ┏━┻━┓ ┊         ┊  
2.76┊    17   ┊         ┊   ┃  17 ┊    17   ┊  
    ┊   ┏━┻━┓ ┊         ┊   ┃   ┃ ┊   ┏━┻━┓ ┊  
2.37┊  16   ┃ ┊         ┊   ┃   ┃ ┊  16   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
2.34┊   ┃   ┃ ┊         ┊  14   ┃ ┊  15   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.93┊  12   ┃ ┊         ┊  13   ┃ ┊  13   ┃ ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.44┊   ┃  11 ┊         ┊   ┃  11 ┊   ┃  11 ┊  
    ┊   ┃   ┃ ┊         ┊   ┃   ┃ ┊   ┃   ┃ ┊  
1.35┊  10   ┃ ┊    10   ┊  10   ┃ ┊  10   ┃ ┊  
    ┊   ┃   ┃ ┊   ┏━┻━┓ ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊  
0.79┊   ┃   8 ┊   ┃   9 ┊  ┃  9 ┃ ┊  ┃  9 ┃ ┊  
    ┊   ┃   ┃ ┊   ┃   ┃ ┊  ┃  ┃ ┃ ┊  ┃  ┃ ┃ ┊  
0.70┊   7   ┃ ┊   7   ┃ ┊  7  ┃ ┃ ┊  7  ┃ ┃ ┊  
    ┊  ┏┻━┓ ┃ ┊  ┏┻━┓ ┃ ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┊  
0.19┊  6  ┃ ┃ ┊  6  ┃ ┃ ┊ 6 ┃ ┃ ┃ ┊ 6 ┃ ┃ ┃ ┊  
    ┊ ┏┻┓ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊  
0.08┊ ┃ 4 ┃ ┃ ┊ ┃ 4 ┃ ┃ ┊ ┃ ┃ ┃ 5 ┊ ┃ ┃ ┃ 5 ┊  
    ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┃ ┃ ┊  
0.00┊ 0 1 2 3 ┊ 0 1 2 3 ┊ 0 2 3 1 ┊ 0 2 3 1 ┊  
  0.00      1.00      4.00      6.00      8.00 

num RE events = 4.0
id      flags   population      individual      time    metadata
0       1       0       0       0.00000000000000
1       1       0       0       0.00000000000000
2       1       0       1       0.00000000000000
3       1       0       1       0.00000000000000
4       131072  0       -1      0.07592809191144
5       131072  0       -1      0.07592809191144
6       0       0       -1      0.19108191052930
7       0       0       -1      0.69575449796515
8       131072  0       -1      0.79179484017695
9       131072  0       -1      0.79179484017695
10      0       0       -1      1.35455502875617
11      262144  0       -1      1.43685779200617
12      131072  0       -1      1.93338594556679
13      131072  0       -1      1.93338594556679
14      131072  0       -1      2.33538022397736
15      131072  0       -1      2.33538022397736
16      262144  0       -1      2.36985359356004
17      0       0       -1      2.75957664444117
18      0       0       -1      3.44661056408756

So I guess the RE event in nodes 14/15 is undetectable because the trees below both they just change the branch length, but the event in 8/9 is detectable because the trees change "above" these nodes? (I'm just trying to get a precise idea of what the condition is so we can write down an efficient way to detecting the events in question)

@a-ignatieva
Copy link

I think here the only recombination changing the unrooted topology is 4/5, while 8/9, 12/13 and 14/15 keep the unrooted topologies the same but change the branch lengths (and the third type would be where neither the topology nor the branch lengths change for local trees around the breakpoint).

Is there anything ready-made in tskit for "gluing" these tree sequences into an ARG? Just wondering, as sometimes it's nice to see what's going on by looking at the graph (though easy to do by hand for small examples).

@hyanwong
Copy link
Member

hyanwong commented Feb 18, 2021

It would be nice to have a way to "convert" (or perhaps simply display) the tree-sequence-with-recombination-nodes as a graph. This is not dissimilar to the animations I did with record_full_arg simulations (below)

Incorporating this into tskit itself would mean solving tskit-dev/tskit#1209 so that tskit had an idea of what a recombination node meant, rather than it being only a specific extension provided by msprime. We might, however, be able to throw together some code that would work on an msprime-produced TS.

For larger args, to avoid lots of crossing lines, this might require some sort of graph layout package. Not only is that rather a burdensome dependency, but we have had problems locating a suitable one (see discussions at tskit-dev/tskit#621). @a-ignatieva - do you use any specific viz packages for looking at ARG graphs?

SPR.mov

@hyanwong
Copy link
Member

Also, have you tried just feeding the edges into e.g. graphviz? I'm not sure if that would work: perhaps you can experiment? You can get the parent and child node IDs for each edge very easily:

for e in ts.edges():
    print(e.parent, e.child)

@a-ignatieva
Copy link

I just use graphviz (dot), as KwARG doesn't care about the y-positions of the nodes.

@jeromekelleher
Copy link
Member Author

Hmmm. So, one way we can get at topology changes is to look at the tree ranks in the simplified tree sequence

sts = ts.simplify()
for tree in sts.trees():
    print(tree.draw_text())
    print(tree.interval, tree.rank())

giving

    7  
  ┏━┻━┓
  5   ┃
 ┏┻━┓ ┃
 4  ┃ ┃
┏┻┓ ┃ ┃
0 1 2 3

Interval(left=0.0, right=1.0) (3, 11)
    6  
  ┏━┻━┓
  5   ┃
 ┏┻━┓ ┃
 4  ┃ ┃
┏┻┓ ┃ ┃
0 1 2 3

Interval(left=1.0, right=4.0) (3, 11)
    8  
  ┏━┻━┓
  6   ┃
 ┏┻━┓ ┃
 5  ┃ ┃
┏┻┓ ┃ ┃
0 2 3 1

Interval(left=4.0, right=6.0) (3, 5)
    7  
  ┏━┻━┓
  6   ┃
 ┏┻━┓ ┃
 5  ┃ ┃
┏┻┓ ┃ ┃
0 2 3 1

Interval(left=6.0, right=8.0) (3, 5)

So, we have two distinct topologies, which change at position 4, and this corresponds to the RE node 4/5. Would this definition work for you @a-ignatieva? If so, I'm sure we can work backwards from knowing where the RE event occured along the sequence to pin-pointing it to the event itself.

@a-ignatieva
Copy link

Yes, that works I think! Though I guess is it complicated when we have two recombination events with the same breakpoint? Eg it looks like 12/13 could also have a breakpoint at position 4 here.
If one of the recombination events was completely "invisible" (no branch length or topology change), then am I right in thinking we would see just one tree instead of two to either side of the breakpoint in the simplified tree sequence?

@jeromekelleher
Copy link
Member Author

If one of the recombination events was completely "invisible" (no branch length or topology change), then am I right in thinking we would see just one tree instead of two to either side of the breakpoint in the simplified tree sequence?

Yes, exactly. You're right that things get more complicated if there are multiple recombinations at the same genome position, but I think that's something we assume is negligible in most models. It depends on exactly what you're looking for - if we just want a count of the detectable recombinations, then I think the number of distinct topologies is as good a definition as you'll get (?). Pinpointing the actual events that give detectable topology changes will make use do a bit more work, but I'm sure we could figure it out.

@a-ignatieva
Copy link

I'm happy with just having a count, so this works for me! Thanks.

@hyanwong
Copy link
Member

When you are plotting out using Graphviz, or whatever, do you somehow merge the 2 corresponding recombination nodes into a single graphviz node, @a-ignatieva ? Or do you have them as 2 separate ones, and have the recombination node below? Have you got it working, with an image you could post here?

@a-ignatieva
Copy link

a-ignatieva commented Feb 27, 2021

This is what I got by throwing in the nodes and edges (I've edited this to add node ranks to keep the correct time ordering):

from graphviz import Digraph

dot = Digraph(strict=False)

with dot.subgraph() as s:
    s.attr(rank = 'same')
    s.attr('node', shape='doublecircle')
    for n in ts.samples(): 
        s.node(str(n))

dot.attr('node', shape='circle')
itr = iter(range(ts.num_samples, ts.num_nodes))
for i in itr:
    n = ts.node(i) 
    if n.flags == msprime.NODE_IS_RE_EVENT:
        # Assign same rank to recombination nodes
        with dot.subgraph() as s:
            s.attr(rank = 'same')
            s.node(str(n.id))
            s.node(str(ts.node(i+1).id))
            next(itr, None)
    else:
        dot.node(str(n.id))
    # Add invisible edges to fix ranks
    dot.edge(str(i), str(i-1), _attributes={'style': 'invis'})

check = [0,0]
for e in ts.edges():
    # Preventing duplicate edges
    if check == [e.parent, e.child]:
        continue
    check = [e.parent, e.child]
    if ts.node(e.parent).flags==msprime.NODE_IS_RE_EVENT:
        dot.edge(str(e.parent), str(e.child), label="["+str(int(e.left))+","+str(int(e.right))+")")
    else:
        dot.edge(str(e.parent), str(e.child))
        
dot.render('pic', format='png', view=True)

pic

It might be nice to add separate recombination nodes and label them by the recombination breakpoint, like the below (which is what we get with kwarg), but I wasn't sure how to extract the breakpoint positions (eg for the 12/13 recombination into node 10) - though it probably doesn't matter, as long as the label is between the two new intervals.
pic

@a-ignatieva
Copy link

a-ignatieva commented Feb 27, 2021

This is a bit tidier, merging pairs of recombination nodes together:

from graphviz import Digraph

dot = Digraph(strict=False)

# Add sample nodes
with dot.subgraph() as s:
    s.attr(rank = 'same')
    for n in ts.samples(): 
        s.node(str(n), **{'shape': 'doublecircle'})

# Internal nodes
itr = iter(range(ts.num_samples, ts.num_nodes))
for i in itr:
    if ts.node(i).flags == msprime.NODE_IS_RE_EVENT:
        # Only add one of the recombination nodes and make the other one invisible
        with dot.subgraph() as s:
            s.attr(rank = 'same')
            s.node(str(i), **{'shape': 'rect'}, label = str(i) + "/" + str(i+1))
            s.node(str(i + 1),  **{'style': 'invis'})
            next(itr, None)
    else:
        dot.node(str(i), **{'shape': 'circle'})
    # Add invisible edges to fix the ranks
    dot.edge(str(i), str(i-1), **{'style': 'invis'})

# Add edges
check = [0,0]
for e in ts.edges():
    ch = e.child
    pa = e.parent
    
    # Preventing duplicate edges
    if check == [pa, ch]:
        continue
    check = [pa, ch]
    
    # Check which edge to draw if there is a recombination
    if e.parent >= 1 and ts.node(pa).time == ts.node(pa-1).time and ts.node(pa).flags == ts.node(pa-1).flags == msprime.NODE_IS_RE_EVENT:
        ch = pa = -1
    elif e.child >= 1 and ts.node(ch).time == ts.node(ch-1).time and ts.node(ch).flags == ts.node(ch-1).flags == msprime.NODE_IS_RE_EVENT:
        ch -= 1
        
    lab = ''
    if ch >= 0 and ts.node(ch).flags == msprime.NODE_IS_RE_EVENT:
        lab = "[" + str(int(e.left)) + "," + str(int(e.right)) + ")"
    
    if(ch >= 0 and pa >= 0):
        dot.edge(str(pa), str(ch), label=lab)
        
dot.render('pic', format='png', view=True)

pic

@hyanwong
Copy link
Member

This is a bit tidier, merging pairs of recombination nodes together:

Oh yes, really nice. Thanks @a-ignatieva . We should have something like that in the ARG tut I think.

@jeromekelleher
Copy link
Member Author

Yes, this is really nice, thanks @a-ignatieva.

@hyanwong
Copy link
Member

hyanwong commented Apr 2, 2021

@a-ignatieva: you might be interested in my recent PR: tskit-dev/tskit#1296. This should allow us to (easily) import a tree sequence into the Python graph software NetworkX. Not sure if this actually helps with viz, but I guess it might be useful for other things? I'd be interested to see how it copes with msprime's record_full_arg option.

hyanwong added a commit to hyanwong/tutorials that referenced this issue Jun 22, 2021
@hyanwong
Copy link
Member

hyanwong commented Nov 9, 2023

A simple example from graphviz and a more complex one of reading it back into networkx are now at https://tskit.dev/tutorials/viz.html#graph-representations. The ARG tutorial contains the code for networkx conversion (https://tskit.dev/tutorials/args.html#sec-args-other-analysis) so I think we can close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants