Ages of extant mutations only #3331
-
|
Hi, for a sample of nodes at a time, I'd like to get the ages of extant mutations, but I'm not sure how to separate the extant mutations (most recent mutation at a site for a nodes) from ancestral mutations that have been replaced. The mutations table includes all of these with no distinction. Might be more complicated because I'm using the binary mutation model, so the state of the allele isn't informative as to the mutation from which it got its derived state. Example simple tree sequence, 1-locus chromosome, ending in 5 diploid individuals: This gives me a treesequence with ~8 mutations. At present time (generation 0, or 'present'), some of those mutations have gone extinct (fully replaced by other mutations), and some exist on one or a few nodes. If I want to calculate the average age of mutations: The only possible solution I can think of is using the node to which each mutation is associated, and getting the most recent ancestor of each sample nodes from those mutation-nodes, which would tell me which mutations are extant. But, I'd have to do that separately for each site, which would be extremely slow. I'd really appreciate advice on how to get the ages (or just mean age) of extant mutations! Hopefully I'm not just missing an obvious function. |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 2 replies
-
|
Hi @danielpelletier116-prog - just for replicability issues, it might be worth adding a random_seed to the msprime issues. I don't think there's a built-in way of doing this, but perhaps there should be (see e.g. #260 (comment)). I had a think, and there are some complex cases when e.g. there is a mutation M above node 10, but also mutations at the same site above all the children of node 10. In this case M is never seen. Here's a hacky way that replaces the derived state with the mutation ID, then collects the states using the variants() iterator. It would be nice to have a version of There's probably a better way to do this by collecting the set of all samples below each mutation and intersecting them somehow, but it'll be a bit more complicated. tables = ts.dump_tables()
tables.sites.packset_ancestral_state(["" for _ in range(ts.num_sites)])
tables.mutations.packset_derived_state([str(m) for m in range(ts.num_mutations)])
used_muts = []
for v in tables.tree_sequence().variants():
for allele in (v.alleles[i] for i in np.unique(v.genotypes)):
if allele != '':
used_muts.append(int(allele))
print(used_muts)
print("Times", ts.mutations_time[used_muts]) |
Beta Was this translation helpful? Give feedback.
-
|
I think the right answer to this is that we need to implement the Note the example code in that issue. |
Beta Was this translation helpful? Give feedback.
Hi @danielpelletier116-prog - just for replicability issues, it might be worth adding a random_seed to the msprime issues.
I don't think there's a built-in way of doing this, but perhaps there should be (see e.g. #260 (comment)). I had a think, and there are some complex cases when e.g. there is a mutation M above node 10, but also mutations at the same site above all the children of node 10. In this case M is never seen. Here's a hacky way that replaces the derived state with the mutation ID, then collects the states using the variants() iterator. It would be nice to have a version of
variants()that, rather than returning an array of genotypes, simply returned an array of mutation IDs, …