Ages of extant mutations only #3331

danielpelletier116-prog · 2025-11-20T20:08:33Z

Hi, for a sample of nodes at a time, I'd like to get the ages of extant mutations, but I'm not sure how to separate the extant mutations (the most recent mutation at a site for a node) from ancestral mutations that have been replaced. The mutations table includes all of these with no distinction. It might be more complicated because I'm using the binary mutation model, so the allelic state doesn't tell me which mutation produced the derived state.

Example simple tree sequence, 1-locus chromosome, ending in 5 diploid individuals:
ats=msprime.sim_ancestry(5, sequence_length=1)
Binary (0 or 1) mutations, with state-dependence so no silent mutations:
mts=msprime.sim_mutations(ats, rate=0.5, model=msprime.BinaryMutationModel(state_independent=False))

This gives me a tree sequence with ~8 mutations. At the present time (generation 0), some of those mutations have gone extinct (fully replaced by other mutations), and some exist on one or a few nodes. If I calculate the average age of mutations with:
np.mean(mts.tables.mutations.time)
the ages of all mutations that have happened on the tree are included, rather than just those that are the most recent mutation for at least 1 sample node.

The only possible solution I can think of is to use the node with which each mutation is associated, and find the most recent ancestor of each sample node among those mutation-nodes, which would tell me which mutations are extant. But I'd have to do that separately for each site, which would be extremely slow.

I'd really appreciate advice on how to get the ages (or just mean age) of extant mutations! Hopefully I'm not just missing an obvious function.
Thanks,
Daniel

Answered by hyanwong

hyanwong · 2025-11-20T23:06:33Z

Maintainer

Hi @danielpelletier116-prog - just for reproducibility, it might be worth adding a random_seed to the msprime calls.

I don't think there's a built-in way of doing this, but perhaps there should be (see e.g. #260 (comment)). I had a think, and there are some complex cases when e.g. there is a mutation M above node 10, but also mutations at the same site above all the children of node 10. In this case M is never seen. Here's a hacky way that replaces the derived state with the mutation ID, then collects the states using the variants() iterator. It would be nice to have a version of variants() that, rather than returning an array of genotypes, simply returned an array of mutation IDs, but we don't have such a thing as far as I'm aware.

There's probably a better way to do this by collecting the set of all samples below each mutation and intersecting them somehow, but it'll be a bit more complicated.

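import numpy as np

# ts is the mutated tree sequence (mts in the question above)
tables = ts.dump_tables()
# Blank out the ancestral states and relabel each mutation's derived state with its own ID
tables.sites.packset_ancestral_state(["" for _ in range(ts.num_sites)])
tables.mutations.packset_derived_state([str(m) for m in range(ts.num_mutations)])
used_muts = []
for v in tables.tree_sequence().variants():
    # Only look at alleles carried by at least one sample at this site
    for allele in (v.alleles[i] for i in np.unique(v.genotypes)):
        if allele != '':  # '' is the ancestral state, i.e. no mutation
            used_muts.append(int(allele))

print(used_muts)
print("Times", ts.mutations_time[used_muts])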
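
The set-based alternative mentioned above (collect the samples below each mutation, then remove those covered by a mutation further down) can be sketched in plain Python. This is only an illustration on a hand-made toy tree, not tskit code: the `parent` dictionary, the node and mutation numbering, and the helper names are all invented for the example, and it assumes at most one mutation per node at a single site. It reproduces the awkward case where a mutation above node 10 is hidden because both children of node 10 carry their own mutations.

```python
# Toy tree as child -> parent (None marks the root); samples are the leaves 0..3.
parent = {0: 10, 1: 10, 2: 11, 3: 11, 10: 12, 11: 12, 12: None}
samples = [0, 1, 2, 3]

# Hypothetical mutations at one site: mutation ID -> node it sits above.
# Mutation 0 is above node 10, but both children of node 10 have their own mutation.
mutation_node = {0: 10, 1: 0, 2: 1, 3: 11}


def ancestors(node):
    """Yield the node itself, then each ancestor up to the root."""
    while node is not None:
        yield node
        node = parent[node]


def samples_below(node):
    """Set of samples that descend from (or are) the given node."""
    return {s for s in samples if node in ancestors(s)}


def carriers(mut):
    """Samples whose most recent mutation at this site is `mut`."""
    node = mutation_node[mut]
    below = samples_below(node)
    for other, other_node in mutation_node.items():
        # Remove samples covered by a mutation strictly below this one.
        if other != mut and node in ancestors(parent[other_node]):
            below -= samples_below(other_node)
    return below


extant = [m for m in mutation_node if carriers(m)]
print(extant)
```

Here mutation 0 ends up with no carriers, so only mutations 1, 2 and 3 count as extant. The repeated ancestor walks make this quadratic in the worst case, so it is meant to show the logic rather than to be fast.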

1 reply

danielpelletier116-prog Nov 21, 2025
Author

Thank you, this seems to work! I worked out a way to do it by keeping only the most recent mutation associated with each node in the mutations table, then using the link_ancestors table to get the most recent ancestors of the sample nodes out of the mutation-nodes, but hadn't figured out how to do it for all sites yet.
It would definitely be handy to have some sort of mutation IDs to help link variants to mutations, and a method to remove mutations from a tree sequence when they contribute no derived states in a sample set, as discussed in the issue you linked.
Thanks!
Daniel

petrelharp · 2025-11-21T00:07:47Z

Maintainer

I think the right answer to this is that we need to implement the mutation_frequency function; then you can filter by mutations at frequency > 0 and < 1. There was discussion of that here: #504

Note the example code in that issue.

1 reply

hyanwong Nov 21, 2025
Maintainer

Oh yes, counting descendants and then subtracting the parent mutation counts is much better. How extremely stupid not to have thought of that.

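import numpy as np
import tskit

# ts is the mutated tree sequence (mts in the question above)
mutation_counts = np.zeros(ts.num_mutations, dtype=int)
for tree in ts.trees():
    for site in tree.sites():
        for m in site.mutations:
            # start with every sample below the mutation's node
            mutation_counts[m.id] = tree.num_samples(m.node)
            if m.parent != tskit.NULL:
                # assumes parents visited first, required by tskit mutation order
                mutation_counts[m.parent] -= mutation_counts[m.id]

# IDs of mutations still carried by at least one sample
used_muts = np.where(mutation_counts > 0)[0]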
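
To see why the subtraction works, and how the counts lead to the mean age asked for in the original question, here is a small stdlib-only walk-through. It does not use tskit: the sample counts, parent links and times are made-up toy values for a single site, chosen to match the "mutation above node 10 with mutations above both its children" case discussed earlier, and the variable names are just for illustration.

```python
from statistics import mean

# Toy inputs (invented for illustration): number of samples below each node,
# and mutations listed parents-first as (node, parent mutation ID or None, time).
num_samples_below = {10: 2, 0: 1, 1: 1, 11: 2}
mutations = [
    (10, None, 5.0),  # mutation 0: above node 10
    (0, 0, 1.0),      # mutation 1: above sample 0, child of mutation 0
    (1, 0, 2.0),      # mutation 2: above sample 1, child of mutation 0
    (11, None, 3.0),  # mutation 3: above node 11
]
total_samples = 4

counts = [0] * len(mutations)
for mut_id, (node, parent_mut, _time) in enumerate(mutations):
    # Start from all samples below the mutation's node ...
    counts[mut_id] = num_samples_below[node]
    if parent_mut is not None:
        # ... and take them away from the parent mutation they override.
        counts[parent_mut] -= counts[mut_id]

# Extant: carried by at least one sample. Segregating: also not fixed.
extant = [m for m, c in enumerate(counts) if c > 0]
segregating = [m for m in extant if counts[m] < total_samples]
mean_age = mean(mutations[m][2] for m in extant)

print(counts, extant, mean_age)
```

Mutation 0 drops to a count of zero once both of its child mutations have been subtracted, so it is excluded from the mean. With real data the same final step would presumably be an average of `ts.mutations_time[used_muts]`.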
