Ancestral states when sample data is missing #270

@hyanwong

Something I just realised after implementing a unit test for tsinfer. If we have samples that are missing a left-hand or right-hand chunk of information, then my recent push (tskit-dev/tsinfer#169) allows us to build a decent tree sequence. A sample that has a particular region missing will simply have its node unconnected to any other node in that region of the tree sequence (it will just be an unconnected point when plotted). But when we generate a haplotype from this tree sequence, the missing regions will contain zeroes, not tskit.MISSING_DATA. That’s because the ancestral state is zero in these regions and, since the sample node isn’t linked by an edge to anything there, it just gets the ancestral state.
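To make the problem concrete, here is a minimal sketch in plain Python. It is a toy model of a single site, not the tskit API: the `genotype` helper, the `parent` map and the node numbers are all hypothetical, and it assumes the usual rule that a node takes the state of the nearest mutation above it, falling back to the ancestral state.

```python
# Toy model (not the tskit API): one site, with the local tree stored
# as a child -> parent map. All names here are hypothetical.
ANCESTRAL_STATE = 0


def genotype(node, parent, mutations, ancestral_state=ANCESTRAL_STATE):
    """Walk from `node` towards its root; the first mutation met sets the state."""
    while node is not None:
        if node in mutations:
            return mutations[node]
        node = parent.get(node)
    # No mutation on the path to the root: fall back to the ancestral state.
    return ancestral_state


# Samples 0 and 1 join at node 3. Sample 2 has no edge in this region,
# because its data is missing here.
parent = {0: 3, 1: 3}
mutations = {1: 1}  # derived state 1 on the branch above sample 1

genotypes = [genotype(s, parent, mutations) for s in (0, 1, 2)]
print(genotypes)  # [0, 1, 0]
```

Sample 2 comes out as 0, which is indistinguishable from sample 0 even though nothing was observed for it at this site.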

There are a few reasonable ways around this. One way (1) is to say that any completely unconnected node should always get a missing data flag. Another (2) is to say that for these sites, the ancestral state is tskit.MISSING_DATA, then place a mutation to 0 above the root of the main tree. Yet another (3), which is more generic but has a wider impact, is to set the ancestral state to tskit.MISSING_DATA for all sites, and always require a mutation to 0 to be placed above the main tree root.
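Options (1) and (2) can be sketched on a toy single-site model in plain Python. This is not the tskit API: `state_above`, `is_isolated` and the node numbers are hypothetical, and `MISSING_DATA` is defined locally with the same value as tskit.MISSING_DATA (-1). Option (3) would be the option (2) encoding applied to every site rather than only the affected ones.

```python
# Toy model (not the tskit API): one site, local tree as child -> parent.
MISSING_DATA = -1  # same value as tskit.MISSING_DATA


def state_above(node, parent, mutations, ancestral_state):
    """Return the state of the nearest mutation at or above `node`, else the ancestral state."""
    while node is not None:
        if node in mutations:
            return mutations[node]
        node = parent.get(node)
    return ancestral_state


def is_isolated(node, parent):
    """True if the node has neither a parent nor a child in this tree."""
    return node not in parent and node not in parent.values()


parent = {0: 3, 1: 3}  # sample 2 has no edges in this region
samples = (0, 1, 2)

# Option 1: keep ancestral state 0, but report isolated samples as missing.
mutations_1 = {1: 1}
option_1 = [
    MISSING_DATA if is_isolated(s, parent) else state_above(s, parent, mutations_1, 0)
    for s in samples
]

# Option 2: ancestral state is MISSING_DATA, plus a mutation to 0 above
# the root of the main tree (node 3 here).
mutations_2 = {3: 0, 1: 1}
option_2 = [state_above(s, parent, mutations_2, MISSING_DATA) for s in samples]

print(option_1)  # [0, 1, -1]
print(option_2)  # [0, 1, -1]
```

Both encodings give the same haplotype in this example. The difference is where the information lives: option (1) changes how genotypes are decoded, while option (2) changes what is stored in the site and mutation data.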

I see this as a general problem: wherever the tree sequence has multiple roots at a position, there might be a different ancestral state for each root.
