Description
Something I just realised after implementing a unit test for tsinfer. If we have samples that are missing a left-hand or right-hand chunk of information, then my recent push (tskit-dev/tsinfer#169) allows us to build a decent tree sequence: a sample that has a particular region missing will simply have its node unconnected to any other node in that region of the tree sequence (it will just be an isolated point when plotted). But when we generate a haplotype from this tree sequence, the missing regions will contain zeroes, not tskit.MISSING_DATA. That’s because the ancestral state is zero at the sites in these regions, and since the node isn’t linked by an edge to anything there, it just gets the ancestral state.
There are a few reasonable ways around this. One way (1) is to say that any completely unconnected node should always get a missing data flag. Another (2) is to say that for these sites the ancestral state is tskit.MISSING_DATA, and then place a mutation to ‘0’ above the root of the main tree. Yet another (3), which is more generic but has a wider impact, is to set the ancestral state to tskit.MISSING_DATA for all sites and always require a mutation to 0 to be placed above the main tree root.
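To make the problem and option (1) concrete, here is a rough pure-Python sketch. It does not use the tskit API: the `genotype_at` function, the `(left, right, parent, child)` edge tuples, the `isolated_as_missing` flag and the `MISSING_DATA = -1` placeholder are all made up for illustration, and the model is simplified to a single site with at most one mutation per node.

```python
MISSING_DATA = -1  # placeholder standing in for tskit.MISSING_DATA


def genotype_at(position, node, edges, mutations,
                ancestral_state=0, isolated_as_missing=False):
    """Return the state of `node` at a site located at `position`.

    edges: list of (left, right, parent, child) tuples (hypothetical format).
    mutations: dict mapping node -> derived state at this one site.
    """
    # Local tree at this position: keep only edges whose interval covers it.
    parent_of = {
        child: parent
        for (left, right, parent, child) in edges
        if left <= position < right
    }
    has_children = set(parent_of.values())

    # Option (1): a node with no parent and no children here is isolated,
    # so report missing data instead of falling through to the ancestral state.
    if isolated_as_missing and node not in parent_of and node not in has_children:
        return MISSING_DATA

    # Otherwise walk up towards the root and take the first mutation found.
    u = node
    while True:
        if u in mutations:
            return mutations[u]
        if u not in parent_of:
            return ancestral_state  # reached a root with no mutation above
        u = parent_of[u]


# Samples 0, 1, 2 under root 3; sample 2 has no data in [0, 5).
edges = [(0, 10, 3, 0), (0, 10, 3, 1), (5, 10, 3, 2)]
mutations = {0: 1}  # one mutation above node 0

# Site at position 2, inside the region where sample 2 is missing.
current = [genotype_at(2, u, edges, mutations) for u in (0, 1, 2)]
flagged = [genotype_at(2, u, edges, mutations, isolated_as_missing=True)
           for u in (0, 1, 2)]
```

With these inputs, `current` ends in the ancestral state 0 for sample 2, while `flagged` ends in `MISSING_DATA`. Options (2) and (3) would instead change `ancestral_state` and add a mutation above the main root, which this sketch does not attempt.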
I see this as a general problem: at any position in the tree sequence where there are multiple roots, there might be a different ancestral state for each root.