Description
At the moment we require people to order their allele lists so that the first allele is the ancestral state. This can be a little fiddly, and often the ancestral state isn't known anyway, so it would be good to have a way for users to indicate this while keeping the site in the final TS. Currently, the way to incorporate a site with an unknown ancestral state is to pick an allele at random to put first and include the site's position in the exclude_positions parameter of generate_ancestors(), which seems pretty obscure to me.
I suggest a non-breaking change that would allow this: an extra parameter "ancestral_state_index", defaulting to 0, meaning that the zeroth allele in the alleles parameter is treated as the ancestral state. This would be similar to the sort of format we expect to see when we switch to using sgkit as the input format. We could allow tskit.NULL to be passed to mean that we don't know the ancestral state, and that the site should be placed using parsimony.
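To make the proposal concrete, here is a rough sketch of what honouring "ancestral_state_index" could involve for a single site. It is plain Python rather than tsinfer code: the function recode_site and its return shape are hypothetical, and the only thing borrowed from tskit is that tskit.NULL has the value -1 (which this sketch also uses for missing genotypes).

```python
NULL = -1  # same value as tskit.NULL; also treated as "missing" in genotypes here


def recode_site(alleles, genotypes, ancestral_state_index=0):
    """Hypothetical helper: move the ancestral allele to index 0.

    Returns (alleles, genotypes, ancestral_known). If the ancestral state
    is unknown (NULL), the site is returned unchanged and flagged so that
    a caller could leave it out of ancestor generation and place it by
    parsimony afterwards.
    """
    if ancestral_state_index == NULL:
        return list(alleles), list(genotypes), False
    if not 0 <= ancestral_state_index < len(alleles):
        raise ValueError("ancestral_state_index out of range for alleles")

    # order[new_index] = old_index, with the ancestral allele first and
    # the remaining alleles kept in their original relative order.
    order = [ancestral_state_index] + [
        i for i in range(len(alleles)) if i != ancestral_state_index
    ]
    old_to_new = {old: new for new, old in enumerate(order)}
    old_to_new[NULL] = NULL  # missing genotypes stay missing

    new_alleles = [alleles[old] for old in order]
    new_genotypes = [old_to_new[g] for g in genotypes]
    return new_alleles, new_genotypes, True
```

With the default of 0 the site passes through untouched, which is what would make the change non-breaking for existing callers.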
This seems pretty intuitive to me, but there are some decisions to be made. When we add a site with (say) ancestral_state_index=1, do we recode the genotypes and allele list before sticking them into the sample data file (which seems a bit hacky, and is not likely to be portable to sgkit)? Or do we switch the genotypes when we build the ancestors, in which case we will also need to swap the genotypes when we match against the ancestors using match_samples or match_ancestors? If the latter, I suspect it would be helpful for sanity-checking purposes to store the meaning of the genotype values (i.e. the reordered alleles list) in the AncestorData instance.
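For the second option, a minimal sketch of the bookkeeping might look like the following. Everything here is an assumption for illustration: SiteRecoding is a hypothetical class, not part of tsinfer, and it stands in for whatever per-site record would be stored alongside the ancestors. The point is only that the same mapping has to be applied when building ancestors and again when matching, and that keeping the reordered alleles list makes the internal genotype values checkable.

```python
from dataclasses import dataclass

MISSING = -1  # assumed missing-genotype value, passed through unchanged


@dataclass(frozen=True)
class SiteRecoding:
    """Hypothetical per-site record of how genotype values were reordered."""

    alleles: tuple      # reordered alleles, ancestral first (internal meaning)
    old_to_new: tuple   # old_to_new[original index] = internal index

    @classmethod
    def from_site(cls, alleles, ancestral_state_index):
        order = [ancestral_state_index] + [
            i for i in range(len(alleles)) if i != ancestral_state_index
        ]
        old_to_new = [0] * len(alleles)
        for new, old in enumerate(order):
            old_to_new[old] = new
        return cls(tuple(alleles[old] for old in order), tuple(old_to_new))

    def encode(self, genotypes):
        """Original coding -> internal coding (ancestral allele is 0)."""
        return [g if g == MISSING else self.old_to_new[g] for g in genotypes]

    def decode(self, genotypes):
        """Internal coding -> original coding, e.g. for sanity checks."""
        new_to_old = [self.old_to_new.index(n) for n in range(len(self.alleles))]
        return [g if g == MISSING else new_to_old[g] for g in genotypes]


# The same record would be used at ancestor-building time and at match time.
rec = SiteRecoding.from_site(("A", "T"), ancestral_state_index=1)
original = [0, 1, 1, MISSING]
internal = rec.encode(original)
assert rec.decode(internal) == original  # round trip recovers the input
```

The sample data stays exactly as the user supplied it under this approach, which is what would keep it portable to another input format; the cost is that every code path touching genotypes has to remember to go through the recoding.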