Description
It would be really nice if we could access the ancestral and derived states as numpy string arrays on the tree sequence like
ts.mutations_derived_state
ts.sites_ancestral_state
which return the actual string states (rather than bytes and offsets, like the Tables API).
In the common single-allele case, we can easily do this:
ancestral_state = tables.sites.ancestral_state.view("S1")
Out[64]: array([b'T', b'G', b'A', ..., b'T', b'T', b'A'], dtype='|S1')
However, this is a byte array and "not recommended". I guess we should try to return the values as proper numpy str types?
We can easily do this, but it makes a copy (I assume?):
ancestral_state.astype(str)
Out[63]:
array(['T', 'G', 'A', ..., 'T', 'T', 'A'], dtype='<U1')
I haven't managed to make a direct view of the memory that gets treated as a unicode string.
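For what it's worth, here is a small numpy-only sketch of why I think a zero-copy unicode view isn't available: `S1` items are 1 byte wide, while `U1` items are 4 bytes wide (numpy stores unicode as UCS-4), so the conversion has to allocate new memory. The `raw` array is just a made-up stand-in for `tables.sites.ancestral_state`, not real data.

```python
import numpy as np

# Made-up stand-in for tables.sites.ancestral_state:
# one ASCII byte per site, stored as an int8 column.
raw = np.array([ord(c) for c in "TGATTA"], dtype=np.int8)

# Reinterpreting int8 as 1-byte strings is a pure view (no copy).
as_bytes = raw.view("S1")

# Converting to str yields '<U1', which needs 4 bytes per character,
# so numpy has to allocate and fill a new buffer.
as_str = as_bytes.astype(str)

print(as_bytes.dtype, as_bytes.dtype.itemsize)
print(as_str.dtype, as_str.dtype.itemsize)
print(np.shares_memory(raw, as_bytes), np.shares_memory(raw, as_str))
```

So the `astype(str)` route costs roughly 4x the memory of the underlying column, which may or may not matter for large tree sequences.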
As a first pass we could just error out if the states aren't all single chars (ASCII chars, even), and I guess we could just copy the data in a way that's numpy-string-friendly when we have variable-length alleles? I guess it would be good to know if this is possible before painting ourselves into a corner.
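A rough sketch of what the copying fallback might look like, assuming ASCII-only states and the usual ragged `(data, offset)` column layout from the Tables API (e.g. `tables.sites.ancestral_state` and `tables.sites.ancestral_state_offset`). The helper name `states_as_str` is hypothetical, not an existing tskit function, and this is only meant to show that the variable-length case seems doable in principle.

```python
import numpy as np


def states_as_str(data, offset):
    """Hypothetical helper: decode a ragged (data, offset) column
    into a numpy str array. Assumes ASCII-encoded states."""
    data = np.asarray(data, dtype=np.int8)
    offset = np.asarray(offset)
    lengths = np.diff(offset)
    if np.all(lengths == 1):
        # Common single-character case: view as bytes, then convert.
        return data.view("S1").astype(str)
    # Variable-length case: slice and decode each state, then let numpy
    # pick a fixed-width '<U{n}' dtype wide enough for the longest one.
    buf = data.tobytes()
    states = [buf[a:b].decode("ascii") for a, b in zip(offset[:-1], offset[1:])]
    return np.array(states, dtype=str)


# Made-up example: four states "T", "G", "AAC", "T".
data = np.frombuffer(b"TGAACT", dtype=np.int8)
offset = np.array([0, 1, 2, 5, 6])
out = states_as_str(data, offset)
print(out, out.dtype)
```

One thing to keep in mind with this approach is that the resulting array is fixed-width, so a single long allele makes every element as wide as the longest state.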
Related: #2631