Numpy access to ancestral and derived state columns in TreeSequence #2632

Open
@jeromekelleher


It would be really nice if we could access the ancestral and derived states as numpy string arrays on the tree sequence like

ts.mutations_derived_state
ts.sites_ancestral_state

which return the actual string states (rather than bytes and offsets, like the Tables API).

In the common single-allele case, we can easily do this:

ancestral_state = tables.sites.ancestral_state.view("S1")
Out[64]: array([b'T', b'G', b'A', ..., b'T', b'T', b'A'], dtype='|S1')

However, this is a byte array and "not recommended". I guess we should try to return the values as proper numpy str types?

We can easily convert to str like this, but it makes a copy (I assume?):

derived_state.astype(str)
Out[63]:
array(['T', 'G', 'A', ..., 'T', 'T', 'A'], dtype='<U1')

I haven't managed to make a direct view of the memory that gets treated as a unicode string.
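A quick check of the copy question, as a sketch only: the array below is a hand-built int8 stand-in for `tables.sites.ancestral_state`, not real tskit data. Since numpy's unicode dtype stores 4 bytes per character while the bytes dtype stores 1, a zero-copy unicode view of the single-byte column does not look possible.

```python
import numpy as np

# Stand-in for tables.sites.ancestral_state: one ASCII byte per site (int8).
raw = np.frombuffer(b"TGAT", dtype=np.int8)

as_bytes = raw.view("S1")      # reinterprets the same buffer, no copy
as_str = as_bytes.astype(str)  # converts to '<U1', which needs a new buffer

print(np.shares_memory(raw, as_bytes))     # True
print(np.shares_memory(raw, as_str))       # False
print(as_bytes.itemsize, as_str.itemsize)  # 1 4
```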

As a first pass we could just error out if the states aren't all single chars (ASCII chars, even), and I guess we could just copy the data in a way that's numpy-string friendly when we have variable-length alleles? I guess it would be good to know if this is possible before painting ourselves into a corner.
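To get a feel for whether the variable-length case is workable, here is a rough sketch. `unpack_states` is a hypothetical helper, not part of tskit, and it assumes ASCII states packed in the usual (data, offsets) layout with offsets running from 0 to `len(data)`. It takes the single-char fast path when it can and otherwise copies each state into a numpy str array.

```python
import numpy as np


def unpack_states(data, offsets):
    """Hypothetical helper: decode a ragged (data, offsets) column to a numpy str array."""
    data = np.ascontiguousarray(data, dtype=np.int8)
    offsets = np.asarray(offsets, dtype=np.int64)
    lengths = np.diff(offsets)
    if len(lengths) > 0 and np.all(lengths == 1):
        # Common case: every state is one byte, so view as S1 and convert once.
        return data.view("S1").astype(str)
    # General case: slice out each state and decode it (always copies).
    buf = data.tobytes()
    states = [buf[a:b].decode("ascii") for a, b in zip(offsets[:-1], offsets[1:])]
    return np.array(states, dtype=str)


single = unpack_states(np.frombuffer(b"TGA", dtype=np.int8), [0, 1, 2, 3])
ragged = unpack_states(np.frombuffer(b"ACGTT", dtype=np.int8), [0, 2, 2, 5])
print(single, ragged)
```

One thing to watch in the general case: the result's dtype is as wide as the longest allele, so a single long insertion inflates memory for every row.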

Related: #2631

Labels: Python API, enhancement
