I'm not so sure about that name but it seems like the most obvious choice given how our other counting functions are named.
This should do what https://scikit-allel.readthedocs.io/en/stable/model/ndarray.html#allel.Genotypes.to_n_alt does.
Likely tasks:
- Ensure that `count_call_alleles` has been run
  - Set dtype so counts aren't int64
- Use the above to sum non-reference allele counts (e.g. `ds.call_allele_count[:, :, 1:].sum(dim='alleles')`)
  - For handling missing data, it may make sense to add another numba gufunc or have an option that tells `call_allele_count` to count missing alleles too. That could get tricky with https://github.com/pystatgen/sgkit/issues/243 though, so it is probably even better if the function relies on the missingness mask instead (= slightly less efficient but more readable code).
- Document the fact that we always assume the first allele is the reference allele in this function
- Decide what to do with missing data. My preference is:
  - Default to the behavior of `allel.Genotypes.to_n_alt(fill=-1)`, meaning that partially or completely missing calls result in a -1 count
  - Not even have an option to fill missing values with 0 -- I'm not sure why that's the default in scikit-allel. @alimanfoo is there an important application for that?
- Change reference in PCA https://github.com/pystatgen/sgkit/pull/262
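
To make the tasks above concrete, here is a rough NumPy-only sketch of the intended behavior. It is not sgkit code: the function names `allele_counts_sketch` and `n_alt_sketch` are hypothetical, and it assumes genotype calls are stored as a `(variants, samples, ploidy)` integer array in which negative values mark missing alleles and allele 0 is the reference.

```python
import numpy as np


def allele_counts_sketch(gt: np.ndarray, n_alleles: int) -> np.ndarray:
    """Per-call allele counts with shape (variants, samples, alleles).

    Hypothetical stand-in for the per-call allele counting step.
    Missing alleles (negative values) match no allele index, so they
    are simply not counted.
    """
    alleles = np.arange(n_alleles, dtype=gt.dtype)
    # Compare every allele in every call against each allele index,
    # then sum over the ploidy axis.
    return (gt[..., np.newaxis] == alleles).sum(axis=-2).astype(np.uint8)


def n_alt_sketch(gt: np.ndarray, n_alleles: int, fill: int = -1) -> np.ndarray:
    """Number of non-reference alleles per call, with `fill` for missing calls.

    Assumes the first allele (index 0) is the reference allele.
    """
    counts = allele_counts_sketch(gt, n_alleles)
    # Sum counts of every allele except the reference; keep a small dtype.
    n_alt = counts[..., 1:].sum(axis=-1, dtype=np.int8)
    # Missingness mask: a call is missing if any of its alleles is missing,
    # so partially and completely missing calls are treated the same way.
    missing = (gt < 0).any(axis=-1)
    n_alt[missing] = fill
    return n_alt


# Two variants, three samples, diploid calls; -1 marks a missing allele.
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 2], [-1, 1], [-1, -1]],
    ],
    dtype=np.int8,
)
result = n_alt_sketch(gt, n_alleles=3)
```

Relying on the mask rather than counting missing alleles keeps the counting step unchanged, at the cost of one extra pass over the genotype array.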