
Add count_call_alternate_alleles function #282

Open
@eric-czech


I'm not sure about the name, but it seems like the most obvious choice given how our other counting functions are named.

This should do what https://scikit-allel.readthedocs.io/en/stable/model/ndarray.html#allel.Genotypes.to_n_alt does.

Likely tasks:

  • Ensure that count_call_alleles has been run
  • Set dtype so counts aren't int64
  • Use the above to sum non-reference allele counts over the last (alleles) dimension (e.g. ds.call_allele_count[:, :, 1:].sum(dim='alleles'))
    • For handling missing data, it may make sense to add another numba gufunc or have an option that tells call_allele_count to count missing alleles too. That could get tricky with https://github.com/pystatgen/sgkit/issues/243 though, so it is probably even better if the function relies on the missingness mask instead (= slightly less efficient but more readable code).
  • Document the fact that we always assume the first allele is the reference allele in this function
  • Decide what to do with missing data, my preference is:
    • Default to the behavior of allel.Genotypes.to_n_alt(fill=-1), meaning that partially or completely missing calls result in a -1 count
    • Not even have an option to fill missing values with 0 -- I'm not sure why that's the default in scikit-allel. @alimanfoo is there an important application for that?
  • Change reference in PCA https://github.com/pystatgen/sgkit/pull/262
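
To make the proposed semantics concrete, here is a rough NumPy-only sketch of the two routes discussed above. It is not the eventual sgkit implementation: the function names `count_alt_alleles` and `n_alt_from_allele_counts` are hypothetical, the input layouts `(variants, samples, ploidy)` and `(variants, samples, alleles)` are assumptions, and missing alleles are assumed to be encoded as negative values. Any call with at least one missing allele is filled with -1.

```python
import numpy as np


def count_alt_alleles(call_genotype, fill=-1):
    """Hypothetical helper: per-call count of non-reference alleles.

    Assumes shape (variants, samples, ploidy), with 0 as the reference
    allele, values > 0 as alternate alleles and values < 0 as missing.
    """
    gt = np.asarray(call_genotype)
    # A call counts as missing if any of its alleles is missing.
    missing = (gt < 0).any(axis=-1)
    # Count alleles that are neither reference (0) nor missing (< 0).
    n_alt = (gt > 0).sum(axis=-1, dtype=np.int8)
    # Partially or completely missing calls get the fill value.
    return np.where(missing, np.int8(fill), n_alt)


def n_alt_from_allele_counts(call_allele_count, missing_mask, fill=-1):
    """Hypothetical helper: same result from per-call allele counts.

    Assumes shape (variants, samples, alleles) with the reference
    allele first, plus a boolean (variants, samples) missingness mask.
    """
    ac = np.asarray(call_allele_count)
    # Sum every allele column except the first (reference) one.
    n_alt = ac[..., 1:].sum(axis=-1, dtype=np.int8)
    return np.where(missing_mask, np.int8(fill), n_alt)


# Two variants, three diploid samples, up to three alleles.
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 2], [-1, 1], [-1, -1]],
    ],
    dtype=np.int8,
)
# Per-call allele counts and missingness mask matching `gt`.
ac = np.array(
    [
        [[2, 0, 0], [1, 1, 0], [0, 2, 0]],
        [[1, 0, 1], [0, 1, 0], [0, 0, 0]],
    ],
    dtype=np.uint8,
)
mask = (gt < 0).any(axis=-1)

n_alt = count_alt_alleles(gt)
n_alt_from_counts = n_alt_from_allele_counts(ac, mask)
```

Both routes narrow the result to int8 rather than leaving it as int64, which is the point of the dtype task above. The second route depends on the mask rather than on the allele counts to detect missing calls, which is the more readable option suggested in the list.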

Metadata


    Labels

    core operations: Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc.
