Description
Raising this issue to discuss implementing a function that converts a genotype call array to a dosage array.
In general, the output array should have at least two dimensions, the first two being (variants, samples). Each element gives the dosage of an allele, i.e., how many copies of that allele the individual carries.
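To make the discussion concrete, here is a minimal sketch of the simplest case: counting non-reference alleles per call. The function name `alt_dosage` is hypothetical, and the input conventions are assumptions rather than anything decided here: calls are stored as an integer array of shape (variants, samples, ploidy), 0 is the reference allele, and negative values mark missing alleles.

```python
import numpy as np


def alt_dosage(gt):
    """Count non-reference alleles per call (hypothetical helper).

    Assumed input: integer array of shape (variants, samples, ploidy),
    where 0 = reference, > 0 = alternate, < 0 = missing.
    Returns a float array of shape (variants, samples); any call with a
    missing allele becomes NaN instead of being counted as reference.
    """
    gt = np.asarray(gt)
    dosage = (gt > 0).sum(axis=-1).astype(float)
    dosage[(gt < 0).any(axis=-1)] = np.nan
    return dosage


# Two variants, three diploid samples; the middle call of variant 2 is missing.
gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [-1, -1], [0, 2]],
])
dosage = alt_dosage(gt)
```

Using NaN for missing calls forces a float output, which is one reason the dtype question matters.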
Some questions for discussion:
- How do we handle biallelic and multiallelic variants?
- How do we handle missing genotype calls and avoid creating a bias towards a particular allele (e.g., the reference allele)?
- What should the output dtype be: int, float, or either? (Some programs treat dosage as a continuous variable.)
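One possible answer to these questions, offered only as a sketch for discussion: return a per-allele count array of shape (variants, samples, alleles), which covers biallelic and multiallelic variants uniformly, and report missingness through a separate boolean mask so the counts can stay integer. The name `allele_dosage`, the `n_alleles` and `dtype` parameters, and the negative-means-missing convention are all assumptions.

```python
import numpy as np


def allele_dosage(gt, n_alleles, dtype=np.int8):
    """Per-allele dosage for multiallelic calls (hypothetical helper).

    Assumed input: integer array of shape (variants, samples, ploidy),
    with negative values marking missing alleles.
    Returns (counts, missing):
      counts  - shape (variants, samples, n_alleles), copies of each allele
      missing - shape (variants, samples), True if any allele is missing
    """
    gt = np.asarray(gt)
    alleles = np.arange(n_alleles)
    # Broadcast to (variants, samples, ploidy, n_alleles), then sum over ploidy.
    counts = (gt[..., np.newaxis] == alleles).sum(axis=2).astype(dtype)
    # Missing alleles match no allele index, so they add to no count.
    missing = (gt < 0).any(axis=-1)
    return counts, missing


gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [-1, -1], [0, 2]],
])
counts, missing = allele_dosage(gt, n_alleles=3)
```

Because a missing allele contributes to no count, nothing is silently attributed to the reference allele; callers can use the mask to impute or drop those calls. A float variant could instead set the masked entries to NaN. Note that the broadcast creates a temporary with an extra ploidy axis, so a chunked or compiled implementation may be preferable for large arrays.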
Some related functions (not necessarily what we want to copy, but for reference):
- scikit-allel has to_n_ref, to_n_alt, to_allele_counts
- skallel prototype has genotypes_3d_to_allele_counts, genotypes_3d_to_allele_counts_melt, genotypes_3d_to_major_allele_counts