To run a basic GWAS on UKB data, here are some of the operations we'll need support for:
- BGEN reader ("BGEN reader implementation using bgen_reader", sgkit-bgen#1)
- PLINK reader ("Pysnptools reader implementation", sgkit-plink#1; "Suboptimal parallelism", sgkit-plink#6)
- Variant allele frequency/count (https://github.com/pystatgen/sgkit/issues/29)
- Variant call rate/count (https://github.com/pystatgen/sgkit/issues/29)
- Variant HWE test (https://github.com/pystatgen/sgkit/issues/28)
- Sample call rate/count (https://github.com/pystatgen/sgkit/issues/29)
- An `is_autosome` function to filter variants
- A function to convert genotype probabilities to hard calls (https://github.com/pystatgen/sgkit/issues/346)
- A linear regression function (https://github.com/pystatgen/sgkit/pull/52)
- A variant annotation function like VEP. There are plenty of other ways to get this, but an internal function would be great.
- A phenotype normalization pipeline. I don't expect much of this to become part of sgkit, but there might be some generalizable phenotype-specific functions that are worth considering for inclusion.
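
To make the call rate and allele frequency items above concrete, here is a minimal NumPy sketch. It is not sgkit's API: the `call_stats` name, the `(variants, samples, ploidy)` layout, the `-1` missing-value encoding, and the biallelic assumption are all mine, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy call array with shape (variants, samples, ploidy).
# Assumed encoding: 0 = reference allele, 1 = alternate allele, -1 = missing.
calls = np.array([
    [[0, 0], [0, 1], [1, 1], [-1, -1]],
    [[0, 1], [0, 1], [0, 0], [0, 0]],
    [[1, 1], [-1, -1], [-1, -1], [1, 1]],
])


def call_stats(calls):
    """Return variant call rate, sample call rate, and alt allele frequency."""
    # A genotype counts as called only if every allele in it is non-missing.
    called = (calls >= 0).all(axis=-1)            # (variants, samples)
    variant_call_rate = called.mean(axis=1)       # fraction of samples called
    sample_call_rate = called.mean(axis=0)        # fraction of variants called

    # Count alternate alleles over called genotypes only (biallelic assumption).
    alt_count = ((calls == 1) & called[..., None]).sum(axis=(1, 2))
    allele_total = called.sum(axis=1) * calls.shape[-1]

    # Leave the frequency as NaN where a variant has no called genotypes.
    alt_freq = np.full(calls.shape[0], np.nan)
    np.divide(alt_count, allele_total, out=alt_freq, where=allele_total > 0)
    return variant_call_rate, sample_call_rate, alt_freq


variant_call_rate, sample_call_rate, alt_freq = call_stats(calls)
```

The same reductions should carry over to Dask-backed arrays more or less unchanged, which is part of why these look like good candidates for small shared functions.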
There may be a few more beyond that, but I think anything remaining should be reasonable with Xarray/Dask alone.
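
For the linear regression item in the list, a rough sketch of a per-variant association scan in plain NumPy is below. This is only an illustration under simplifying assumptions (one continuous phenotype, an intercept but no other covariates, no missing dosages, no monomorphic variants), not the implementation in the linked PR. The function name `simple_linear_scan` and the simulated data are hypothetical.

```python
import numpy as np


def simple_linear_scan(dosage, phenotype):
    """Fit phenotype ~ intercept + dosage separately for each variant.

    dosage: float array of shape (variants, samples)
    phenotype: float array of shape (samples,)
    Returns per-variant slope estimates and their t-statistics.
    """
    n_samples = dosage.shape[1]
    # Centering both sides absorbs the intercept term.
    x = dosage - dosage.mean(axis=1, keepdims=True)
    y = phenotype - phenotype.mean()

    sxx = (x * x).sum(axis=1)   # per-variant sum of squares of dosage
    sxy = x @ y                 # per-variant cross product with phenotype
    beta = sxy / sxx            # OLS slope (undefined if a variant is monomorphic)

    # Residual sum of squares, then the standard error of the slope.
    rss = y @ y - beta * sxy
    se = np.sqrt(rss / (n_samples - 2) / sxx)
    return beta, beta / se


# Simulated example: only the first variant has a true effect.
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=(5, 50)).astype(float)
phenotype = 0.5 * dosage[0] + rng.normal(size=50)
beta, t_stat = simple_linear_scan(dosage, phenotype)
```

Converting the t-statistics to p-values needs a Student's t distribution with `n_samples - 2` degrees of freedom, which is left out here to keep the sketch dependency-free.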