-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement G123 selection statistic #980
Comments
Thanks @sanjaynagi, just to support that having an implementation of G123 would be very useful. I imagine that the implementation would be very similar to H12, except that we need to calculate diplotype frequencies instead of haplotype frequencies. For H12, computing of haplotype frequencies is achieved via hashing of haplotypes, and so we'd need something similar for diplotypes. Noting that a diplotype is a sequencing if diploid genotypes from a single individual, and so is an array of shape (n_variants, 2). I imagine there should be a simple way to do this, e.g., flatten the diplotype array then apply the same hash function as used for haplotypes. |
Also just to note that whatever approach is taken it should be able to handle multiallelic variants, i.e., variants where the allele indices are not just 0 or 1. |
This would be great @sanjaynagi! Linking to some prior discussion on the topic: https://github.com/pystatgen/sgkit/discussions/711 |
Thanks @alimanfoo , @hammer ! I've made the Garuds G statistics implementation now, but before I make a PR, what should I do in regard to testing? I see the results from other functions in the sgkit.popgen module are compared to either scikit-allel or tskit implementations. But of course, there are no G123 implementations in either library. The only thing I was going to do was fix the seed for the simulated data, and ensure the values returned are the same as I'm getting now, but I don't know if that is considered sufficient. Thank you. |
@sanjaynagi is there a G123 implementation in any library? We compare against R libraries sometimes... |
There is this actually, a python script which can calculate it https://github.com/ngarud/SelectionHapStats. It's quite verbose so I assume we can't include the actual method itself - could you please point me to an example of how the R library evaluations work? I assume we don't run actual R code within pytest? Thanks in advance @hammer |
Thanks for opening this @sanjaynagi!
That sounds good to me.
One way we do it is to run the R code manually, and save the output as a data file like CSV, which is then used in the test. The test documents how the R code was run so it can be reproduced if necessary. Here's an example: https://github.com/pystatgen/sgkit/blob/0a51f902d4ba8ccb794e24ef756af23ab4c28ac7/sgkit/tests/test_grm.py |
Thanks @tomwhite, that's awesome. Will open a PR soon. |
Suggestion for myself to implement the G123 statistic, which is analogous to H12 but works with unphased diplotypes instead of haplotypes.
In my experience, its a very powerful selection statistic for unphased data, and the loss in sensitivity compared to H12 seems to be negligible (at least for strong selection signals as we see in Anopheles gambiae).
I will essentially mimic the H12 implementation, which @alimanfoo tells me was done by @tomwhite. I'm not yet familiar with the Sgkit API/codebase so will let you know if I have any problems :)
The text was updated successfully, but these errors were encountered: