Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement G123 selection statistic #980

Open
sanjaynagi opened this issue Dec 20, 2022 · 8 comments
Open

Implement G123 selection statistic #980

sanjaynagi opened this issue Dec 20, 2022 · 8 comments

Comments

@sanjaynagi
Copy link

Suggestion for myself to implement the G123 statistic, which is analogous to H12 but works with unphased diplotypes instead of haplotypes.

In my experience, its a very powerful selection statistic for unphased data, and the loss in sensitivity compared to H12 seems to be negligible (at least for strong selection signals as we see in Anopheles gambiae).

I will essentially mimic the H12 implementation, which @alimanfoo tells me was done by @tomwhite. I'm not yet familiar with the Sgkit API/codebase so will let you know if I have any problems :)

@alimanfoo
Copy link
Collaborator

Thanks @sanjaynagi, just to support that having an implementation of G123 would be very useful.

I imagine that the implementation would be very similar to H12, except that we need to calculate diplotype frequencies instead of haplotype frequencies. For H12, computing of haplotype frequencies is achieved via hashing of haplotypes, and so we'd need something similar for diplotypes. Noting that a diplotype is a sequencing if diploid genotypes from a single individual, and so is an array of shape (n_variants, 2). I imagine there should be a simple way to do this, e.g., flatten the diplotype array then apply the same hash function as used for haplotypes.

@alimanfoo
Copy link
Collaborator

Also just to note that whatever approach is taken it should be able to handle multiallelic variants, i.e., variants where the allele indices are not just 0 or 1.

@hammer
Copy link
Contributor

hammer commented Dec 20, 2022

This would be great @sanjaynagi! Linking to some prior discussion on the topic: https://github.com/pystatgen/sgkit/discussions/711

@sanjaynagi
Copy link
Author

sanjaynagi commented Dec 21, 2022

Thanks @alimanfoo , @hammer !

I've made the Garuds G statistics implementation now, but before I make a PR, what should I do in regard to testing?

I see the results from other functions in the sgkit.popgen module are compared to either scikit-allel or tskit implementations. But of course, there are no G123 implementations in either library.

The only thing I was going to do was fix the seed for the simulated data, and ensure the values returned are the same as I'm getting now, but I don't know if that is considered sufficient. Thank you.

@hammer
Copy link
Contributor

hammer commented Dec 21, 2022

@sanjaynagi is there a G123 implementation in any library? We compare against R libraries sometimes...

@sanjaynagi
Copy link
Author

sanjaynagi commented Dec 22, 2022

There is this actually, a python script which can calculate it https://github.com/ngarud/SelectionHapStats.

It's quite verbose so I assume we can't include the actual method itself - could you please point me to an example of how the R library evaluations work? I assume we don't run actual R code within pytest?

Thanks in advance @hammer

@tomwhite
Copy link
Collaborator

tomwhite commented Jan 3, 2023

Thanks for opening this @sanjaynagi!

The only thing I was going to do was fix the seed for the simulated data, and ensure the values returned are the same as I'm getting now, but I don't know if that is considered sufficient. Thank you.

That sounds good to me.

It's quite verbose so I assume we can't include the actual method itself - could you please point me to an example of how the R library evaluations work? I assume we don't run actual R code within pytest?

One way we do it is to run the R code manually, and save the output as a data file like CSV, which is then used in the test. The test documents how the R code was run so it can be reproduced if necessary.

Here's an example: https://github.com/pystatgen/sgkit/blob/0a51f902d4ba8ccb794e24ef756af23ab4c28ac7/sgkit/tests/test_grm.py

@sanjaynagi
Copy link
Author

Thanks @tomwhite, that's awesome. Will open a PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants