Use HAPNEST data for GWAS demo? #43
Comments
Also includes phenotypes, btw.
The advantages of a fully reproducible analysis pipeline to go along with the paper seem compelling to me. Working with something like UKB inevitably introduces friction. This synthetic dataset has been carefully curated for realism, and I'm not sure what extra we'd be showing by working with actual data. There's a neatness to demonstrating that we can work with two different synthetic datasets at the 1 million sample scale, through both VCF and plink. If we make it a requirement that all of the things that go into the paper are fully reproducible (which chimes well with the overall philosophy of openness), and we want to do something at the largest scale, then this seems like a great way to go.
I will have a look this week! I've been using GitHub Codespaces so far for my explorations and will need to think about how to run the scaling experiments. We hit some scalability issues last time we tried to do a GWAS at the UKB scale (https://github.com/pystatgen/sgkit/issues/390), so I may also need to get some help resolving those issues. A quick look at the S-BSST936 listing shows the
@jeromekelleher forgive my ignorance, but do we have a VCF synthetic dataset at this scale as well?
There are already some places to look for this data on cloud storage.
Yep - our data/basic compute task scaling figure goes up to a million samples, taken as subsets of the 1.4M samples in the simulations provided in this paper. (Note: @benjeffery and I are planning to add another line for the SAV file format/C++ toolkit here. Fig is also quite drafty, obvs.)
Okay, figured out their FTP structure; everything is under … For my reference, I'm using a command like:
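(The command itself was lost from the page. As a stand-in, here is a minimal sketch of the kind of fetch involved, assuming the standard EBI BioStudies FTP layout for S-BSST936; the directory path and the file name are assumptions, not taken from the thread.)

```python
# Hypothetical reconstruction of a per-file download from EBI BioStudies.
# The S-BSST936 directory layout and file name below are assumptions.
import urllib.request

BASE = "https://ftp.ebi.ac.uk/biostudies/fire/S-BSST/936/S-BSST936/Files"
fname = "example.chr21.bed"  # placeholder file name

# Download one file from the listing to the working directory
urllib.request.urlretrieve(f"{BASE}/{fname}", fname)
```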
Transfer speeds are not so bad; I'm seeing around 27 MiB/s, so it will take about 17 minutes for chr21 and probably 2 hours or so for chr1. Will kick off a big transfer tomorrow.
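(As a quick sanity check on those numbers, my arithmetic rather than anything stated in the thread: the quoted rate and times imply file sizes of roughly 27 GiB for chr21 and 190 GiB for chr1.)

```python
# Back-of-the-envelope file sizes implied by the quoted transfer
# rate and times (size = rate * time).
rate_mib_s = 27
chr21_gib = rate_mib_s * 17 * 60 / 1024   # ~27 GiB in 17 minutes
chr1_gib = rate_mib_s * 2 * 3600 / 1024   # ~190 GiB in ~2 hours
print(f"chr21 ≈ {chr21_gib:.0f} GiB, chr1 ≈ {chr1_gib:.0f} GiB")
```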
Okay, I've gotten our GWAS demo running using one chromosome and one phenotype of the example (600-subject) data. The notebook is at https://github.com/hammer/sgkitpub/blob/main/hapnest_gwas.ipynb. Some thoughts:
I will next try to scale to all chromosomes and all phenotypes on the example data, then go to the big dataset.
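(For readers who don't want to open the notebook, here is a minimal sketch of the sgkit calls such a demo typically involves; the plink prefix, phenotype file, and column names are placeholders I've assumed, not taken from the notebook itself.)

```python
# Minimal sketch of a plink-based GWAS in sgkit, assuming a HAPNEST-style
# plink fileset plus a separate phenotype table. File and column names
# are hypothetical placeholders.
import pandas as pd
import sgkit as sg
from sgkit.io.plink import read_plink  # requires the sgkit[plink] extra

# Reads example.chr21.{bed,bim,fam} into an xarray Dataset
ds = read_plink(path="example.chr21")

# Alternate-allele dosage for a biallelic, diploid dataset
ds["call_dosage"] = ds["call_genotype"].sum(dim="ploidy")

# Attach one phenotype as a per-sample variable (placeholder file/column)
pheno = pd.read_csv("example.pheno1.csv")
ds["trait_1"] = ("samples", pheno["pheno_1"].values)

# Intercept-only linear-regression GWAS on that single trait
ds = sg.gwas_linear_regression(
    ds, dosage="call_dosage", covariates=[], traits=["trait_1"]
)
print(ds["variant_linreg_p_value"].values[:10])
```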
Just noting for myself that tools from other language ecosystems that might be fun to try out in this section would be https://github.com/privefl/bigsnpr (GWAS docs) and https://github.com/OpenMendel/MendelGWAS.jl
The 1 million sample HAPNEST dataset (https://github.com/pystatgen/sgkit/discussions/1144#discussioncomment-7654640) seems ideal for our purposes.
Larger than UKB, and no messing with data access problems. It also lets us showcase our plink format support.
Any thoughts @hammer ?