Statistical genetics section #78
@jdstamp I managed to gather the HAPNEST example data PLINK files and convert them into a single Zarr store on GCS using the code at hammer/sgkitpub#1. I can share this bucket with you if you email me a Google account you'd like to use.
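(For context, a minimal sketch of that kind of PLINK-to-Zarr conversion with sgkit; the file prefixes and the GCS target path are placeholders, and the actual code lives in hammer/sgkitpub#1.)

import xarray as xr
import sgkit as sg
from sgkit.io.plink import read_plink  # requires the sgkit[plink] extra

# Read each per-chromosome PLINK fileset lazily (prefixes are hypothetical).
datasets = [read_plink(path=f"hapnest/example_chr{c}") for c in range(1, 23)]

# Concatenate along the variants dimension; sample-wide variables are taken
# from the first dataset ("minimal" avoids broadcasting them over variants).
ds = xr.concat(datasets, dim="variants", data_vars="minimal")

# Write one consolidated Zarr store (a gs:// URL works if gcsfs is installed).
sg.save_dataset(ds, "gs://my-bucket/hapnest_example.zarr")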
Hi Jeff,
I was traveling, so I had limited access to the internet. I am gonna spend the next week focused on moving the stat gen section forward.
My Brown University Gmail account is the best for this: ***@***.***.
Thank you!
Best,
Julian
Hey @jdstamp, sounds great! GitHub has obfuscated your email address in this comment, so please email it to me directly. Note that on our call yesterday we decided a better way to proceed here is to download chr20 of the full HAPNEST dataset and see if we can get these associations to run at biobank scale.
Yep - the key question we want to answer is "can the technologies sgkit is built on do the kind of operations we need for GWAS at the million-sample scale?" We want to illustrate the kind of thing sgkit can do to demonstrate potential, rather than try to exhaustively demonstrate its current set of features.
A few updates from my initial comment:
Also, given this focus on scalability I would like to warm up the caches of @tomwhite and @eric-czech. Over the next week or two, if y'all have any time to remind yourselves of the work done at https://github.com/pystatgen/sgkit/issues/390 and https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217, it would be really valuable to have a sanity check on whatever we end up writing up here. Understanding scalability bottlenecks can be hard and your expert input will be much appreciated!
Tom's summary at https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748 seems particularly useful for understanding how we got to scale for this workload.
Forgot about Rafal's experiments at https://github.com/pystatgen/sgkit/issues/437.
Update on working with
Dask really doesn't seem to like our large file X-to-Zarr conversion patterns. I wonder if it would be simpler to build our own concurrent.futures-based approach? You can get a long way on a single node, and file conversion really needs to be robust...
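(Not something that was tried in this thread, but a sketch of what a concurrent.futures version could look like on a single node; the paths, chromosome list, and output layout are all placeholders. One process per PLINK fileset, each writing its own Zarr store, with no distributed Dask graph in the picture.)

from concurrent.futures import ProcessPoolExecutor, as_completed

import sgkit as sg
from sgkit.io.plink import read_plink

def convert_one(chrom: int) -> str:
    # Convert a single per-chromosome PLINK fileset to its own Zarr store.
    ds = read_plink(path=f"hapnest/chr{chrom}")   # hypothetical prefix
    out = f"converted/chr{chrom}.zarr"            # hypothetical target
    sg.save_dataset(ds, out)
    return out

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(convert_one, c): c for c in range(1, 23)}
        for fut in as_completed(futures):
            chrom = futures[fut]
            try:
                print(f"chr{chrom}: wrote {fut.result()}")
            except Exception as exc:
                # Failures stay visible so individual chromosomes can be retried.
                print(f"chr{chrom}: failed with {exc!r}")

Each worker process still uses Dask's default threaded scheduler internally for the write, so robustness comes down to retrying whole chromosomes rather than debugging one big distributed graph.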
Alternatively, we could try running the GWAS without converting to Zarr - in principle we should be able to work directly from the PLINK files.
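(A sketch of that direct path; the chr20 prefix and the phenotype/covariate variable names are assumptions, and the phenotypes would still need to be attached to the dataset as sample variables first.)

import sgkit as sg
from sgkit.io.plink import read_plink

# Lazily open the PLINK fileset; no Zarr conversion step.
ds = read_plink(path="hapnest/chr20")

# Alternate-allele dosage from the genotype calls.
ds["call_dosage"] = ds["call_genotype"].sum(dim="ploidy")

# Assumes phenotype/covariate columns have already been merged in as
# sample variables with these (hypothetical) names.
ds = sg.gwas_linear_regression(
    ds,
    dosage="call_dosage",
    covariates=["covariate_0"],
    traits=["phenotype_0"],
)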
Yes, but I fear we'd run out of resources trying to do the GWAS on the instance types I can access. Also, I'd like to do the conversion so that I can scale out. If I can't get access to the high-memory instances on GCP or AWS, I'm a bit stuck. I did not realize they would just deny quota requests! I fear I'm going to have to talk to a salesperson somehow. I do have a suspicion that we could hard-wire a saner task graph with distributed primitives than the one Dask is producing right now. One challenge for me is the implicit way the Dask task graph is constructed: it looks nice in code, but it's all a bit magical and hard to debug. I don't really understand the BED format well enough to understand what our [...]
Yeah, Dask is awesome when it works - pretty mysterious when it doesn't, though.
Well, after all that I managed to get an [...]. The only wrinkle is that my Dask dashboard didn't load, as our installation didn't pick up the [...]. I'll proceed with the association test now and see what happens.
Working through the full GWAS tutorial:
I need to hit the gym now but will hopefully have some time this afternoon to keep digging.
Okay, trying the full GWAS tutorial again on a Coiled cluster, using the following:
# Environment-variable workarounds, set locally in the client process:
import os
os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"

# ...and pushed to every Dask worker and to the scheduler:
def set_env():
    import os
    os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
    os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"

client.run(set_env)
client.run_on_scheduler(set_env)

ds.compute()

# Alternatively, bake the same variables into the Coiled cluster so workers
# start with them already set:
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    worker_memory="64 GiB",
    show_widget=False,
    environ={"SGKIT_DISABLE_NUMBA_CACHE": "1", "NUMBA_CACHE_DIR": "/tmp"},
)
client = cluster.get_client()
Back at it today! I managed to get my quotas up enough to launch a real monster of a machine, a [...]. I was able to get to the [...]. A few thoughts:
Just knowing whether the association test is remotely feasible would be a great data point right now.
I went back to the r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM and tried to run a minimal path to get to the linear regression and use the results to make a Manhattan plot (which uses the [...]). After this completed, I tried to examine another output of the linear regression, [...]. So, it seems the answer is that it is feasible, but given Dask's lazy execution model I need to spend some time understanding how to isolate and profile the work done just for the linear regression.
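(For reference, the plotting step is roughly the following sketch, assuming ds already holds the regression output for a single trait.)

import numpy as np
import matplotlib.pyplot as plt

# Materializing the p-values is what actually triggers the regression graph.
pvals = ds["variant_linreg_p_value"].squeeze().compute()
neglog_p = -np.log10(pvals.values)

plt.figure(figsize=(12, 3))
plt.scatter(np.arange(neglog_p.size), neglog_p, s=2, c=ds["variant_contig"].values)
plt.xlabel("variant index")
plt.ylabel("-log10(p)")
plt.savefig("manhattan.png", dpi=150)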
I think a good goal here would be to try to reproduce the Manhattan plots in Fig. S11 of the HAPNEST paper. They did this with only 50K samples, though. It's not totally obvious that they are using the same phenotypes here as they have in the released dataset, but we should see qualitatively similar patterns based on the heritability and polygenicity parameters. I don't think there's much to learn from running QC steps here, as it's synthetic data; what we want to see/show is that we can get Manhattan plots out. Other QC steps are hopefully being illustrated by other parts of the paper. If merging the datasets across the chromosomes is a pain, I think it's fine to work chromosome by chromosome.
@jeromekelleher do you know anyone from the HAPNEST paper who we can ask to sanity check our findings given their generative model? |
I know four of the authors, so yes, happy to ping them |
As far as the UKB replication goes, here is my recollection of what happened, which would be worth keeping in mind:
Overall, I left this work with the impression that chunking in the samples dimension might be a nonstarter for really scaling GWAS to a UKB-sized dataset. I would recommend trying that, @hammer. I'm not sure what to say about the cluster issues you're hitting, though... I don't recognize any of those. I hit my fair share of similar problems with https://github.com/dask/dask-cloudprovider. To some of your other issues, @hammer:
That makes sense. It shouldn't compute anything until you reference one of the resulting variables.
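(In other words, something like this sketch; the output variable names follow sgkit's conventions, and the trait/covariate names are placeholders.)

import sgkit as sg

# Building the regression is cheap: this returns immediately with lazy,
# Dask-backed output variables added to the dataset.
ds_lr = sg.gwas_linear_regression(
    ds, dosage="call_dosage", covariates=["covariate_0"], traits=["phenotype_0"]
)

# Only now does any real work happen:
pvals = ds_lr["variant_linreg_p_value"].compute()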
It's rerunning the whole regression. The best way to start working with the results, IIRC, is [...]. There are more examples of this in that UKB notebook I mentioned, like this: [...]. That progress bar is quite helpful too. You might find some other useful examples like that in there.
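(My guess at the elided suggestion, as a sketch: materialize every output you care about in one pass, so the regression graph only runs once.)

from dask.distributed import progress

# Select all the regression outputs and persist them in one go (names follow
# sgkit's output conventions; this runs the graph a single time on the cluster).
out = ds_lr[["variant_linreg_p_value", "variant_linreg_beta", "variant_linreg_t_value"]]
out = out.persist()
progress(out)  # live progress bar under the distributed scheduler

# Subsequent access is served from worker memory, with no recomputation.
betas = out["variant_linreg_beta"].values
pvals = out["variant_linreg_p_value"].values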
Here's an overview of the work I'd like to do with @jdstamp to get this section together. I started writing it up as an email but figured I'd post it here for visibility.
I think a first milestone could be getting the HAPNEST notebook working locally. Then you can try to run the GWAS regression with all phenotypes and examine its output to be sure it's sensible. Once that's sorted, you can run and validate REGENIE and GENE-ε in whatever order seems best to you.
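(For the all-phenotypes run, the regression accepts a list of trait variables, so it should be a single call along these lines; the trait and covariate names below are placeholders for whatever the HAPNEST phenotype columns end up being called.)

import sgkit as sg

n_phenotypes = 10  # placeholder for however many traits HAPNEST provides
traits = [f"phenotype_{i}" for i in range(n_phenotypes)]

ds_lr = sg.gwas_linear_regression(
    ds,
    dosage="call_dosage",
    covariates=["covariate_0"],
    traits=traits,
)
# variant_linreg_p_value then has shape (variants, traits), one column per phenotype.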
If all of the above is accomplished and you still want to keep going, we can consider several additional directions of work.