
Statistical genetics section #78

Open
hammer opened this issue Jan 3, 2024 · 23 comments

Comments

@hammer

hammer commented Jan 3, 2024

Here's an overview of the work I'd like to do with @jdstamp to get this section together. I started writing it up as an email but figured I'd post it here for visibility.

  • High-level goal: enumerate and evaluate the correctness and performance of methods in sgkit useful for statistical genetics researchers, then write up a few paragraphs in the paper describing these methods.
  • The paper is on GitHub at https://github.com/pystatgen/sgkit-publication. I've put it up on Overleaf as well if you'd like to have a look. It's in a bit of a rough state right now, but it's worth a skim to see how the statistical genetics section fits into the larger narrative.
  • We essentially have 3 statistical genetics methods, listed under the Genetic Association and Regression section of the docs: gwas_linear_regression (tests), regenie (tests), and genee (tests). The first of these is simple, so I'd start there.
  • To evaluate these methods, we'd like to use the HAPNEST data, as discussed at Use HAPNEST data for gwas demo? #43. You can download this data from BioStudies study S-BSST936. Start with the example data which is restricted to 600 individuals.
  • I put a Jupyter notebook on GitHub with some Python code to download the HAPNEST data and run a GWAS on it. I modified the code from the sgkit GWAS Tutorial.
  • We also have some Docker environments and code at https://github.com/pystatgen/sgkit/tree/main/validation/gwas to compare REGENIE results to other implementations. I don't know much about what's in there but @eric-czech might be able to answer questions if you look it over and think it could be useful.
  • If you need compute or storage resources to do this work, please let me know and I can add you to the Google Cloud project I've been using.

I think a first milestone could be getting the HAPNEST notebook working locally. Then you can try to run the GWAS regression with all phenotypes and examine its output to be sure it's sensible. Once that's sorted, you can run and validate REGENIE and GENE-ε in whatever order seems best to you.
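For orientation, gwas_linear_regression conceptually fits an ordinary least squares model per variant after projecting the covariates out of both the dosages and the trait. Here is a minimal NumPy sketch of that computation — my own hypothetical standalone code, not sgkit's implementation (sgkit does this blockwise over Dask arrays):

```python
import numpy as np

def linear_regression_gwas(dosage, covariates, trait):
    """Per-variant OLS betas and t-statistics.

    Hypothetical sketch of the computation, not sgkit code.
    dosage:     (n_variants, n_samples) genotype dosages
    covariates: (n_samples, n_covariates), including an intercept column
    trait:      (n_samples,) phenotype
    """
    # Project the covariates out of both genotypes and trait (residualize)
    Q, _ = np.linalg.qr(covariates)
    G = dosage - (dosage @ Q) @ Q.T        # residualized genotypes
    y = trait - Q @ (Q.T @ trait)          # residualized trait
    # Simple regression of residualized trait on each residualized variant
    gg = np.einsum("vs,vs->v", G, G)       # per-variant sum of squares
    beta = (G @ y) / gg
    # Residual degrees of freedom: samples minus covariates minus the variant
    dof = dosage.shape[1] - covariates.shape[1] - 1
    resid_ss = (y @ y) - beta**2 * gg
    t_stat = beta / np.sqrt(resid_ss / (dof * gg))
    return beta, t_stat
```

On simulated data with one causal variant, the causal beta should be recovered and its t-statistic should stand out from the null variants.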

If all of the above is accomplished and you still want to keep going, we can consider several additional directions of work.

@hammer
Author

hammer commented Jan 6, 2024

@jdstamp I managed to gather the HAPNEST example data PLINK files and convert them into a single Zarr store on GCS using the code at hammer/sgkitpub#1. I can share this bucket with you if you email me a Google account you'd like to use.

@jdstamp

jdstamp commented Jan 9, 2024 via email

@hammer
Author

hammer commented Jan 9, 2024

Hey @jdstamp,

Sounds great!

GitHub has obfuscated your email address in this comment, please email it to me directly.

Note that in our call yesterday we decided a better way to proceed here is to download chr20 for the full HAPNEST dataset and see if we can get these associations to run at biobank scale.

@jeromekelleher
Collaborator

Yep - the key question we want to answer is "can the technologies sgkit is built on do the kind of operations we need for GWAS at the million-sample scale?" We want to illustrate the kind of thing sgkit can do to demonstrate potential, rather than try to exhaustively demonstrate its current set of features.

@hammer
Author

hammer commented Jan 9, 2024

@jdstamp

A few updates from my initial comment:

  • Jerome has put the manuscript on Overleaf at https://www.overleaf.com/project/659c19afc265913df38df1ab
  • The starting point for your work remains the same: I think it would be valuable to ensure you can run all 3 association tests on the example data and that they produce the expected output. This will ensure we're shipping working code to users.
  • The primary change is that for writing up the section we'd like to focus on scaling a single method. So, the bullet for "Try to scale these methods to the full HAPNEST data" is no longer a nice to have, it's a must. Scaling up gwas_linear_regression on chr20 should be sufficient.

@hammer
Author

hammer commented Jan 9, 2024

Also, given this focus on scalability, I would like to warm up the cache of @tomwhite and @eric-czech. Over the next week or two, if y'all have any time and are able to remind yourselves of the work done at https://github.com/pystatgen/sgkit/issues/390 and https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217, it would be really valuable to have a sanity check on whatever we end up writing up here. Understanding scalability bottlenecks can be hard and your expert input will be much appreciated!

@hammer
Author

hammer commented Jan 9, 2024

Tom's summary at https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748 seems particularly useful for understanding how we got this workload to scale.

@hammer
Author

hammer commented Jan 9, 2024

Forgot about Rafal's experiments at https://github.com/pystatgen/sgkit/issues/437

@hammer
Author

hammer commented Jan 11, 2024

Update on working with chr20 from the full HAPNEST data:

  • The synthetic_v1_chr-20.bed file is 36 GB
  • It took about 30 minutes to pull it off FTP onto a GitHub Codespaces VM
  • It took only 3 minutes to write from the VM to GCS
  • I am using the largest Codespaces VM: 16-core, 64GB RAM, 128GB storage
  • I kicked off the PLINK to Zarr conversion with 8 workers. This generated 2,356 bed_reader::read_plink tasks, which seems high?
  • Dask immediately hit me with this warning: UserWarning: Sending large graph of size 9.55 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.
  • I'm now 15 minutes into the job and 188 of the read_plink tasks have completed, so if it continues at this pace I guess it will take 3 hours ish? I regret not running this as a background task...
  • About 20 minutes into the job I lost 2 workers? RIP guys
  • A few minutes later 1 more worker has died and I have a new warning: WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS
  • Killed it after 30 minutes and only 4 workers remaining. Moving backwards on tasks completed was not fun to watch.
  • Trying again with a 10 node Coiled cluster...
  • Of course we can't read from the cloud so need to find some way to mount a filesystem on the workers blerg
  • Decided to just grab a high memory instance from GCP
  • Quota increase request denied on GCP for M2 and M3 instances...
  • AWS also won't let me launch high-memory instances. Calling it a night.

@jeromekelleher
Collaborator

Dask really doesn't seem to like our large-file X-to-Zarr conversion patterns - I wonder if it would be simpler to make our own concurrent.futures-based approach? You can get a long way on a single node. File conversion really needs to be robust...
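A single-node skeleton of this kind of concurrent.futures approach might look like the following: fixed-size chunks, a bounded worker pool, and per-chunk retries. Everything here is hypothetical scaffolding — it copies raw byte ranges into per-chunk files as a stand-in for Zarr chunks; a real converter would decode PLINK records and write through the zarr API instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def convert_chunked(src, out_dir, chunk_bytes, max_retries=3, workers=4):
    """Chunkwise file conversion sketch using concurrent.futures.

    Hypothetical: writes each byte range of `src` to its own output file
    (standing in for one Zarr chunk), retrying transient I/O failures.
    """
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    offsets = range(0, src.stat().st_size, chunk_bytes)

    def do_chunk(i, off):
        for attempt in range(max_retries):
            try:
                with src.open("rb") as f:
                    f.seek(off)
                    data = f.read(chunk_bytes)
                (out_dir / f"chunk_{i:05d}").write_bytes(data)
                return i
            except OSError:
                if attempt == max_retries - 1:
                    raise

    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(do_chunk, i, off) for i, off in enumerate(offsets)]
        for fut in as_completed(futures):
            fut.result()  # surface any failure immediately
    return len(futures)
```

The appeal over a Dask graph is that the failure domain is one chunk: a retry re-reads one byte range rather than rewinding a distributed task graph.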

@jeromekelleher
Collaborator

Alternatively, we could try running the GWAS without converting to Zarr - in principle we should be able to work directly from the plink files.

@hammer
Author

hammer commented Jan 11, 2024

Yes but I fear we'd run out of resources trying to do the GWAS on the instance types I can access. Also, I'd like to do the conversion so that I can scale out. If I can't get access to the high memory instances on GCP or AWS, I'm a bit stuck. I did not realize they would just deny quota requests! I fear I'm going to have to talk to a salesperson somehow.

I do have a suspicion that we could hard-wire a saner task graph with distributed primitives than the one Dask is producing right now. One challenge for me is the implicit nature of how the Dask task graph is constructed. It looks nice in code but it's all a bit magical and hard to debug. I don't really understand the BED format well enough to understand what our bed_reader code is doing to generate 2,356 separate tasks. If I have time I may start trying to page in the bed file format and Dask task graph construction but it's not really what I care to be doing to be honest...

@jeromekelleher
Collaborator

Yeah, Dask is awesome when it works - pretty mysterious when it doesn't though.

@hammer
Author

hammer commented Jan 11, 2024

Well after all that I managed to get an r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM without needing a quota upgrade and it did the conversion from PLINK to Zarr for chr20 in 12 minutes! At an hourly rate of $2.1168/hr, that's roughly $0.42.

The only wrinkle is that my Dask dashboard didn't load as our installation didn't pick up the bokeh package which is apparently necessary for the dashboard.

I'll proceed with the association test now and see what happens.

@hammer
Author

hammer commented Jan 11, 2024

Working through the full GWAS tutorial:

  • xr.plot.hist runs for a long time and throws some timeouts so I killed it and kept moving.
  • hardy_weinberg_test died with KilledWorker: Attempted to run task ('arange-genotype_as_bytes-index_as_genotype-astype-0d9ea21cc0c15c028899cea1fa9b05c4', 0) on 4 different workers, but all those workers died while running it.
  • gwas_linear_regression issued some warnings (PerformanceWarning: Increasing number of chunks) and returned right away, so I think it's not getting computed until the manhattan_plot call later? That call gave the familiar warning UserWarning: Sending large graph of size 22.80 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.

I need to hit the gym now but will hopefully have some time this afternoon to keep digging.

@hammer
Author

hammer commented Jan 11, 2024

Okay trying full GWAS tutorial again on a Coiled cluster and using compute() to force some work to happen:

  • ds = sg.sample_stats(ds) --> ds.compute() gives the usual warning UserWarning: Sending large graph of size 13.13 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures. but then when it runs I get RuntimeError: cannot cache function 'count_hom': no locator available for file '/workspaces/codespaces-jupyter/sgkit/sgkit/stats/aggregation_numba_fns.py' which leads to RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments..
  • I thought the error above might be due to the Numba cache as I've heard y'all complain about it previously. I tried disabling the Numba cache on the launching machine, the scheduler, and the workers. This didn't help. I then saw that others have gotten this error when the NUMBA_CACHE_DIR is not writable, so I tried setting that without any luck. At least I'm learning how to run code on Dask cluster machines?
import os
os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"

def set_env():
    import os
    os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
    os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"

client.run(set_env)

client.run_on_scheduler(set_env)

ds.compute()
  • I'm starting to think .compute() is not the right way to force these computations? I need to wind down for the day tho. Hopefully AWS approves a high(er) memory instance for me soon and I can just scale up as scaling out does not seem to be going well...
  • Got some help from the Coiled team on Slack, now launching a cluster with
import coiled
cluster = coiled.Cluster(
    n_workers=20,
    worker_memory="64 GiB",
    show_widget=False,
    environ={"SGKIT_DISABLE_NUMBA_CACHE": "1", "NUMBA_CACHE_DIR": "/tmp"},
)
client = cluster.get_client()

@hammer
Author

hammer commented Jan 12, 2024

Back at it today! I managed to get my quotas up to launch a real monster of a machine, a c3d-highmem-360, with over 2 TB of RAM.

I was able to get to the sample_stats(ds) call in the GWAS Tutorial without incident. Unfortunately when I tried to examine a data variable computed by this method such as ds.sample_call_rate I got worker communication errors as well as errors from Dask about CPUs spending too much time in garbage collection.

A few thoughts:

  • The communication errors may be due to bad firewall rules? I tried opening up a lot of things but could not resolve them.
  • I'm not sure what could be causing the CPU to spend too much time in garbage collection. The instance clearly had tons of RAM, and each worker had 16 GB. There were a lot of transpose tasks on the Dask dashboard, and I know that's an expensive operation, but I'm not sure what to do next to debug.
  • I realize I'm getting stuck on QC steps rather than just the association test. I can move on to just looking at the association test but it feels a bit concerning that we can't get through the GWAS Tutorial workload by either scaling up or out. It could be user error, of course!

@jeromekelleher
Collaborator

Just knowing whether the association test is remotely feasible would be a great data point right now.

@hammer
Author

hammer commented Jan 12, 2024

I went back to the r7i.8xlarge instance on EC2 with 32 vCPUs and 256 GB RAM and tried to run a minimal path to get to the linear regression and use the results to make a Manhattan plot (which uses the variant_linreg_p_value output of the regression). Remarkably, given my prior failures, this ran to completion in about 20 minutes, so I think the answer is yes? Memory usage peaked around 140 GiB or so, from what I saw. I have no idea how I avoided the UserWarning: Sending large graph of size I got last time, though!

After this completed I tried to examine another output of the linear regression, variant_linreg_beta. This is not a large array, as it only has 153,988 values in it. When I called ds_lr.variant_linreg_beta.values, I was surprised that this too took about 20 minutes to complete. I'm not quite sure what is happening with Dask: does it need to re-run gwas_linear_regression again for some reason? Further, I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.

So, it seems the answer is that it is feasible, but given Dask's lazy execution model I need to spend some time to understand how to isolate and profile the work done just for the linear regression.

@jeromekelleher
Collaborator

I think a good goal here would be to try and reproduce the Manhattan plots in Fig s11 of the HAPNEST paper:

[Screenshot: Manhattan plots from Fig. S11 of the HAPNEST paper]

They did this with only 50K samples, though.

It's not totally obvious that they are using the same phenotypes here as they have in the released dataset, but we should see qualitatively similar patterns based on the heritability and polygenicity parameters.

I don't think there's much to learn from running QC steps here, as it's synthetic data. What we want to see/show is that we can get Manhattan plots out. Other QC steps are hopefully being illustrated by other parts of the paper.

If merging the datasets across the chromosomes is a pain, I think it's fine working chromosome by chromosome.

@hammer
Copy link
Author

hammer commented Jan 15, 2024

@jeromekelleher do you know anyone from the HAPNEST paper who we can ask to sanity check our findings given their generative model?

@jeromekelleher
Collaborator

I know four of the authors, so yes, happy to ping them

@eric-czech
Collaborator

it would be really valuable to have a sanity check on whatever we end up writing up here.

As far as the UKB replication goes, here is my recollection of what happened that would be worth keeping in mind:

  1. I was struggling to get a single-chromosome GWAS to run [1] and then had a small breakthrough after realizing how much faster the workflow was with short-fat chunks [2]
  2. I also regularly hit a number of intermittent issues [3] that forced me to implement some retries in my own code like this, as well as having to rebuild clusters somewhat regularly (maybe 10% of the time)
  3. Then I at least worked out that matmul in dask was the problem [4], since it was scaling linearly (or worse) in memory usage with matrix size while using constant-size chunks
  4. @ravwojdyla fixed this upstream in dask [5]
  5. When @tomwhite and I were working on this together, we noted that not chunking in the samples dimension was pretty key to keeping runtimes sane and getting in the ballpark of cost parity with Hail [6]
  6. @tomwhite summarized most of the things we had to do to keep moving forward with UKB nicely in [7] and [8]
  7. I originally converted the UKB bgen data to Zarr with chunking in both dimensions, and we never rechunked it to remove chunking in samples [9]
  8. Nevertheless, I was still able to get a few chromosomes to run successfully through the full workflow
  9. The main reproduction repo pipeline is quite difficult to set up and run, but there was this notebook (from our internal ukb-analysis repo) that I regularly used to speed up that process -- it's a much simpler demonstration of how to get something somewhat close to the Neale Lab results on a single chromosome, and it sounds a lot like what you're doing @hammer
  10. The p-values I got weren't way off from p-values I was pulling from Open Targets sumstats, but they weren't strikingly similar either, and you can see an example of that in the plot at the very end of the notebook
  11. I believe this issue contains at least a couple ideas I had on where differences might come from

Overall, I left this work with the impression that chunking in the samples dimension might be a nonstarter for really scaling GWAS to a UKB-sized dataset. I would recommend trying without it, @hammer.
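For intuition, the "short-fat" chunking Eric describes means chunking along the variants axis while keeping the samples axis whole, so each task is a complete per-variant reduction with nothing to merge across sample chunks. A hypothetical standalone NumPy sketch of the access pattern (not sgkit or Dask code):

```python
import numpy as np

def per_variant_dot(dosage, trait, variants_per_chunk=1000):
    """Compute dosage @ trait over variant chunks spanning ALL samples.

    Each chunk yields finished results for its variants; no partial sums
    cross chunk boundaries, which is what keeps the task graph and its
    memory footprint sane. Hypothetical sketch for illustration.
    """
    n_variants = dosage.shape[0]
    out = np.empty(n_variants)
    for start in range(0, n_variants, variants_per_chunk):
        stop = min(start + variants_per_chunk, n_variants)
        # one complete (chunk_variants x n_samples) @ (n_samples,) product
        out[start:stop] = dosage[start:stop] @ trait
    return out
```

With samples also chunked, every one of these products becomes a tree of partial matmuls whose intermediates must be held and combined, which is where the memory blowups described above came from.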

I'm not sure what to say on the cluster issues you're hitting, though... I don't recognize any of those. I hit my fair share of similar problems with https://github.com/dask/dask-cloudprovider.

To some of your other issues @hammer:

gwas_linear_regression issued some warnings (PerformanceWarning: Increasing number of chunks) and returned right away, so I think it's not getting computed until the manhattan_plot call later?

That makes sense. It shouldn't compute anything until you reference one of the resulting variables.

I can repeatedly call ds_lr.variant_linreg_beta.values and it takes ~20 minutes each time; something is happening that I do not understand.

It's rerunning the whole regression. The best way to start working with the results, IIRC, is ds[['variant_linreg_beta', 'variant_linreg_p_value']] = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute(). That will compute both arrays simultaneously and put the materialized versions back in your dataset. You could also just save them in a new dataset like gwas_results = ds[['variant_linreg_beta', 'variant_linreg_p_value']].compute().

There are more examples of this in that UKB notebook I mentioned, like this:

[Screenshot from the UKB notebook: materializing result variables with a progress bar]

That progress bar is quite helpful too. You might find some other useful examples like that in there.

Footnotes

  1. https://github.com/pystatgen/sgkit/issues/390

  2. https://github.com/pystatgen/sgkit/issues/390#issuecomment-730660134

  3. https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/20

  4. https://github.com/pystatgen/sgkit/issues/390#issuecomment-730449672

  5. https://github.com/dask/dask/pull/7000

  6. https://github.com/pystatgen/sgkit/issues/448#issuecomment-777376979

  7. https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748

  8. https://github.com/pystatgen/sgkit/issues/390#issuecomment-781503522

  9. https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/35
