Create a public GWAS workflow for collaborative optimization #438
Simulated genotypes from msprime should be a good dataset performance-wise; let me know if I can help here. We could possibly also reproduce some of the sub-optimal behaviour on smaller datasets, to make things a bit more nimble. |
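For illustration, a minimal sketch of what an msprime-based simulation could look like (the parameter values below are placeholders, not a recommendation, and the older `msprime.simulate` API is used for brevity):

```python
# Hypothetical sketch: simulate a small genotype matrix with msprime.
# All parameter values are illustrative only.
import msprime

ts = msprime.simulate(
    sample_size=1_000,        # number of haploid genomes
    length=1_000_000,         # 1 Mb of sequence
    recombination_rate=1e-8,
    mutation_rate=1e-8,
    random_seed=42,
)

# variants x haplotypes matrix of 0/1 alleles; pairing adjacent columns
# gives diploid alternate allele counts in {0, 1, 2}.
haplotypes = ts.genotype_matrix()
allele_counts = haplotypes[:, 0::2] + haplotypes[:, 1::2]
print(allele_counts.shape)
```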
Would calculating LD scores be a reasonable goal? |
I think it's probably best to avoid that one because its calculation often involves domain-specific shortcuts. I think it will be more compelling and easier for all of us if we start with a workflow that uses only general-purpose operators. Here is a GWAS Simulation (gist) example that writes somewhat representative data to cloud storage and reads it out to do linear regressions (with several algebraic shortcuts). This is effectively all a GWAS is. How does that look at a glance, @quasiben? As far as I know, all of the operations in use there are supposed to be implemented in CuPy. You could also cut out the part that uses
It would be terrific if we could generate allele dosages with a more accurate distribution. In the case of UKB, the uncertainty in allele count comes from imputation, so I'm not sure how to simulate that. It may not be important either -- simulating alternate allele counts alone, like in https://github.com/pystatgen/sgkit/issues/23#issue-653340614, might give this data block compressibility similar enough to the real thing. I think that's likely to be the most impactful attribute of the genetic data as far as performance is concerned, at least for this example. Let me know if you have any immediate ideas; otherwise I think we could probably start with this simpler example and add more nuance if it ends up being easy to optimize. |
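As a point of reference, here is a minimal sketch of simulating hard-call alternate allele counts under a Hardy-Weinberg/binomial assumption (the shapes and allele-frequency distribution are purely illustrative; this is not the exact code from the linked issue):

```python
# Hypothetical sketch: simulate alternate allele counts (0/1/2) by drawing
# per-variant allele frequencies and then sampling genotypes binomially.
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples = 10_000, 5_000

# Skewed frequency distribution so that most variants are rare.
freqs = rng.beta(a=0.5, b=5.0, size=n_variants)

# Diploid alternate allele counts under Hardy-Weinberg assumptions.
allele_counts = rng.binomial(
    2, freqs[:, None], size=(n_variants, n_samples)
).astype("int8")
```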
Ah, good points. I don't have code to simulate dosages, @eric-czech. I think what you're doing in the link above is going to be a pretty good proxy for performance purposes. |
I poked at this a bit recently. @beckernick reminded me that there is a cuml/dask linear regression, but that probably won't work here. With the broken-out least squares, we need to fix a scipy/cupy issue within dask. I don't think this should be too hard to resolve; once that's done we can come back to testing. |
Probably -- the code is solving a large number of individual OLS regressions at once, so my assumption was that packing it all into linear algebra operators was going to be faster than iterating over a dimension of the input matrix and doing the regressions individually (i.e. by calling cuml's LinearRegression.fit thousands or millions of times); see the sketch after this comment.
That's a bummer! |
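To make the batched-OLS point above concrete, here is a rough sketch of solving many regressions with a single linear-algebra call (plain NumPy for clarity; the real workflow uses dask arrays and further algebraic shortcuts, and all names and shapes here are made up):

```python
# Hypothetical sketch: fit OLS for many outcomes at once rather than
# looping over phenotypes/variants and fitting one model at a time.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_covariates, n_outcomes = 10_000, 10, 1_000

X = rng.normal(size=(n_samples, n_covariates))  # shared design matrix
Y = rng.normal(size=(n_samples, n_outcomes))    # many response vectors

# A single lstsq call returns coefficients for every column of Y; the
# same pattern works with dask.array or CuPy arrays in place of NumPy.
betas, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(betas.shape)  # (n_covariates, n_outcomes)
```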
It's not too bad. I think most of the parts we care about are working. However, the next issue might take a bit more work. I don't think there is much support in CuPy for distribution functions:
|
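For context on where distribution functions enter the picture, here is a hypothetical illustration: converting regression t statistics into p-values. Since CuPy has no equivalent of scipy.stats, one workaround is to move just that small final step back to the CPU (the t statistics and degrees of freedom below are placeholders):

```python
# Hypothetical sketch: p-values from t statistics when the GPU array
# library lacks distribution functions. Only the small t-stat array is
# copied back to host memory for the scipy call.
import cupy as cp
import numpy as np
from scipy import stats

t_stats = cp.random.standard_normal(1_000)  # stand-in for regression output
df = 10_000 - 10                            # residual degrees of freedom (illustrative)

p_values = 2 * stats.t.sf(np.abs(cp.asnumpy(t_stats)), df)
```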
I also benchmarked what was currently supported and saw a modest improvement:
Data was pre-generated before running the example, which was executed on a single GPU.
|
I ran with the Biobank settings:
And the computation took ~100 seconds. This was done on a DGX-2 with 16x 32 GB V100s. |
Very cool @quasiben! While this isn't an entirely like-for-like comparison, the times I saw for this on real data were ~150 seconds using 60 n1-highmem-16 GCE nodes. That figure is somewhat downwardly biased, though, since the number of individuals used varies by phenotype and, for those settings in the simulation, I chose a number on the high side. Remind me again what the storage medium is? Is it possible to run the benchmark while reading from cloud storage? |
Demonstrating GPU cost efficiency and progressing on issues raised in https://github.com/pystatgen/sgkit/issues/390 may both be aided by creating a representative public workflow for UKB GWAS. A single notebook that executes a simplified version of this code (with an initial rechunking) should be sufficient if run on simulated data (as zarr in Google Storage) with block compressibility reasonably similar to real data; a sketch of that I/O pattern appears below.
Hey @quasiben, we'll update this issue with any progress towards a shared workflow for us all to look at.
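To illustrate the simulated-data I/O pattern described above, here is a minimal sketch assuming a gcsfs-backed bucket you can write to (the bucket name, array shape, and chunk sizes are placeholders, and real UKB-scale data would be far larger):

```python
# Hypothetical sketch: write simulated genotype data to zarr in Google
# Cloud Storage, then read it back with an initial rechunk before compute.
import dask.array as da
import gcsfs

fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("gs://my-example-bucket/simulated-gwas.zarr")

# Placeholder alternate allele counts, chunked along both dimensions.
genotypes = da.random.randint(0, 3, size=(20_000, 10_000),
                              chunks=(2_000, 2_000)).astype("int8")
genotypes.to_zarr(store, overwrite=True)

# Read back and rechunk so each block spans all samples for a slab of
# variants, which is the layout the regression step expects.
gt = da.from_zarr(store).rechunk((2_000, -1))
```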