Compute LD-scores for each SNP bin using .bcor files #34
Comments
Hi Joni, I can add support for .bcor files, but I need to think about what the interface should look like. Can you please explain how the data is arranged in your case? E.g., do you have a single chromosome-wide bcor file, or a bunch of bcor files that cover an entire chromosome?
The latter, which may not be ideal. I have a .bcor for each 3Mb segment in the genome, akin to what you've generated for UK Biobank (although in a different format, obviously).
Hi, I added support for computing LD-scores from an arbitrary set of input .bcor files. Please see the end of the updated wiki page for details, and please let me know if it works for you! Cheers, Omer
This seems to work for generating LD scores - thank you!
Hi, sorry to reopen this one, but I've run into a memory issue with the script. It works fine for most of my data, but running on chromosomes 1-4 takes more than 5 days on a 90GB node. The script itself runs fine - it's just that reading in the bcors takes too long. I have a couple of solutions that I'd like to explore. The first is conversion of .bcor to .npz files: the UK Biobank default in PolyFun seems to run efficiently, so I'm wondering whether converting the .bcor files from LDstore into .npz files might work. Would the script be adaptable to running with .npz files rather than .bcor files?
Hi Joni, Not a problem, but I'm surprised that the loading of .bcor files is the memory bottleneck. .bcor files are packed pretty efficiently; I don't think they have any significant disadvantage vs .npz files. Could it be that you simply have more SNPs per LD region than in our UKB data? We applied a MAF>0.1% filter, which left us with ~20K SNPs per region. If that's the case, simple solutions are (1) apply a more stringent SNP filter; or (2) use smaller regions (i.e. <3Mb long). The first option in particular should be easy to try out - you won't have to recompute the LD, we can just apply a filter to ignore the SNPs that don't pass the stricter threshold. What do you think?
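For what it's worth, the post-hoc filter idea above could look roughly like the sketch below, assuming the LD matrix for a region is already in memory as a dense numpy array and a per-SNP MAF vector is available; the function name, array names, and threshold are illustrative, not part of PolyFun's actual interface.

    import numpy as np

    def filter_ld_by_maf(ld: np.ndarray, maf: np.ndarray, min_maf: float = 0.001):
        """Keep only SNPs with MAF above min_maf, subsetting rows and columns of the LD matrix."""
        keep = maf > min_maf                     # boolean mask over SNPs
        return ld[np.ix_(keep, keep)], keep      # filtered matrix + mask to subset the SNP metadata

    # Example with synthetic data
    rng = np.random.default_rng(0)
    ld = rng.normal(size=(1000, 1000)).astype(np.float32)
    ld = (ld + ld.T) / 2                         # symmetric, like an LD matrix
    maf = rng.uniform(0, 0.5, size=1000)
    ld_small, keep = filter_ld_by_maf(ld, maf, min_maf=0.001)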
Ok - I was also surprised at how long the read-in took. It may be that there is something wrong with the way I am running the script, so I'll explore further and get back to you. I don't have especially high SNP density in any region (the max is about 15K), so I don't think that should be an issue. Thanks!
An update:
Hi Joni, Sorry for this. I didn't write the .bcor code (it was written by the author of LDstore), so I don't know it that well. However, looking at the code, I found that it always created an LD matrix of 64-bit floats (instead of 32-bit), which might be the source of the excessive memory use. This can also slow the code down if the computer runs out of memory and has to swap to disk. So I just applied a very simple fix that changes it back to 32-bit - can you please try that? If it doesn't help, I'll try to speed it up further by rewriting the .bcor code in C instead of Python, but let's try the simplest solution first. Separately from this, I'm not sure what the source of the error you sent me is, so I updated the code to verify that the LD matrix only contains valid numbers. If the code crashes because it doesn't, that might mean the LD file is corrupt. Can you please try out the updated code and let me know how it goes? Thanks, Omer
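In generic numpy terms (not the actual bcor.py code), the two changes described above amount to something like the following sketch: cast the LD matrix to 32-bit floats to halve its memory footprint, and verify that it only contains finite values.

    import numpy as np

    def finalize_ld_matrix(ld: np.ndarray) -> np.ndarray:
        ld32 = np.asarray(ld, dtype=np.float32)   # 64-bit -> 32-bit halves the memory footprint
        if not np.all(np.isfinite(ld32)):         # catch NaN/Inf entries
            raise ValueError('LD matrix contains invalid values; the .bcor file may be corrupt')
        return ld32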
Hi Omer, Thanks - I'll try the new code and see how it works speed-wise! Don't worry about the SuSiE error - I think that's an issue with the way I've made the bcors. Thanks again!
Hi Omer, Just wanted to add that the time lag is due to the loops in bcor.py. In finemapper.py, the code requests the correlations (readcorr([])) for all the SNPs in the file/region, and the n x n matrix is read entry by entry in a loop. A simple solution would be to fetch only the upper or lower triangular half of the matrix.
Thanks, but the code already reads only half of the matrix (the .bcor file only stores one triangular half). One cause of the long loading time is that looping in Python is slow. If I find some time, I'll try rewriting the code in Cython, which could make it about 10x faster.
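As an illustration of why the per-entry Python loop hurts, here is a rough sketch of reconstructing the full symmetric matrix from a stored triangular half using vectorized numpy indexing instead of a loop; the buffer layout and names are assumptions for illustration, not the actual bcor.py internals.

    import numpy as np

    def expand_symmetric(tril_values: np.ndarray, n: int) -> np.ndarray:
        """Expand a flat row-major lower-triangular buffer (with diagonal) into a full n x n matrix."""
        ld = np.zeros((n, n), dtype=np.float32)
        rows, cols = np.tril_indices(n)      # indices of the stored half
        ld[rows, cols] = tril_values         # fill the lower triangle in one vectorized assignment
        ld[cols, rows] = tril_values         # mirror to the upper triangle
        return ld

    # Example: n=3 with stored values [r11, r21, r22, r31, r32, r33]
    print(expand_symmetric(np.array([1, .2, 1, .3, .4, 1], dtype=np.float32), 3))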
Hi, Just to comment on this one too: I ran into the same issue as @JoniColeman when running the fine-mapper script with the --geno flag (i.e. genotype plink files) rather than the UKB precomputed LD matrices. For certain chromosomes the jobs ran fast and efficiently, but for some others it is taking 4-5 days. I'm also wondering whether this has to do with memory, or whether any file conversion could make these jobs run faster. Thank you all!
Hi, I suspect the reason for the slowness is simply that the ldstore code is slow. It could be made faster if it were written not in Python but in a compiled language such as C. A possible workaround would be to rewrite it in Cython, but this is a big undertaking and I don't have the capacity for it right now. My suggestion is to convert the bcor to npz (using the
@omerwe and @jdblischak I found that calling for 20k x 20k SNPs
The only caveat is that we need 5-6x more temporary disk space to store the text file briefly before converting it to npz.
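For reference, the text-to-npz step could look roughly like the sketch below, assuming LDstore has already been used to dump the correlation matrix for one region to a whitespace-delimited text file; the file names are placeholders, the exact LDstore invocation is not shown, and deleting the text file afterwards is what keeps the extra disk usage temporary.

    import os
    import numpy as np
    import scipy.sparse as sparse

    def text_to_npz(txt_file: str, npz_file: str, remove_txt: bool = True) -> None:
        ld = np.loadtxt(txt_file, dtype=np.float32)        # dense n x n correlation matrix
        sparse.save_npz(npz_file, sparse.csr_matrix(ld))   # write a compressed sparse .npz
        if remove_txt:
            os.remove(txt_file)                            # reclaim the temporary disk space

    # text_to_npz('chr1_region1.ld.txt', 'chr1_region1.npz')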
@jerome-f wow, that's crazy. I'm happy to accept a PR for this!
Hi,
I am interested in creating priors with PolyFun using approach 3. I would like to pass estimates of LD from the data I have, stored in .bcor files. Is this possible? The --ld-ukb flag does something similar, but it seems to be set up for using UK Biobank specifically. I can see how to do the fine-mapping step with my bcor LD estimates, but I suspect accurate LD estimation will also be very important when creating the priors.
Thanks,
Joni