-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GRCh38 genetic map #7
Comments
Hi, I won't claim the genetic map I provided with the QUILT example package is the best one. There are several genetic maps out there built from different data in different ways that might be more appropriate for any given analysis. The raw data for the one I provided came from I then raw liftOver (not entirely sure which version, md5sum 0bec21ac6843b96284ff976953da0a54) using hg19ToHg38.over.chain.gz md5sum 35887f73fe5e2231656504d1f6430900. To create the above formatted file, I had to fill in a few regions, which I did with the R code that's copied and pasted at the bottom of this message (I can't seem to attach a file to this message?). It's not a well written file, but if you can't reproduce it by downloading things and re-configuring paths, I can clean it up, if you want. About the differences between the two files, yes they look pronounced at the start of the file, but hopefully those differences settle down throughout. More on that in a minute. One thing, I can't validate the file Alkes provided. I'm not sure I've ever seen a formal definition for these types of files, but I've always used the format defined by files that other people provided. Here if we look at the top of the CEU files for chromosome 20 from the above tarball we see Now for Alkes file we have Anyway, ignoring that discrepancy, if we go only based on the provided genetic map column
Using functions from https://github.com/rwdavies/STITCH/blob/master/STITCH/R/genetic-map.R About your last comment, in general I think a more inclusive map would be a great idea. I was a bit lazy in the QUILT paper to just use a constant panel. An African American map might be a good idea, to get a combination of European and African PRDM9 hotspots. I'm not otherwise sure or knowledgeable of a best "global" map. I don't think an inconsistency between a map and a reference panel otherwise would be an issue, as long as they're all the same genome build. Hope that helps,
|
I got an email about this including this error message
but don't see this here in github which is weird, I'm wondering if you deleted it? In any case, if this is still a problem, I tried to check if I could load all of the modified Alkes chromosomes, and this ran without issue (I loaded all the STITCH functions as described above before running the below). So perhaps there is a weird error in your map fixing code? I think that's where the error occurred in what you ran, given the error messages that were printed to screen
|
Dear Robert, sorry for the late response. Indeed, I deleted my message since eventually I found a solution and I didn't mean to bother you any further. There was a really stupid problem with my script for recalculating the 3rd column of the genetic map files. Eventually, I used the Alkes' files after shifting the second column by 1 and recalculating the 3rd column using the formula: g[iRow, 3] = g[iRow - 1, 3] + ( (g[iRow, 1] - g[iRow - 1, 1]) / 1e6 * g[iRow - 1, 2]) Now it seems to be working and, comparing the results for chr20 using the Alkes' map file and the one you provided with QUILT, they are basically the same. I would like to ask you just one more thing related to the QC filters: In Liu et al. 2018, where STITCH was used, they applied a filter on the info score (greater than 0.4) and on the HWE-pvalue (smaller than 10^-6). I decided to apply the same filters after QUILT and I detected a slight increase in the accuracy results (the accuracy then truly increases when a further filter on the GP scores is applied). Do you think this is a good strategy? Is there any other QC filter you would suggest? Thank you very much for your support. Best, |
Great that you solved the problem already. re: the filters. In my own experience, the INFO score is usually by far and away the most useful filter. With STITCH (and I don't remember checking this with QUILT, but expect it to hold), the method is very well calibrated, i.e. the posterior genotype probabilities are very accurate, when assessing their calibration over many samples and SNPs. So this INFO score, which measures the spread of the genotype posterior probabilities, captures nearly all the information in how confident the imputation is, and hence how likely the imputation is to be correct. Other things I've checked like HWE aren't nearly as informative (though practically they make a lot of sense, a SNP imputed with an HWE p-value of 1x10^(-200) probably isn't one you want to analyze). Otherwise nothing comes to mind. If you don't fully trust your reference panel you could combine annotations of the SNPs in the panel from other sources (e.g. mapping related scores, depending on the info you have with your panel) with things like INFO to probably do more additional filtering, but in general it's probably not necessary. Hope that helps, |
Hello Robert, |
Hi, isn't the dataset from https://www.internationalgenome.org/data-portal/search?q=recombination, already the hg38 version, so why use hg19ToHg38.chain file to liftover? Otis. |
It's been a long time since I did this but I just had another look For the question I forgot to answer above, about why 0.4. I don't have a great answer, any cutoff is a tradeoff between sensitivity and specificity, and I've often found 0.4 to be a reasonable one |
Hi,
I downloaded the genetic map for GRCh38 used in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7059836/
This genetic map is the one provided by Eagle and can be found here: https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/
When I compared their chromosome 20 with the one provided in the example from QUILT, I found some differences that I can't explain.
These are the first 5 lines of the file you provided with QUILT (CEU-chr20-final.b38.txt.gz):
position COMBINED_rate.cM.Mb. Genetic_Map.cM.
82590 6.80436893751158 0
82603 6.8056503043227 0.0000884567961876505
83158 6.81470108539 0.00386559271508675
88108 6.83273214455531 0.0375983630877673
88453 6.83536710777191 0.0399556556776388
And these are the first 5 lines of the file I downloaded:
chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
20 81154 0.0000000000 0.000000000000000
20 82590 0.3500036909 0.000502605300132
20 82603 0.3494018702 0.000507147524445
20 83158 0.3501262382 0.000701467586646
20 83509 0.3558643956 0.000826375989502
Focusing on the positions they have in common (e.g., 82590, 82603, and 83158), it is evident that both COMBINED_rate and Genetic_Map are different in the two files and I wonder what these differences are due to and whether both maps are valid or not.
However, if the map you provided with QUILT is the best option, where could I find the map for the entire genome?
Furthermore, since I'm using the HRC panel as reference, would you suggest to use a more inclusive genetic map (i.e, including the positions from multiple populations)? I wonder whether an inconsistency between the genetic map and the reference could possibly represent an issue.
Thank you
The text was updated successfully, but these errors were encountered: