This is a pipeline for summary-statistics-based fine-mapping using three commonly used tools.
Clone this repo:
git clone https://github.com/mulinlab/CAUSALdb-finemapping-pip.git
Set up the conda environment and download the fine-mapping tools:
cd bin
bash 00_set_up.sh
cd ..
Build the reference panel:
cd ref
python 01_prepare_reference.py
cd ..
❗ PAINTOR's framework goes through every VCF file when computing LD, which is very time-consuming. So I preprocessed the VCF files from the 1000G FTP by splitting them into blocks and converting them into genotype matrices. Although the preparation takes hours, it can speed up the LD computation by about 30 times. If you cannot wait for the reference preparation, you can get the ready-made data from 115...212:/f/jianhua/jianhua_pipeline/Fine-mapping/ref. Also, if you are comfortable with Python, you can read the script to see what else I have done with the original VCF files.
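
To illustrate why a per-block genotype matrix makes LD cheap to compute, here is a minimal sketch in plain Python. It is not the code used in `01_prepare_reference.py`; the function names, the toy data, and the assumption that each variant is stored as a list of 0/1/2 dosages are all hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic variant: correlation is undefined, report 0 in this sketch.
        return 0.0
    return cov / sqrt(var_x * var_y)


def ld_matrix(genotypes):
    """Pairwise LD (r) for one block.

    genotypes: list of variants, each a list of per-sample dosages (assumed 0/1/2).
    """
    m = len(genotypes)
    ld = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            r = pearson_r(genotypes[i], genotypes[j])
            ld[i][j] = ld[j][i] = r
    return ld


# Hypothetical toy block: 3 variants x 4 samples.
toy_block = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],  # identical to variant 0
    [2, 1, 0, 1],  # mirror image of variant 0
]
ld = ld_matrix(toy_block)
```

Because each block is already loaded as a small matrix, the LD for a region only touches the variants in that block instead of rescanning whole-chromosome VCF files.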
I wrote the whole pipeline as a single Python script, and it is quite simple to run after activating the conda environment.
I also wrote a brief help:
(base) ➜ conda activate finemap
(finemap) ➜ python fine_map_pipe.py -h
usage: Finemap using three tools, output variants in total credible sets
python fine_map_pipe.py -s 120575 SUM_STAT ./
positional arguments:
input input summary statistics
output directory of output file, default:working directory
optional arguments:
-h, --help show this help message and exit
-p , --pop population of input data,[EUR,EAS,SAS,AMR,AFR],
default=EUR
-n , --maxcausal the maximum number of allowed causal SNPs, default=1
-s , --samplesize sample size of input data
-c , --cred the cutoff of credible set, default=0.95
-t , --thread number of threads, default=10
So we can fine-map the test data in the input folder, which can also be browsed in CAUSALdb, using:
python fine_map_pipe.py -s 120575 ./input/GR018.txt.gz output
Just like the test data, the input file should be a .txt or .txt.gz file containing the following columns:
Column | Description |
---|---|
CHR | chromosome, int, 1-22 |
BP | Base-pair position, int, hg19 |
rsID | can be NA |
MAF | Minor allele frequency, (0, 0.5] |
EA | Effect allele, forward strand, capital letters |
NEA | Non-effect allele, forward strand, capital letters |
BETA | Effect size |
SE | Standard error |
P | P value, (0, 1) |
Zscore | Equal to BETA/SE |
zless input/GR018.txt.gz| head -n2
CHR BP rsID MAF EA NEA BETA SE P Zscore
1 762320 exm2268640 C T -2.1671804559878427 2.2192489884814903 0.3288 -0.9765377689642318
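
The column rules above can be checked before running the pipeline. The following is a minimal sketch, not part of `fine_map_pipe.py`: `check_row` and `to_float` are hypothetical helpers, and the tolerance used for the `Zscore = BETA/SE` check is an assumption. The example values are taken from the example output row shown in this README.

```python
import math

REQUIRED_COLUMNS = ["CHR", "BP", "rsID", "MAF", "EA", "NEA", "BETA", "SE", "P", "Zscore"]


def to_float(text):
    """Parse a float, returning None when the field is empty or malformed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def check_row(row):
    """Return a list of problems found in one row (dict of column name -> string)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    if not (row["CHR"].isdigit() and 1 <= int(row["CHR"]) <= 22):
        problems.append("CHR must be an integer in 1-22")
    if not row["BP"].isdigit():
        problems.append("BP must be an integer")

    maf = to_float(row["MAF"])
    if maf is None or not (0 < maf <= 0.5):
        problems.append("MAF must be in (0, 0.5]")

    for col in ("EA", "NEA"):
        if not row[col].isupper():
            problems.append(col + " must be in capital letters")

    p = to_float(row["P"])
    if p is None or not (0 < p < 1):
        problems.append("P must be in (0, 1)")

    beta, se, z = (to_float(row[c]) for c in ("BETA", "SE", "Zscore"))
    if beta is None or se is None or z is None or se == 0:
        problems.append("BETA, SE and Zscore must be numeric, with SE != 0")
    elif not math.isclose(z, beta / se, rel_tol=1e-6):  # tolerance is an assumption
        problems.append("Zscore must equal BETA/SE")
    return problems


good_row = {
    "CHR": "1", "BP": "55505647", "rsID": "rs11591147",
    "MAF": "0.020899999999999967", "EA": "T", "NEA": "G",
    "BETA": "-1.4101773015832226", "SE": "0.2294923644583753",
    "P": "8.033e-10", "Zscore": "-6.144767844069152",
}
bad_row = dict(good_row, MAF="0.7", EA="t")  # two deliberate violations

good_problems = check_row(good_row)
bad_problems = check_row(bad_row)
```

Rows can be read from a `.txt.gz` file with `gzip.open(path, "rt")` and `csv.DictReader`; the delimiter to pass depends on how the file was written.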
The output is a txt file containing the summary statistics and posterior probabilities of the variants in all credible sets.
The example output is:
head -n2 ./output/GR018_total_credible_set.txt
CHR BP rsID MAF EA NEA BETA SE P Zscore PAINTOR CAVIARBF FINEMAP block_id label
1 55505647 rs11591147 0.020899999999999967 T G -1.4101773015832226 0.2294923644583753 8.033e-10 -6.144767844069152 0.946675 0.9511994 0.950832 34 7
I appended five columns to the columns of the input file:
Column | Description |
---|---|
PAINTOR | Posterior probability computed by PAINTOR |
CAVIARBF | Posterior probability computed by CAVIARBF |
FINEMAP | Posterior probability computed by FINEMAP |
block_id | No. of block in /ref/blocks.txt |
label | Because the credible sets defined by the three tools differ, this column records which tools' credible sets contain the variant. Each tool has a value (PAINTOR: 1, CAVIARBF: 2, FINEMAP: 4), and the label is the sum of the values of the tools whose credible set includes the variant. For example, label = 5 means the variant is in the credible sets defined by PAINTOR and FINEMAP, but not in CAVIARBF's credible set. |
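
The `label` column works as a bit mask, so it can be decoded with bitwise operations. A minimal sketch follows; the helper names `decode_label` and `encode_label` are hypothetical and not part of the pipeline.

```python
# Values taken from the table above: PAINTOR = 1, CAVIARBF = 2, FINEMAP = 4.
TOOL_BITS = {"PAINTOR": 1, "CAVIARBF": 2, "FINEMAP": 4}


def decode_label(label):
    """Return the tools whose credible set contains the variant."""
    return [tool for tool, bit in TOOL_BITS.items() if label & bit]


def encode_label(tools):
    """Inverse of decode_label: combine tool names into a single label."""
    return sum(TOOL_BITS[tool] for tool in set(tools))


in_five = decode_label(5)   # PAINTOR and FINEMAP only
in_seven = decode_label(7)  # all three tools, as in the example output row
```

For example, filtering the output to rows with `label == 7` keeps only variants that all three tools place in a credible set.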
- The default maximum number of allowed causal SNPs is 1. I have not extensively tested settings with more than one causal SNP, so I am not sure whether it works under all conditions.
- We used the approximately independent LD blocks determined by ldetect, which only covers three populations. Considering the cost of preparing the reference panel, I used the EUR blocks for all populations, as EUR is the most commonly studied population.
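
To make the `-c`/`--cred` cutoff concrete under the default of one causal SNP per block, the sketch below builds a credible set in the commonly used way: rank variants by posterior probability and keep adding them until the cumulative probability reaches the cutoff. This is an illustration under that assumption, not the exact procedure each tool applies internally; the function name and the toy posteriors are hypothetical.

```python
def credible_set(posteriors, cutoff=0.95):
    """Smallest set of variants whose cumulative posterior reaches the cutoff.

    posteriors: dict of variant ID -> posterior probability within one block,
    assumed to sum to about 1 (single-causal-variant setting).
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    cumulative = 0.0
    for variant, prob in ranked:
        selected.append(variant)
        cumulative += prob
        if cumulative >= cutoff:
            break
    return selected


# Hypothetical posteriors for one block.
toy_posteriors = {"rsA": 0.90, "rsB": 0.06, "rsC": 0.03, "rsD": 0.01}
set_95 = credible_set(toy_posteriors, cutoff=0.95)
set_50 = credible_set(toy_posteriors, cutoff=0.50)
```

With a lower cutoff the set shrinks to the top variant alone, which is why the choice of `-c` directly controls how many variants end up in the output file.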
- Schaid, D. J., Chen, W., & Larson, N. B. (2018). From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics, 19(8), 491–504. https://doi.org/10.1038/s41576-018-0016-z
- http://www.christianbenner.com/
- https://github.com/gkichaev/PAINTOR_V3.0
- https://bitbucket.org/Wenan/caviarbf/src/default/
Wang J, Huang D, Zhou Y, Yao H, Liu H, Zhai S, Wu C, Zheng Z, Zhao K, Wang Z, Yi X, Zhang S, Liu X, Liu Z, Chen K, Yu Y, Sham PC, Li MJ. CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies. Nucleic Acids Res. 2020 Jan 8;48(D1):D807-D816.