-
Notifications
You must be signed in to change notification settings - Fork 4
/
GimputeTutorial.Rmd
862 lines (710 loc) · 42.1 KB
/
GimputeTutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
---
title: "Gimpute: An efficient genetic data imputation pipeline"
author:
- name: Junfang Chen, Dietmar Lippold and Emanuel Schwarz
affiliation: Central Institute of Mental Health, Heidelberg University, Germany
date: "Modified: 27 Feb 2019. Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output:
BiocStyle::html_document:
toc_float: true
number_sections: true
href: GimputeTutorial.html
vignette: >
%\VignetteIndexEntry{GimputeTutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
```{r global_options, include=FALSE}
knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE, fig.width=10, fig.height=10)
options(width=80)
```
# About Gimpute
## Background
Genome-wide association studies (GWAS) have become an essential tool in assisting the improvement of understanding of neuropathology of diseases. Moreover, the imputation of genotyping data with missing information can enhance the power of GWAS by using haplotype relationships between genetic variants [1,2]. In the meantime, the imputation can substantially reduce the cost of genotyping large number of samples. Furthermore, the comparison of significant findings across different cohorts and genotyping platforms is also achievable. Currently large amounts of genotyping data are produced at an unprecedented rate to be analyzed; however, the preparation of ready-to-use data sets needs to be of high quality. Thus the implementation of an efficient and reliable genetic data pre-processing and imputation pipeline is desirable.
## Scope of the pipeline
In order to ensure the reliability and reproducibility of genome-wide association study (GWAS) data,
we set up an efficient, automatic and comprehensive genotype data processing and imputation pipeline termed __Gimpute__.
It consists of pre-processing (genetic variant information updating/matching/liftOver,
quality control of genetic variants and study samples, the alignment of variants to the imputation references),
pre-phasing, imputation and post-imputation quality control [3], as well as an extension to the Genipe framework in this vignette, which is easy-to-follow and user-friendly.
# Getting started
## Prerequisites
Gimpute runs on any 64-bit x86 Linux distribution and it requires the following tools [4,5,6,7]:
* [R](https://www.r-project.org/) (>=3.5.0)
* [PLINK](https://www.cog-genomics.org/plink2)
* [GCTA64](http://cnsgenomics.com/software/gcta/#Download)
* [SHAPEIT](http://www.shapeit.fr/)
* [IMPUTE2](https://mathgen.stats.ox.ac.uk/impute/impute_v2.html)
* [IMPUTE4](https://jmarchini.org/impute-4/)
* [QCTOOLv2](http://www.well.ox.ac.uk/~gav/qctool_v2/documentation/download.html)
* [GTOOL](http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html)
You can install the tools from the command line as the following examples:
```{r eval=FALSE}
## PLINK
wget https://www.cog-genomics.org/static/bin/plink180717/plink_linux_x86_64.zip
unzip plink_linux_x86_64.zip
## GCTA64
wget http://cnsgenomics.com/software/gcta/gcta_1.91.5beta.zip
unzip gcta_1.91.5beta.zip
## SHAPEIT
wget https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.v2.r904.glibcv2.17.linux.tar.gz
tar -zxvf shapeit.v2.r904.glibcv2.17.linux.tar.gz
## IMPUTE2
wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
tar -zxvf impute_v2.3.2_x86_64_static.tgz
## GTOOL
wget http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool_v0.7.5_x86_64.tgz
tar -zxvf gtool_v0.7.5_x86_64.tgz
## QCTOOLv2
wget http://www.well.ox.ac.uk/~gav/resources/qctool_v2.0-CentOS6.8-x86_64.tgz
tar -zxvf qctool_v2.0-CentOS6.8-x86_64.tgz
```
Note that, IMPUTE4 can be downloaded only after applying for access. IMPUTE4 and QCTOOLv2 are not required if you are using IMPUTE2 for the imputation.
## Installation
Development version from Github:
* Install Gimpute in R (Recommended)
```{r eval=FALSE}
install.packages("devtools")
library("devtools")
install_github("transbioZI/Gimpute", build_vignettes=TRUE)
```
* Install Gimpute from the command line
```{r eval=FALSE}
git clone https://github.com/transbioZI/Gimpute
R CMD build Gimpute
R CMD INSTALL Gimpute_*.tar.gz
```
Gimpute runs on any 64-bit x86 Linux distribution. Additional dependencies are described below.
## Experimental data
PLINK format files (PED/MAP or BED/BIM/FAM) are required as the input genotyping data for our pipeline.
Free datasets are available under ./extdata/ in our package. The way for loading these PLINK binary files is shown as below. Since sharing of genetic genotyping data is strictly prohibited, we did not upload any private genotyping data for this demonstration.
```{r eval=FALSE}
library(Gimpute)
bedFile <- system.file("extdata", "dataChr21.bed", package="Gimpute")
bimFile <- system.file("extdata", "dataChr21.bim", package="Gimpute")
famFile <- system.file("extdata", "dataChr21.fam", package="Gimpute")
```
## Get it Setup
### File naming conventions
The files processed and generated are named in a consistent way across all datasets as follows:
* filename.# represents a set of the three PLINK files, i.e. the .bed, .fam and .bim file.
Binary PLINK files are recommended due to their smaller file size and fast processing.
* filename.txt denotes a file with additional information, such as metadata information, SNP names, in pure text format.
The names of the output files begin with the number of the step which produces the respective file.
### Directory naming conventions
The recommended directories are named as follows:
* 1-genoUpdate: Updated genotype information related files are generated in this directory.
* 2-genoQC: Quality controlled genotype data related files are generated in this directory.
* 3-checkAlign: Aligned with the imputation reference related files are generated in this directory.
* 4-imputation: All imputation related files are generated in this directory.
* 5-reductAndExpand: Filtered imputation data and expanded to the original non-imputed genotype files are generated in this directory.
* 6-finalResults: Final processed and filtered imputed genotype files are generated in this directory.
### Required packages and tools
```{r eval=FALSE}
## Gimpute has the following dependencies:
library(lattice)
library(doParallel)
## Define required tools
toolDIR <- "/home/Gimpute/tools/"
plink <- paste0(toolDIR, "plink")
gcta <- paste0(toolDIR, "gcta64")
shapeit <- paste0(toolDIR, "shapeit")
gtool <- paste0(toolDIR, "gtool")
qctool <- paste0(toolDIR, "qctool")
imputeTool <- "impute4" ##
if (imputeTool == "impute2"){
impute <- paste0(toolDIR, "impute2")
} else if (imputeTool == "impute4"){
impute <- paste0(toolDIR, "impute4.1_r291.2")
}
```
### Configuration
* Self-prepared configuration files
+ metadataFile: A text file with at least the following content (column names are in parentheses): family ID in the PLINK files (FID), individual ID in the PLINK files (IID), ID in the description files (descID), self identified ancestry (ance; e.g. AFR: African, AMR: Ad Mixed American, EAS: East Asian, EUR: European, SAS: South Asian), sex (sex; 1 = male, 2 = female, 0 = missing), age (age), group (group; 0 = control/unaffected, 1 = case/affected). All unknown and missing values are represented by the value NA. Lines with a missing value for FID or IID are not contained.
+ removedSampIDFile: A text file that stores sample IDs, each ID per line, which will be excluded. For instance, samples with missing sex information can be included in this file.
+ excludedProbeIdsFile: A text file that stores probe IDs, each ID per line, which will be excluded.
* Genotyping chip annotation file: Most of common chip annotation files can be directly downloaded from [Dr. William Rayner'page](http://www.well.ox.ac.uk/~wrayner/strand/).
This annotation file is particularly important for strand checking and genomic information of genotypes on a variety of genome builds. Gimpute requires the chip annotation file that is organized into two different types:
+ If the snp name of your study genotype data starts with "SNP_", then the chip type "SNPIDstudy" is used; Usually, Affymetrix chip belongs to this category. The prepared output annotation file must consist of at least the following column names : SNPIDstudy, rs, chr, pos, strand. The column "strand" must only have two kinds of values "-" and "+". Variants with all other values are excluded.
+ If the snp name of your study genotype data starts with "rs", then the chip type "rsIDstudy" is used; The annotation file should be prepared in advance and must consist of at least the following column names : rsIDstudy, chr, pos, strand. Illumina chip is often specified in this format. The column "strand" must only have two kinds of values "-" and "+". Variants with all other values are excluded.
* Imputation reference files: Publicly avaiable reference datasets can be downloaded from [IMPUTE2's reference page](https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference). The reference files from "1000 Genome Project Phase1v3" are the default panel used in our package. 1000 Genomes Phase 3 is an alternative reference panel.
```{r eval=FALSE}
## Download imputation reference panels
## Reference panel 1: 1000Gphase1v3_macGT1
wget https://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute_macGT1.tgz
tar -zxvf ALL_1000G_phase1integrated_v3_impute_macGT1.tgz
## Reference panel 2: 1000Gphase3
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz
tar -zxvf 1000GP_Phase3.tgz
tar -zxvf 1000GP_Phase3_chrX.tgz
```
A string indicating the type of imputation reference panels is used: c("1000Gphase1v3_macGT1", "1000Gphase3"). This must be defined before running pipeline.
```{r eval=FALSE}
## Define the directory where you place the imputation reference files
impRefDIR <- "/home/imputeRef/1000Gphase1v3_macGT1/"
referencePanel <- "1000Gphase1v3_macGT1"
## Alternatively,
impRefDIR <- "/home/imputeRef/1000GP_Phase3/"
referencePanel <- "1000Gphase3"
```
* Global parameters
+ ancestrySymbol: A string symbol for self-identified ancestry (AFR: African, AMR: Ad Mixed American, EAS: East Asian, EUR: European, SAS: South Asian, or NULL)
+ chipType: A string symbol to indicate the type of chip annotation file: c("rsIDstudy", "SNPIDstudy")
+ hhCutOff (optional): the cutoff for removing male haploid heterozygous SNPs on the chromosome X. The default is 0.005.
+ hhSubjCutOff (optional): the cutoff for removing male subjects with haploid heterozygous SNPs on the chromosome X. The default is 15.
+ cutoff (optional): the cutoff that distinguishes the outliers from ordinary population using PCA. The default is NULL meaning that there are no outliers or outliers are not required to be removed.
+ referencePanel: A string symbol to indicate the type of input imputation reference panel: c("1000Gphase1v3_macGT1", "1000Gphase3")
+ imputeTool: A string symbol to indicate the type of used imputation tool: c("impute2", "impute4")
### Gimpute flowchart
![](Gimpute.png)
# Running pipeline
## Genotype information update
All files named in this subsection are in the directory 1-genoUpdate/.
In this subsection, the function `updateGenoInfo()` is mandatory and updates genotype information of original genotype data by the use of metadata and configuration files.
### Input files
* Raw genotype data in PLINK format: dataChr21.#
* Metadata information file about the instances: metadataFile.
* Chip annotation file for genotype information mapping and strand check: chipAnnoFile.
* The file with the IDs of the excluded instances: removedSampIDFile.
* The file with the IDs of the excluded probes of the chip: excludedProbeIdsFile.
### Processing example
0. Locate the original PLINK files and other required files either from our package or your side.
```{r eval=FALSE}
## Not run
bedFile <- system.file("extdata", "dataChr21.bed", package="Gimpute")
bimFile <- system.file("extdata", "dataChr21.bim", package="Gimpute")
famFile <- system.file("extdata", "dataChr21.fam", package="Gimpute")
metadataFile <- system.file("extdata", "1_01_metaData.txt", package="Gimpute")
removedSampIDFile <- system.file("extdata", "excludedSampIDs.txt",
package="Gimpute")
excludedProbeIdsFile <- system.file("extdata", "excludedProbeIDs.txt",
package="Gimpute")
chipAnnoFile <- system.file("extdata", "coriellAffyChip.txt",
package="Gimpute")
```
```{r eval=FALSE}
inputPrefix <- "dataChr21"
ancestrySymbol <- "EUR"
outputPrefix <- "1_11_removedYMtSnp"
metaDataFile <- "1_01_metaData.txt"
chipType <- "rsIDstudy"
updateGenoInfo(plink, inputPrefix, metaDataFile, removedSampIDFile,
ancestrySymbol, excludedProbeIdsFile, chipAnnoFile,
chipType, outputPrefix, keepInterFile=TRUE)
```
### Individual steps
Note that, the output files of each step are the input of the next step.
1. Use the raw genotype data and metadata information file to generate a metadata file. In the following example, the metadata file is already preprocessed.
+ Output file: 1_01_metaData.txt
```{r eval=FALSE}
system(paste0("scp ", metadataFile, " ."))
```
2. Remove all excluded instances (subjects, samples) from the raw chip data. Duplicated, related or useless instances are suggested to be removed.
+ Output files: 1_02_removedExclInst.#
```{r eval=FALSE}
inputPrefix <- "dataChr21"
outputPrefix2 <- "1_02_removedExclInst"
removedSampIDFile <- system.file("extdata", "excludedSampIDs.txt",
package="Gimpute")
removeSampID(plink, removedSampIDFile, inputPrefix,
outputPrefix=outputPrefix2)
```
3. Replace all values for affection (phenotype, group) and sex in the PLINK fam file by those values in the file 1_01_metaData.txt. In regards to this, the missing value for sex is represented by the value 0 (zero), the missing value for affection (group) is represented by the value -9.
+ Output files: 1_03_replacedGroupAndSex.#
```{r eval=FALSE}
metaDataFile <- "1_01_metaData.txt"
outputPrefix3 <- "1_03_replacedGroupAndSex"
updateGroupIdAndSex(plink, inputPrefix=outputPrefix2,
metaDataFile, outputPrefix=outputPrefix3)
```
4. Remove instances which have the missing affection value (i.e. the value -9).
+ Output files: 1_04_removedNoGroupId.#
```{r eval=FALSE}
outputPrefix4 <- "1_04_removedNoGroupId"
removeNoGroupId(plink, inputPrefix=outputPrefix3, outputPrefix=outputPrefix4)
```
5. Remove instances which have a value for ance (self identified ancestry) in the file 1_01_metaData.txt that is excluded from the dataset (for the excluded ancestry values).
+ Output files: 1_05_removedWrongAnceInst.#
```{r eval=FALSE}
outputPrefix5 <- "1_05_removedWrongAnceInst"
removedWrongAnceInst(plink, inputPrefix=outputPrefix4, metaDataFile,
ancestrySymbol, outputPrefix=outputPrefix5)
```
6. Remove the excluded probes of the chip defined in advance. Improper SNP name often starts with unexpected format such as AFFX, cnvi.
+ Output files: 1_06_removedExclProbe.#
```{r eval=FALSE}
outputPrefix6 <- "1_06_removedExclProbe"
removedExclProbe(plink, inputPrefix=outputPrefix5,
excludedProbeIdsFile, outputPrefix=outputPrefix6)
```
7. Remove probes which are not contained in the annotation file or which have missing values in the annotation file.
+ Primary output files: 1_07_removedUnmapProbes.#
+ Output file with the removed probes:1_07_probesUnmapped2ChipRef.txt
```{r eval=FALSE}
outputPrefix7 <- "1_07_removedUnmapProbes"
outputSNPfile7 <- "1_07_probesUnmapped2ChipRef.txt"
removedUnmapProbes(plink, inputPrefix=outputPrefix6, chipAnnoFile,
outputPrefix=outputPrefix7, outputSNPfile=outputSNPfile7)
```
8. Remove probes which belong to a SNP or have a position (i.e. a base pair position and chromosome) in the annotation file which is the same as that of at least one other probe.
+ Primary output files: 1_08_removedDoubleProbes.#
+ Output file with the removed probes:1_08_probesDouble.txt
```{r eval=FALSE}
outputSNPdupFile8 <- "1_08_probesDouble.txt"
outputPrefix8 <- "1_08_removedDoubleProbes"
removedDoubleProbes(plink, inputPrefix=outputPrefix7, chipAnnoFile,
chipType, outputSNPdupFile=outputSNPdupFile8,
outputPrefix=outputPrefix8)
```
9. Use the annotation file to replace the probe names by SNP names, in order to update the SNP chromosomal location and the SNP base pair position, as well as to convert all alleles to the positive strand. From this step on, chromosome codes are mapped as: X=23, Y=24, XY (pseudo-autosomal region of X)=25, MT (mitochondrial)=26.
+ Output files: 1_09_updatedSnpInfo.#
```{r eval=FALSE}
outputPrefix9 <- "1_09_updatedSnpInfo"
updatedSnpInfo(plink, inputPrefix=outputPrefix8,
chipAnnoFile, chipType, outputPrefix=outputPrefix9)
```
10. Correct the chromosome of the SNPs in the pseudoautosomal region by using the plink option --split-x with the name of the genome build of the annotation file.
+ Output files: 1_10_changedXyChr.#
```{r eval=FALSE}
outputPrefix10 <- "1_10_splitXchr"
splitXchr(plink, inputPrefix=outputPrefix9, outputPrefix=outputPrefix10)
```
11. Remove the SNPs of the chromosomes 24 (Y) and 26 (MT).
+ Output files: 1_11_removedYMtSnp.#
```{r eval=FALSE}
outputPrefix11 <- "1_11_removedYMtSnp"
removedYMtSnp(plink, inputPrefix=outputPrefix10, outputPrefix=outputPrefix11)
```
## Quality control
All files named in this subsection are in the directory 2-genoQC/.
In this subsection, the function `genoQC()` is mandatory and performs QC on the
genotype data. `removedSnpHetX()`, `removedMaleHetX()`, and `removeOutlierByPCs()`
are optional and can be used part of QC . `removedSnpHetX()` and `removedMaleHetX()`
exclude heterozygous markers in male, and males with haploid heterozygous SNPs
on chromosome X. Our default QC procedure does not remove any population outliers
before imputation. However, if your study data includes significant population outliers
and this may affect the qualtiy of subquent imputation analysis. For this, it is
recommended to remove them by the function `removeOutlierByPCs()` after looking at
PCA plot generated by `plotPCA4plink()`.
### Input files
* Genotype data with updated genotype information:../1-genoUpdate/1_11_removedYMtSnp.#
### Processing example
This step does QC in a single function, including all required substeps to
ensure that the QC-ed genotype data is of high quality prior to imputation.
```{r eval=FALSE}
inputPrefix <- "2_02_removedInstHetX"
outputPrefix <- "2_13_removedSnpHweFemaleX"
genoQC(plink, inputPrefix,
snpMissCutOffpre=0.05,
sampleMissCutOff=0.02,
Fhet=0.2, cutoffSubject=0.05, cutoffSNP=0.1,
snpMissCutOffpost=0.02,
snpMissDifCutOff=0.02,
femaleChrXmissCutoff=0.05,
pval4autoCtl=0.000001,
pval4femaleXctl=0.000001, outputPrefix, keepInterFile=TRUE)
```
### Individual steps
Alternatively, we elaborate step by step instructions of QC procedure in `genoQC()` for better understanding.
Note that, the output files of each step are the input of the next step.
1. Determine for each SNP of the chromosome 23 from the genotype data the number of male instances which have the value one as the minor allele count for that SNP. Then, remove all SNPs which number is higher than 0.5 % of the number of male instances. Note that 0.5% is just an example, this should be dataset specific.
+ Primary output files: 2_01_removedSnpHetX.#
+ Output file with the number of instances with heterozygous alleles for each SNP of the chromosome 23 before SNP removal (each line contains a SNP name and the respective number, lines are sorted descending by number): 2_01_snpHHfreqAll.txt
+ Output file with the number of instances with heterozygous alleles for each SNP of the chromosome 23 after SNP removal:2_01_snpHHfreqRetained.txt
```{r eval=FALSE}
inputPrefix <- "1_11_removedYMtSnp"
hhCutOff <- 0.005 ## can be tuned
outputPrefix <- "2_01_removedSnpHetX"
outputHetSNPfile <- "2_01_snpHHfreqAll.txt"
outputRetainSNPfile <- "2_01_snpHHfreqRetained.txt"
removedSnpHetX(plink, inputPrefix, hhCutOff, outputPrefix,
outputHetSNPfile, outputRetainSNPfile)
```
2. Determine for each male instance the number of SNPs of the chromosome 23 which have the value one as the minor allele count for that instance. Next, remove all instances which number is higher than 15. Note that 15 is just an example, this should be dataset specific.
+ Primary output files: 2_02_removedInstHetX.#
+ Output file stores male subjects that have heterozygous SNPs with their frequency (if any): 2_02_instHetXfreqAll.txt
+ Output file stores male subjects that have heterozygous SNPs with their frequency after 'improper' subject removal (if any):2_02_instHetXfreqRetained.txt
+ Output file stores all heterozygous SNPs with their frequency (the number of males for this SNP): 2_02_snpHHfreqAll.txt
```{r eval=FALSE}
inputPrefix <- "2_01_removedSnpHetX"
hhSubjCutOff <- 15
outputPrefix <- "2_02_removedInstHetX"
outputSubjHetFile <- "2_02_instHetXfreqAll.txt"
outputRetainSubjectFile <- "2_02_instHetXfreqRetained.txt"
outputHetSNPfile <- "2_02_snpHHfreqAll.txt"
removedMaleHetX(plink, inputPrefix, hhSubjCutOff,
outputPrefix, outputSubjHetFile,
outputRetainSubjectFile, outputHetSNPfile)
```
These two steps are automatically skipped if you are working on the autosomal data.
3. Set all heterozygous alleles of SNPs of the chromosome 23 for males (i.e. when the SNP has the value one as the minor allele count) as missing.
+ Output files: 2_03_setHeteroHaploMissing.#
```{r eval=FALSE}
outputPrefix3 <- "2_03_setHeteroHaploMissing"
setHeteroHaploMissing(plink, inputPrefix, outputPrefix=outputPrefix3)
```
4. Remove SNPs with missingness >= 0.05 (before instance removal).
+ Output files: 2_04_removedSnpMissPre.#
```{r eval=FALSE}
outputPrefix4 <- "2_04_removedSnpMissPre"
removedSnpMiss(plink, snpMissCutOff=0.05,
inputPrefix=outputPrefix3, outputPrefix=outputPrefix4)
```
5. Remove instances with missingness >= 0.02.
+ Output files: 2_05_removedInstMiss.#
```{r eval=FALSE}
outputPrefix5 <- "2_05_removedInstMiss"
removedInstMiss(plink, sampleMissCutOff=0.02,
inputPrefix=outputPrefix4, outputPrefix=outputPrefix5)
```
6. Remove instances with great autosomal heterozygosity deviation, i.e. with |Fhet| >= 0.2
+ Output files: 2_06_removedInstFhet.#
```{r eval=FALSE}
outputPrefix6 <- "2_06_removedInstFhet"
removedInstFhet(plink, Fhet=0.2,
inputPrefix=outputPrefix5, outputPrefix=outputPrefix6)
```
7. Replace the paternal ID and maternal ID of instances (childs) by the value zero if the paternal ID and the maternal ID do not belong to any instance (parent) with the same family ID as the child.
+ Output files: 2_07_removedParentIdsMiss.#
```{r eval=FALSE}
outputPrefix7 <- "2_07_removedParentIdsMiss"
removedParentIdsMiss(plink, inputPrefix=outputPrefix6,
outputPrefix=outputPrefix7)
```
8. Exclude subjects and/or genetic variants (SNPs) based on Mendel errors for the family data (trio/duo). (Optional, if no family data)
+ Output files: 2_08_removedMendelErr.#
```{r eval=FALSE}
outputPrefix8 <- "2_08_removedMendelErr"
removedMendelErr(plink, inputPrefix=outputPrefix7, cutoffSubject,
cutoffSNP, outputPrefix=outputPrefix8)
```
9. Remove SNPs with missingness >= 0.02 (after instance removal).
+ Output files: 2_09_removedSnpMissPost.#
```{r eval=FALSE}
outputPrefix9 <- "2_09_removedSnpMissPost"
removedSnpMiss(plink, snpMissCutOff=0.02,
inputPrefix=outputPrefix8, outputPrefix=outputPrefix9)
```
10. Remove SNPs with difference >= 0.02 of SNP missingness between cases and controls.
+ Output files: 2_10_removedSnpMissDiff.#
```{r eval=FALSE}
outputPrefix10 <- "2_10_removedSnpMissDiff"
groupLabel <- getGroupLabel(inputFAMfile=paste0(outputPrefix8, ".fam"))
removedSnpMissDiff(plink, inputPrefix=outputPrefix9, snpMissDifCutOff=0.02,
outputPrefix=outputPrefix10, groupLabel)
```
11. Remove chrX SNPs with missingness >= 0.05 in females. (Optional, if no chrX data)
+ Output files: 2_11_removedSnpFemaleChrXmiss.#
```{r eval=FALSE}
outputPrefix11 <- "2_11_removedSnpFemaleChrXmiss"
removedSnpFemaleChrXmiss(plink, femaleChrXmissCutoff=0.05,
inputPrefix=outputPrefix10,
outputPrefix=outputPrefix11)
```
12. Remove autosomal SNPs with Hardy-Weinberg equilibrium p < 10-6 in controls.
+ Primary output files: 2_12_removedSnpHweAuto.#
+ Output file with HWE p-value for the autosomal SNPs before removing, sorted ascending by the p-values: 2_12_snpHwePvalAuto.txt
+ Output file with the removed SNP names: 2_12_snpRemovedHweAuto.txt
```{r eval=FALSE}
outputPvalFile <- "2_12_snpHwePvalAuto.txt"
outputSNPfile <- "2_12_snpRemovedHweAuto.txt"
outputPrefix12 <- "2_12_removedSnpHweAuto"
if (groupLabel == "control" | groupLabel == "caseControl"){
removedSnpHWEauto(groupLabel="control", plink,
inputPrefix=outputPrefix11,
pval=0.000001, outputPvalFile,
outputSNPfile, outputPrefix=outputPrefix12)
}
```
13. Remove chrX SNPs with HWE p < 10-6 in female controls. (Optional, if no chrX data)
+ Primary output files: 2_13_removedSnpHweFemaleXct.#
+ Output file with HWE p-value for the female chrX SNPs before removing, sorted ascending by the p-values:2_13_snpHwePvalfemaleXct.txt
+ Output file with the removed SNP names:2_13_snpRemovedHweFemaleXct.txt
```{r eval=FALSE}
outputPrefix <- "2_13_removedSnpHweFemaleX"
outputPvalFile <- "2_13_snpHwePvalfemaleXct.txt"
outputSNPfile <- "2_13_snpRemovedHweFemaleXct.txt"
removedSnpFemaleChrXhweControl(plink, inputPrefix=outputPrefix12,
pval=0.000001, outputPvalFile,
outputSNPfile, outputPrefix)
```
14. Calculate the PCA (with plots) for 2_13_removedSnpHweFemaleXct.# and remove the outliers from that dataset.
+ Primary output files: 2_14_removedOutliers.#
+ Output file with the IDs of the instances and the values of their eigenvectors from the PCA after removal of instances: 2_14_eigenvalAfterQC.txt
+ Output file with the plot of the first two PCs from the PCA before removal of instances: 2_14_eigenvalAfterQC.png
+ Output file with the plot of the first two PCs from the PCA after removal of instances: 2_14_removedOutliers.png
+ Output file with the IDs of the removed instances and the values of their eigenvectors, sorted ascending by the PC values: 2_14_eigenval4outliers.txt
```{r eval=FALSE}
inputPrefix <- "2_13_removedSnpHweFemaleX"
outputPC4subjFile <- "2_14_eigenvalAfterQC.txt"
outputPCplotFile <- "2_14_eigenvalAfterQC.png"
plotPCA4plink(gcta, inputPrefix, nThread=20, outputPC4subjFile, outputPCplotFile)
cutoff <- c(-0.4, 0.2)
cutoffSign <- "greater" ## not used if cutoff == NULL, and with two values
inputPC4subjFile <- "2_14_eigenvalAfterQC.txt"
outputPC4outlierFile <- "2_14_eigenval4outliers.txt"
outputPCplotFile <- "2_14_removedOutliers.png"
outputPrefix <- "2_14_removedOutliers"
removeOutlierByPCs(plink, gcta, inputPrefix, nThread=20,
cutoff, cutoffSign, inputPC4subjFile,
outputPC4outlierFile, outputPCplotFile, outputPrefix)
```
## Alignment with the reference
All files named in this subsection are in the directory 3-checkAlign/.
The function `checkAlign2ref()` performs the alignment against a given reference
panel by considering the following parameters: variant name, genomic position and
the allele presentation. Output files are generated in order.
### Input files
* PLINK files after quality control and population outlier detection: 2_14_removedOutliers.#
* Imputation reference files.
### Detailed steps
1. If the genome build of already QC-ed is different from the genome build of the imputation reference files, then lift the data from the first genome build to the second. If the genome build is the same, then just copy the input files.
+ Output files: 3_1_QCdata.#
2. Remove SNPs for which the name has a different position (i.e. combination of base pair position and chromosome) in the imputation reference files.
+ Primary output files: 3_2_removedSnpDiffNamePos.#
+ Output file with the SNPs names for which a different position is contained in the imputation reference files: 3_2_snpDiffNamePos.txt
3. Remove SNPs for which the combination of base pair position and chromosome is not contained in the imputation reference files (ignoring the SNP name).
+ Primary output files: 3_3_removedSnpMissPos.#
+ Output file with the SNPs names for which the combination of base pair position and chromosome is not contained in the imputation reference files: 3_3_snpMissPos.txt
4. Remove SNPs from which the combination of base pair position and chromosome (ignoring the SNP name) has an allele which is not in the
imputation reference files for that combination of base pair position and chromosome.
+ Primary output files: 3_4_removedSnpDiffAlleles.#
+ Output file with the removed SNP names: 3_4_snpDiffAlleles.txt
+ Output file with the retained SNPs: 3_4_snpImpRefAlleles.txt
### Processing example
```{r eval=FALSE}
renamePlinkBFile(inputPrefix="2_14_removedOutliers",
outputPrefix="3_1_QCdata", action="move")
inputFile <- paste0(impRefDIR,"*.legend.gz")
## To recode the intermediate disk usage
## keep in the same directory.
bimReferenceFile <- "bimImputeRef.txt"
.prepareLegend2bim(inputFile, referencePanel,
outputFile=bimReferenceFile, ncore=nCores)
inputPrefix <- "3_1_QCdata"
out2.snp <- "3_2_snpSameNameDiffPos"
out2 <- "3_2_removedSnpSameNameDiffPos"
out3 <- "3_3_removedSnpMissPos"
out3.snp <- "3_3_snpMissPos"
out4 <- "3_4_removedSnpDiffAlleles"
out4.snp <- "3_4_snpDiffAlleles"
out4.snpRetained <- "3_4_snpImpRefAlleles"
nCores <- detectCores()
checkAlign2ref(plink, inputPrefix, referencePanel, bimReferenceFile,
out2, out2.snp, out3, out3.snp,
out4, out4.snp, out4.snpRetained, nCore=nCores)
```
## Imputation
All files named in this subsection are in the directory 4-imputation/.
In this subsection, three major functions `removedMonoSnp()`, `phaseImpute()`
and `postImpQC()` are used for the purpose of
excluding monomorphic SNPs, phasing, imputation and post imputation QC.
### Input files
* PLINK genotyping data after QC and the alignment with the imputation reference
files: ../3-checkAlign/3_4_removedSnpDiffAlleles.#
* Imputation reference files.
### Detailed steps and processing examples
1. Remove monomorphic SNPs from the well aligned genotyping data.
+ Primary output files: 4_1_removedMonoSnp.#
+ Output file with the removed monomorphic SNPs: 4_1_snpMonoRemoved.txt
```{r eval=FALSE}
outputPrefix <- "4_1_removedMonoSnp"
outputMonoSNPfile <- "4_1_snpMonoRemoved.txt"
removedMonoSnp(plink, inputPrefix="3_4_removedSnpDiffAlleles",
outputPrefix, outputSNPfile=outputMonoSNPfile)
```
2. Use the imputation reference files to generate pre-phase haplotypes by SHAPEIT
and then do the imputation by using IMPUTE2. GTOOL is used for format conversion.
In the case of using IMPUTE4, QCTOOL is used instead of GTOOL.
+ Output files: 4_2_imputedDataset.#
+ Output file with the info scores of all SNPs, which consists of two columns, separated by whitespaces, the first the SNPs and the second the info scores: 4_2_snpImputedInfoScore.txt. This is also the output from last step: impute2infoUpdated.txt.
```{r eval=FALSE}
inputPrefix <- "4_1_removedMonoSnp"
outputPrefix <- "4_2_imputedDataset"
outputInfoFile <- "4_2_snpImputedInfoScore.txt"
tmpImputeDir <- paste0(imputeTool, "_", referencePanel)
nCores <- detectCores()
phaseImpute(inputPrefix, outputPrefix, autosome=TRUE,
plink, shapeit, imputeTool, impute, qctool, gtool,
windowSize=3000000, effectiveSize=20000,
nCore=nCores, threshold=0.9, outputInfoFile, SNP=TRUE,
referencePanel, impRefDIR, tmpImputeDir, keepTmpDir=TRUE)
```
3. Remove imputed SNPs with (info < 0.6) where the value info comes from the
files *.impute2_info from the imputation.
+ Output files: 4_3_wellImputeData.#
4. Remove all SNPs (if any) which have the same position as a SNP in 4_1_snpMonoRemoved.txt has in aligned instance data.
+ Output files: 4_4_removedMonoSnpAfter.#
5. Add previous identified monomorphic SNPs (if any) in 4_1_snpMonoRemoved.txt with their values from the lifted instance data.
+ Output files: 4_5_addedMonoSnpAfter.#
6. Remove SNPs which have a non missing value for less then 20 instances.
+ Primary output files: 4_6_removedSnpMissPostImp.#
+ Output file with the removed SNP names: 4_6_snpRemovedMissPostImp.txt
+ Output file with the remaining SNP names: 4_6_snpRetainMissPostImp.txt
```{r eval=FALSE}
inputPrefix <- "4_2_imputedDataset"
out1 <- "4_3_wellImputeData"
out2 <- "4_4_removedMonoSnpAfter"
out3 <- "4_5_addedMonoSnpAfter"
out4 <- "4_6_removedSnpMissPostImp"
outputInfoFile <- "4_2_snpImputedInfoScore.txt"
outputMonoSNPfile <- "4_1_snpMonoRemoved.txt"
prefixAlign2ref <- "3_4_removedSnpDiffAlleles"
outRemovedSNPfile <- "4_6_snpRemovedMissPostImp.txt"
outRetainSNPfile <- "4_6_snpRetainMissPostImp.txt"
postImpQC(plink, inputPrefix, out1, out2, out3, out4,
outputInfoFile, infoScore=0.6, outputMonoSNPfile,
prefixAlign2ref, missCutoff=20, outRemovedSNPfile,
outRetainSNPfile, referencePanel)
```
## Reduction and expansion
All files named in this subsection are in the directory 5-reductAndExpand/.
`reductExpand()` function extracts the genotypes from well imputed data so that
PLINK data with only SNPs before imputation are remained. On the basis of
extracted genotypes, a subset of non-imputed genotypes that are different from
the imputation reference panel are then added, including SNPs with different alleles,
missing positions, and different positions.
### Input files
* The imputed dataset:../4-imputation/4_6_removedSnpMissPostImp.#
* The SNPs before imputation:../3-checkAlign/3_4_snpImpRefAlleles.txt
* The file with the SNPs with different alleles:../3-checkAlign/3_4_snpDiffAlleles.txt
* The SNPs with missing positions: ../3-checkAlign/3_3_snpMissPos.txt
* The SNPs with different positions: ../3-checkAlign/3_2_snpSameNameDiffPos.txt
* The dataset before removing SNPs for the alignment with the imputation reference: ../3-checkAlign/3_1_QCdata.#
### Detailed steps
1. Firt of all, copy all the input files into the directory ./5-reductAndExpand/. Then do the reduction of the imputed dataset to the SNPs before imputation.
+ Output files: 5_1_reducedToSpecific.#
2. Add the SNPs (if any) with different alleles from the dataset before removing SNPs.
+ Output files: 5_2_specificDiffAllele.#
3. Add the SNPs (if any) with missing positions from the dataset before removing SNPs.
+ Output files: 5_3_specificMissPos.#
4. Add the SNPs (if any) with different positions from the dataset before removing SNPs.
+ Output files: 5_4_specificDiffPos.#
### Processing example
```{r eval=FALSE}
inputPrefix <- "4_6_removedSnpMissPostImp"
inputQCprefix <- "3_1_QCdata"
snpRefAlleleFile <- "3_4_snpImpRefAlleles.txt"
snpDiffAlleleFile <- "3_4_snpDiffAlleles.txt"
snpMissPosFile <- "3_3_snpMissPos.txt"
snpSameNameDifPosFile <- "3_2_snpSameNameDiffPos.txt"
dir5 <- "./5-reductAndExpand/"
## imputed dataset
system(paste0("scp ./4-imputation/", inputPrefix, ".* ", dir5))
system(paste0("scp ./3-checkAlign/", inputQCprefix, ".* ", dir5))
system(paste0("scp ./3-checkAlign/", snpRefAlleleFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpDiffAlleleFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpMissPosFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpSameNameDifPosFile, " ", dir5))
setwd(dir5)
reducedToSpecificfn <- "5_1_reducedToSpecific"
specificDiffAllelefn <- "5_2_specificDiffAllele"
specificMissPosfn <- "5_3_specificMissPos"
specificDiffPosfn <- "5_4_specificDiffPos"
reductExpand(plink, referencePanel, inputPrefix, inputQCprefix,
snpRefAlleleFile, snpDiffAlleleFile,
snpMissPosFile, snpSameNameDifPosFile,
reducedToSpecificfn, specificDiffAllelefn,
specificMissPosfn, specificDiffPosfn)
setwd("..")
```
## Final results
Copy and rename the results of interest to the directory ./6-finalResults.
The metadata information file, the final well imputed and a subset of imputed dataset (dataset-specific) are essential for the downstream analysis.
```{r eval=FALSE}
dir6 <- "./6-finalResults/"
system(paste0("scp ./1-genoUpdate/1_01_metaData.txt ", dir6))
system(paste0("scp ./4-imputation/4_6_removedSnpMissPostImp.* ", dir6))
system(paste0("scp ./5-reductAndExpand/5_4_specificDiffPos.* ", dir6))
setwd(dir6)
renamePlinkBFile(inputPrefix="4_6_removedSnpMissPostImp",
outputPrefix="imputedSnpsDataset", action="move")
renamePlinkBFile(inputPrefix="5_4_specificDiffPos",
outputPrefix="specificSnpsDataset", action="move")
```
# Extending pipelines
Gimpute's modular structure allows the integration of other existing imputation workflows, allowing users to choose their preferred imputation strategy. For demonstration, we have embedded Genipe as an external imputation tool [8].
## Integrate with Genipe
### Installation
To set up Genipe, please refer to Genipe's [installation](http://pgxcentre.github.io/genipe/installation.html#id3) page for more details.
### Additional configuration files
* Imputation reference panels
The imputation reference can be downloaded directly from IMPUTE2's reference page: [1000 Genome phase 3](https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html) or [Genipe's page](http://pgxcentre.github.io/genipe/tutorials/tutorial_genipe.html#data-in-more-details).
* The human reference files
This reference genome is used for the strand check, you can download from [Genipe's page](http://pgxcentre.github.io/genipe/tutorials/tutorial_genipe.html#data-in-more-details).
### Imputation with Genipe
1. Prepare additional configuration files.
```{r eval=FALSE}
## example
impRefDIR4genipe <- "../1000GP_Phase3/"
fastaFile <- "../hg19/hg19.fasta"
```
2. Imputing genotypes using Genipe.
```{r eval=FALSE}
chrs =22
inputPrefix <- "3_4_removedSnpDiffAlleles"
thread4impute2 <- 20 ## tune by yourself
thread4shapeit <- 30
segmentSize <- 3000000
imputeTool <- "impute2" ## only impute2
imputedByGenipe(chrs, impRefDIR4genipe, inputPrefix, shapeit, impute, plink, fastaFile, segmentSize, thread4impute2, thread4shapeit)
```
3. Merge chunk-specific region using Genipe.
```{r eval=FALSE}
chr <- 2
inputImpute2 <- 'chr2.33000001_36000000.impute2'
probability <- 0.9
completionRate <- 0.98
# info <- 0.6
outputPrefix <- paste0('imputedChr', chr)
mergeByGenipe(inputImpute2, chr, probability, completionRate, info, outputPrefix)
```
4. Extract imputed results chromosome-wise by Genipe.
```{r eval=FALSE}
## extract imputed markers using Genipe
chr <- 3
inputImpute2 <- paste0('chr', chr,'.imputed.impute2')
inputMAP <- paste0('chr', chr,'.imputed.map')
format <- 'bed'
prob <- 0.9
outputPrefix <- paste0('imputedChr', chr)
extractByGenipe(inputImpute2, inputMAP, outputPrefix, format, prob)
```
```{r eval=FALSE}
## sessionInfo()
# R version 3.5.0 (2018-04-23)
# Platform: x86_64-redhat-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)
# Matrix products: default
# BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
# attached base packages:
# [1] parallel stats graphics grDevices utils datasets methods
# [8] base
# other attached packages:
# [1] doParallel_1.0.11 iterators_1.0.10 foreach_1.4.4 lattice_0.20-35
# [5] Gimpute_0.99.9
# loaded via a namespace (and not attached):
# [1] compiler_3.5.0 codetools_0.2-15 grid_3.5.0
```
# References
[1] Marchini, J., et al. (2010). Genotype imputation for genome-wide association studies. Nat Rev Genet 11(7): 499-511.
[2] Marchini, J., et al. (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39(7): 906-913.
[3] Schizophrenia Working Group of the Psychiatric Genomics, C. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510): 421-427.
[4] Purcell, Shaun, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81.3 (2007): 559-575.
[5] Howie, B., et al. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44(8): 955-959.
[6] Howie, B. N., et al. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5(6): e1000529.
[7] Bycroft, Clare, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T. Elliott, Kevin Sharp, Allan Motyer et al. Genome-wide genetic data on~ 500,000 UK Biobank participants. BioRxiv (2017): 166298.
[8] Lemieux Perreault, L. P., et al. (2016). genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools. Bioinformatics, 32(23), 3661-3663.
# Trouble-shooting
If you have any trouble or comment ? --> Contact: junfang.chen@zi-mannheim.de