-
Notifications
You must be signed in to change notification settings - Fork 9
/
tut_preproc.Rmdp
201 lines (155 loc) · 7.41 KB
/
tut_preproc.Rmdp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
---
title: "GWA analysis and post-analytic interrogation with R"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, warning=FALSE)
```
## Data pre-processing
```{r "code1.r"/code1, echo=FALSE}
```
For this tutorial we use genotype data files formatted for use with
[PLINK](http://pngu.mgh.harvard.edu/~purcell/plink/) software. We
utilize the function, `read.plink` from **snpStats**, which allows the
reading in of data formatted as *.bed*, *.bim*, and *.fam* files. The
*.bed* file contains the genotype information, coded in binary. The
*.bim* file contains information for each SNP with a respective column
for each of the following information: chromosome number, SNP name
(typically an rs #), genetic distance (not necessary for this
tutorial), chromosomal position, identity of allele 1, and identity of
allele 2. The assignment of allele 1 and allele 2, is related to the
effect allele, or the allele that is being counted when we assign a
numeric value to a genotype. This is typically assigned based on
allele frequency, though not always. In this tutorial, allele 1
pertains to the minor, or less common allele. Lastly, the *.fam* file
contains information for each samples with a respective column for
each of the following information: family ID (this will be used to
identify each sample when read into R), individual ID, paternal ID,
maternal ID, sex (coded as 1 = male, 2 = female), and phenotype. In
this tutorial we utilize a supplemental clinical file for outcome
variables and additional covariates.
Alternatively, similar genotype information can also be formatted for
PLINK software as *.ped* and *.map* files. The information of the
*.ped* file can be thought of as a combination of the *.bed* and
*.fam* files. It is a large table with the first six columns identical
to a *.fam* file, followed by a columns containing the genotype data
for each SNP. The *.map* file contains the first four
columns of the *.bim* file, without the allele assignments. These
files can be read in using the function, `read.pedfile`, from
**snpStats**. More information about the formatting of these files
can be found on the PLINK website.
```{r "code1.r"/code1-a, message=FALSE}
```
The `geno` object contains a `genotype` member of type `SnpMatrix`
where each column is a SNP and each row is a sample. For convenience,
we assign that to the object, `genotype`. Within this object
individual genotypes are assigned in the `SnpMatrix` specific `RAW`
format. As a result, many functions dependent on the **snpStats** package specifically require `SnpMatrix` objects, while more basic functions will require conversion to other formats. `geno` also contains the SNP information from the *.bim* file within the `map` member that we assign to `genoBim`.
```{r "code1.r"/code1-b}
```
Supplemental clinical data is found in a corresponding CSV file for
each sample. It contains a column for the sample ID (Family ID in the
*.fam* file) and a respective column for each of the following
variables: coronary artery disease status (coded as 0 = control and 1
= affected), sex (coded as 1 = male, 2 = female), age (years),
triglyceride level (mg/dL), high-density lipoprotein level (mg/dL),
low-density lipoprotein level (mg/dL).
```{r "code1.r"/code1-c}
```
We filter the genotype data to only include samples with corresponding
clinical data by indexing the `genotype` object using only row names
that match the sample IDs.
```{r "code1.r"/code1-d}
```
```{r "code1.r"/code1-end}
```
### SNP level filtering
```{r "code2.r"/code2, echo = FALSE}
```
Once the data is loaded, we are ready to remove SNPs that fail to meet
minimum criteria due to missing data, low variability or genotyping
errors. **snpStats** provides functions, `col.summary` and
`row.summary`, that return statistics on SNPs and samples,
respectively.
```{r "code2.r"/code2-a}
```
Using these summary statistics, we keep the subset of SNPs that meet
our criteria for minimum call rate and minor allele frequency.
```{r "code2.r"/code2-b}
```
```{r "code2.r"/code2-end}
```
### Sample level filtering
```{r "code3.r"/code3, message=FALSE}
```
The second stage of data pre-processing involves filtering samples,
i.e. removing individuals due to missing data, sample contamination,
correlation (for population-based investigations), and racial/ethnic or
gender ambiguity or discordance. In our study, we address these
issues by filtering on call rate, heterozygosity, cryptic relatedness
and duplicates using identity-by-descent, and we visually assess
ancestry.
### Basic sample filtering
Sample level quality control for missing data and heterozygosity is
achieved using the `row.summary` function from **snpStats**. An
additional heterozygosity F statistic calculation is carried out with
the form, $|F|=(1-O/E)$, where $O$ is observed proportion of
heterozygous genotypes for a given sample and $E$ is the expected
proportion of heterozygous genotypes for a given sample based on the
minor allele frequency across all non-missing SNPs for a given
sample.
```{r "code3.r"/code3-a, cache.lazy=FALSE, cache.vars='snpsum.row', message=FALSE}
```
We apply filtering on call rate and heterozygosity, selecting only
those samples that meet our criteria.
```{r "code3.r"/code3-b}
```
### IBD analysis
In addition to these summary statistics, we also want to filter on
relatedness criteria. We use the **SNPRelate** package to perform
identity-by-descent (IBD) analysis. This package requires that the
data be transformed into a *GDS* format file. IBD analysis is
performed on only a subset of SNPs that are in linkage equilibrium by
iteratively removing adjacent SNPs that exceed an LD threshold in a
sliding window using the `snpgdsLDpruning` function.
```{r "code3.r"/code3-c}
```
The `snpgdsIBDMoM` function computes the IBD coefficients using method
of moments. The result is a table indicating kinship among pairs of
samples.
```{r "code3.r"/code3-d}
```
Using the IBD pairwise sample relatedness measure, we iteratively
remove samples that are too similar using a greedy strategy in which
the sample with the largest number of related samples is removed. The
process is repeated until there are no more pairs of samples with
kinship coefficients above our cut-off.
```{r "code3.r"/code3-e}
```
### Ancestry
To better understand ancestry, we plot the first two principal
components of the genotype data. Principal component calculation is
achieved via the `snpgdsPCA` function from **SNPRelate**. It is
important to note that in this example we are reasonably confident
that our samples are homogeneous, coming from european
ancestry. Therefore, given that there are no clear outliers, we fail to
remove any samples.
```{r "code3.r"/code3-f, fig.align='center'}
```
```{r "code3.r"/code3-end}
```
### SNP Filtering - HWE filtering on control samples
```{r "code4.r"/code1, echo=FALSE}
```
Finally, once samples are filtered, we return to SNP level filtering
and apply a check of Hardy-Weinberg equilibrium. Rejection of
Hardy-Weinberg equilibrium can be an indication of population
substructure or genotyping errors. Given that we are performing a
statistical test at every SNP, it is common to use a relatively
lenient cut-off. In this example we only remove SNPs with p-values,
corresponding to the HWE test statistic on CAD controls, of less than
$1 \times 10^{-6}$. We only test HWE on CAD controls due to possible violation of HWE caused by disease association.
```{r "code4.r"/code4-a}
```
```{r "code4.r"/code4-end}
```