R/tut_post.Rmdp

---
title: "GWA analysis and post-analytic interrogation with R"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, warning=FALSE)
```

```{r "code9.r"/code9, echo=FALSE, message=FALSE}
```

## Post-analytic visualization and genomic interrogation

We now have generated and fit both typed and imputed genotypes.  The
next step is to combine the results, and isolate just those SNPs in
our region of interest.  Following similar steps as for imputed SNPs,
the typed SNPs are loaded from a file generated by the `GWAA`
function. We follow similar steps to attach chromosome and position to
each SNP, order by significance, and take $-log_{10}$ of the p-value.

```{r "code9.r"/code9-a}
```

#### Isolate CETP-specific SNPs

The two tables of typed and imputed genotypes are combined into a
single table.  In addition, we also concatenate just the SNPs near
CETP and display them all here.

```{r "code9.r"/code9-b}
```

```{r "code9.r"/code9-end}
```

### Visualization and QC 

```{r "code10.r"/code10, echo=FALSE, message=FALSE}
```

Several plots allow us both to visualize the GWA analysis findings
while performing quality control checks.  Specifically, we are
interested in identifying data inconsistencies and potential systemic
biases.

### Manhattan plot

Manhattan plots are used to visual GWA significant results by
chromosome location.  We will call the `GWAS_Manhattan` function to
plot $-log_{10}$ of the p-value against SNP position across the
entire set of typed and imputed SNPs.  The plot will show two
horizontal lines.  The higher of the two is the commonly used
"Bonferroni" adjusted significance cut-off of $-log_{10}(5 \times
10^{-8})$, while the lower is less stringent ("Candidate") cut-off of $-log_{10}(5
\times 10^{-6})$.  Typed and imputed SNPs will be represented by black
and blue, respectively. We label the typed SNPs with signals that have surpassed the less stringent cutoff.


```{r "GWAS_ManhattanFunction.R"/manhattan}
```

```{r "code10.r"/code10-a, fig.align='center', fig.width=10, fig.height=7}
```

### Quantile-quantile plots and the $\lambda$-statistic ##

Q-Q plots are used to visualize the relationship between the expected
and observed distributions of SNP level test statistics.  Here we
compare these statistics for the unadjusted model (left) compared with
the model adjusted for confounders by incorporating the first ten
principal components along with clinical covariates.

A new set of models is generated with only the phenotype (HDL) and no
additional factors.  The results are plotted using the **GenABEL**
package's `estlambda` function.

```{r "code10.r"/code10-b, fig.align='center'}
```

We see here that the tail of the distribution is brought closer to the
y=x line after accounting for confounding by race/ethnicity in the
modeling framework.  If the data in this figure were shifted up or
down from the $y=x$ line, then we would want to investigate some form
of systemic bias.  The degree of deviation from this line is measured
formally by the $\lambda$-statistic, where a value close to 1 suggests
appropriate adjustment for the potential admixture.  A slight
deviation in the upper right tail from the $y=x$ line suggests crudely
that some form of association is present in the data.  There is only a
slight improvement in $\lambda$ between the unadjusted model and the
model with PCs indicating that the population is relatively
homogenous.

### Heatmap

Heatmaps are typically used in the context of GWA analysis to
visualize the linkage disequilibrium pattern between significant SNPs
other SNPs in nearby regions.  Here we include our most significant
SNP from our analysis and other SNPs near CETP.  The darker shading
indicates higher LD.  The plot also includes $-log_{10}(p)$ values to
illustrate their connection with physical location and LD.

```{r "code10.r"/code10-c, fig.align='center', fig.height=8, fig.width=8, fig.keep='last', message=FALSE}
```

### Regional Association

Similar to the LD heatmap above, a regional association plot allows for visualization of SNP-wise signal accross a segment of a particular chromsome with the pairwise correlation between SNPs. However regional assoication plots typically show a  larger window of the genome. Therefore, for plot legibility, LD calculations to be displayed can be selected based on pairwise SNP proximity and minimum LD. In this example we demonstrate a regional plot create by the `regionplot` function from **postgwas**.  This function can use HapMap data downloaded from [Ensembl](http://www.ensembl.org/index.html), for LD calculations.  By default it will use the most recent [Genome Reference Consortium](http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) human genome build. Therefore, if we wish to use build GRCh37 (hg19) we will have to  create a custom `biomartConfigs` object to retrieve the appropriate data.

```{r "code10.r"/code10-d, fig.align='center', message=FALSE}
```