commit

ericreed · Mar 3, 2014 · 5cb544b · 5cb544b
1 parent 148ae5d
commit 5cb544b
Show file tree

Hide file tree

Showing 9 changed files with 1,558 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+.Rproj.user
+.Rhistory
+.RData
diff --git a/GWASPaper.Rproj b/GWASPaper.Rproj
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: knitr
+LaTeX: pdfLaTeX
diff --git a/GWASsteps-concordance.tex b/GWASsteps-concordance.tex
@@ -0,0 +1,4 @@
+\Sconcordance{concordance:GWASsteps.tex:GWASsteps.Rnw:%
+1 33 1 50 0 4 1 5 0 6 1 6 0 6 1 6 0 7 1 5 0 12 1 5 0 6 1 5 0 5 1 5 %
+0 5 1 5 0 5 1 5 0 24 1 6 0 6 1 5 0 10 1 5 0 11 1 5 0 9 1 5 0 14 1 %
+6 0 6 1 5 0 10 1 5 0 10 1 5 0 17 1 7 0 5 1}
diff --git a/GWASsteps.Rnw b/GWASsteps.Rnw
@@ -0,0 +1,211 @@
+\documentclass[11pt]{article}
+
+% margins, size, formatting
+
+\oddsidemargin=0in
+\evensidemargin=0in
+\topmargin=0in
+%\textwidth=6.5in
+%\textheight=9.5in
+\parindent = 0 in
+\pagestyle{plain}
+
+\usepackage{amsmath,amssymb,amsthm, amsfonts}
+\usepackage{array}
+\usepackage{fancyhdr}
+\usepackage{geometry}
+\pagestyle{fancy}
+\usepackage{graphicx}
+
+
+
+\lhead{\textbf{GWAS Analysis\\ February 2014}}
+\rhead{\textbf{Emily Ramos\\ Eric Reed}}
+\cfoot{}
+
+\begin{document}
+
+This document provides the steps necessary to run a Genome Wide Association Study (GWAS) analysis.
+\section{Introduction}
+\section{Data Files and Pre-Processing}
+
+To analyize the Plink files ($.bim$, $.fam$, $.bed$) in R, the "snpStats" package is required.
+
+<<eval=FALSE>>=
+source("http://bioconductor.org/biocLite.R")
+biocLite("snpStats")
+library(snpStats)
+@
+\subsection{Data Files}
+\subsection*{The .fam File}
+The .fam file is a matrix that identifies each study participant. It contains 6 columns which identify each individual by ``Family ID Number", ``Sample ID Number",``Paternal ID Number", ``Maternal ID Number", ``Sex", and ``Phenotype".  There are $n$ rows, one for each individual. Note that not all of these columns may contain information depending on the nature of the data collected and the strategies for analysis. In the example to follow, a separate phenotype file will be read in, containing quantified phenotype data.
+<<eval=FALSE>>=
+genoFam<-read.table("genodata.fam");colnames(genoFam) = c("FamID","IndID","PatID","MatID","sex","phenotype")
+@
+\subsection*{The .bim File}
+The .bim file is a matrix that provides information about each SNP in our study.  There are 6 colums that contain information for the ``Chromosome Number", ``rsNumber", ``Genetic Distance", ``Position ID", ``Allele 1", ``Allele 2".  There are $n$ rows, one for each SNP in the analysis.
+<<eval=FALSE>>=
+genoBim <- read.table("genodata.bim");colnames(genoBim) = c("chr","SNP","GenDist","BPPos","g1","g2")
+genoBim$SNP <- as.character(genoBim$SNP)
+@
+\subsection*{The .bed File}
+The .bed file is a matrix that contains the genotypic data in binary format. It's the result of the conversion of .ped and .map files, by the PLINK software. This is the largest of the three files because it contains every SNP in the study, as well as the genotype at this snp for each individual. In order to interpret the binary data, we need to employ the ``read.plink()" function, from the ``snpStats" package, to read in the .fam, .bim, and .bed functions together, and to interpret the binary data in the .bed file and put it into the proper format.  The output will be the``genotype", list object, which contains a matrix $p$ columns, one for each SNP by ``rsNumber", and $n$ rows, one for each study participant by ``Family ID Number", containing the genotype of each study participant, at each individual SNP, as either ``01", ``02", and ``03", stored as ``raw" data.
+
+
+<<eval=FALSE>>=
+geno<-read.plink("genodata.bed","genodata.bim","genodata.fam")$genotype
+@
+Our final genotype (``genoNum") matrix contains the genotype for each individual at each individual SNP, stored as either the number of Major or Minor alleles, depending on how the data was originally formatted.  For example, if the data was formatted to count the number of major alleles, the meaning of the genotype data is coded as follows:
+
+\begin{itemize}
+  \item 0 $=$ Homozygous Recessive
+  \item 1 $=$ Heterozygous
+  \item 2 $=$ Homozygous Dominant
+\end{itemize}
+
+Make a table one of the columns to make sure the data is formatted correctly.
+<<eval=FALSE>>=
+table(genoNum[,1])
+@
+\subsection*{The Phenotype File}
+The phenotype file contains a matrix, with the phenotype variables as columns and the participant IDs as the unit of observation.
+
+<<eval=FALSE>>=
+pheno<-read.csv("phenopathway.csv")
+@
+
+Only phenotypes of patients who have genotype data should be included in the genotype file.
+<<eval=FALSE>>=
+phenoSub<-pheno[pheno$rpid%in%geno.fam$FamID,]
+@
+
+In order to merge the princple components with the phenotype file, we must first attach participant IDs to the Princple Components.
+<<eval=FALSE>>=
+pcs<-data.frame(rpid=genoFam$FamID,pcs)
+@
+
+We can now merge the two sets of data.
+<<eval=FALSE>>=
+phenoSub<-merge(phenoSub,pcs,by.x="rpid",by.y="rpid",all.x=TRUE)
+@
+
+\subsection{SNP Level Filtering}
+The ``snpStats" package will allow us to perform analysis on the raw ``geno" file.  We will use the ``col.summary()" function, which will produce a matrix with values for a variety of summary analyses at each SNP.
+This we will allow us to filter our data based on a certain set criteria.  For this study we have chosen to filter our data as follows: 
+\subsubsection*{Call Rate}
+The call rate is the most straight forward of the three criteria, as it means to only keep that have genotype data above a certain threshold. In the following example we chose a call rate of 90 \%.  This is essentially stating that we only want to keep the SNPs for which there is data for more than or equal 90\% of the study particpants, or for which there is missing data for less than or equal to 10\% of the study participants.
+\subsubsection*{Minor Allele Frequency}
+If every study participant is homozygous at a given SNP, than there is nothing to infer about the relationship between genotype and phenotype, as each participant has the same genotype at this SNP.  This is a common occurence with the major allele, as it is often much more common in a given population.  Therefore, we it is important to keep only the SNPs for which we have a certain proportion of minor alleles present in the study sample.  In the example to follow we will use a Minor Allele Frequency, that will remove SNPs for which the minor allele frequency is less than 1 \%.
+\subsubsection*{Hardy-Weinberg Equilibrium}
+The Hardy-Weingberg Equilibrium is based off the principal that if two alleles are possible for a given genotype, then the sum of the probabilities of recieving each allele is 1.
+\centerline{$p+q=1$}
+Given the frequency of a certain allele we can, therfore, not only predict the frequency of the other allele, but also the proportion of homozygous recessive, heterozygous, and homozygous dominant individuals in a population. If violated can assume that there was a possible genotyping error, and we must remove the data for this SNP. We can't expect our samples to exhibit this principle perfectly, however we can calculate the p-value that this principle is being followed, and set a threshold to keep only data with Hardy-Weinberg equlibrium p-values below a certain p-value. In the example to follow we have chosen a p-value of 0.001.
+
+
+<<eval=FALSE>>=
+call<-0.9
+minor<-0.01
+hardy<-0.001
+snpsum.col <- col.summary(geno)
+use <-  (!is.na(snpsum.col$MAF)&snpsum.col$MAF > minor) & (!is.na(snpsum.col$z.HWE)&snpsum.col$z.HWE^2 < qnorm(hardy/2)^2) & snpsum.col$Call.rate >= call #This creates a Boolean Vector of "TRUE" and "FALSE"
+use[is.na(use)] <- FALSE 
+genoBim <- genoBim[use,] #This will filter the .bim file
+geno <- geno[,use] #This will filter our genotype file.
+@
+
+\subsubsection*{Formatting the Genotype File}
+Once we have filtered our data using the formatting necessary for the ``snpStats" package we will need to convert our genotype file into a matrix in numeric format.
+<<eval=FALSE>>=
+genoNum<-as(geno,"numeric")
+@
+
+\subsection{Principal Components}
+Principal Component (PC) Analysis is performed on the genotype data and the first ten PCs are retained as predictors in the model to account for population substructure.
+\\ 
+The PC analysis begins by calculating the genetic covariance matrix, correlation coefficients between sample loadings (*not sure what sample loadings are) and genotypes for each SNP, SNP eigenvectors (loadings) and estimate the sample loadings from specified SNP eigenvectors.
+\\ These ten principal components are retained in a vector ``pcs". In later steps, this vector will be merged into the phenotype file and used for adjusting our phenotype variable when performing linear regression for each SNP.
+
+<<eval=FALSE>>=
+num.princ.comp<-10
+@
+<<eval=FALSE>>=
+xxmat <- xxt(geno, correct.for.missing = FALSE)
+evv <- eigen(xxmat, symmetric = TRUE)
+pcs <- evv$vectors[,1:num.princ.comp]
+evals <- evv$values[1:num.princ.comp]
+btr <- snp.pre.multiply(geno, diag(1/sqrt(evals)) %*% t(pcs))
+pcs <- snp.post.multiply(geno, t(btr))
+colnames(pcs) <- paste("pc",1:num.princ.comp,sep="")
+save.image("/project/cricsnp/greg.files/CRIC_GWAS_white.RData")
+rm(xxmat)
+@
+
+\section{GWAS Analysis}
+
+To run the linear regression on our data, we first need to install the ``parallel" package to be able to run a parallel analysis, an analysis on multiple cores.
+l
+<<load, eval=FALSE>>=
+library(parallel)
+load("/home/ramoser/correct_phase3.Rdata")
+@
+
+We create a GWAS function that will run our analysis. This function takes a subset of our data ``tempSNP" of ID numbers and RS Numbers. The linear regression model compares our filtered and merged dataset (in this case we are running an analysis on our baseline data) and adjusting for the variables of interest: 
+The ``a" matrix contains the SNP genotype on the x-axis and the phenotype on the y-axis. This function then maps the regression model for the filtered, merged dataset (in this case we are running an analysis on our baseline data) and adjusting for the variables of interest: Age, Sex, Race, the ten PCs and SNPs. The ``out" matrix pulls just the row that contains the estimate for the SNP from the linear model summary. 
+
+<<Function, eval=FALSE>>=
+ldlGWAS<-function(rsNumber){
+  print(rsNumber)
+  tempSNP <- data.frame(rpid=row.names(genoNum),snp=genoNum[,rsNumber])
+  dat <- merge(phenoSub,tempSNP,by.x="rpid",by.y="rpid",all.x=TRUE)
+  a <- summary(lm(fldl_wk0 ~ age + sex + raceth + pc1 + pc2 + pc3 + pc4 + pc5 + pc6 + pc7 + pc8 + pc9 + pc10 + snp, data=dat))
+  out <- as.matrix(a$coefficients['snp',])
+  out
+}
+@
+
+Next, we create a column of Genotype Numbers and remove SNP's that cause errors in our analysis.
+<<rsVec, eval=FALSE>>=
+rsVec <- as.matrix(colnames(genoNum))
+rsVec <- rsVec[-c(834037,834038,834039,834040,834042, 834043,834046),]
+@
+
+We use ``mcapply()" to run parallel analysis on 8 different cores to save time. Since ``mcapply()" must analyze a list, we first make ``rsVec" into a list. After running the analysis, we put the data back together.
+
+<<aa, eval=FALSE>>=
+rsVec <- as.list(rsVec)
+start <- Sys.time()
+aa <- mclapply(rsVec,ldlGWAS,mc.cores=8)
+out <- do.call(cbind,aa)
+fldl_wk0_p3 <- data.frame(t(out))
+@
+
+
+After analysis, We have a new dataset that we write to a file that contains rsNumber, Estimate, Standard Error, Zvalues and Pvalues
+
+<<data, eval=FALSE>>=
+names(fldl_wk0_p3) <- c("Estimate","SE","Zvalue","Pvalue")
+fldl_wk0_p3$rsNumber <- rsVec
+fldl_wk0_p3 <- as.matrix(fldl_wk0_p3)
+write.csv(fldl_wk0_p3,"/home/ramoser/fldl_wk0_p3")
+@
+
+We can now sort the results by p-value to find our strongest associations.
+
+%<<pvalues>>=
+%fldlwk0<-read.csv("/Users/Eric/Desktop/RA/fldlwk0GWAS")
+%wk0bypval<-fldlwk0[order(fldlwk0$Pvalue),]
+%wk0bypval<-wk0bypval[,-c(1,3,12)] #This is to remove numbered columns added by saving as a ".csv" file
+%wk0bypval[1:10,]
+%@
+
+From this GWAS we can see a high level of association between baseline LDL and snp rsNumber, rs7412, on the APOE gene, with a p-value of 2.69e-12.
+
+*I'm not sure how to finish.  I think we should go on to discuss the shear number of analysis we are running and how the Bonferroni correction returns p-value threshold of:
+<<bonferoni>>=
+0.05/834279
+
+@
+If we were to only go by this number we'd only associate high significance to rs7412.  This is where we could point to other analysis such as MixMAP.
+
+
+\end{document}