Skip to content

Commit

Permalink
edits to data files and spell checking
Browse files Browse the repository at this point in the history
  • Loading branch information
ericreed committed Apr 16, 2014
1 parent c30ca65 commit db1d0b8
Show file tree
Hide file tree
Showing 6 changed files with 63 additions and 60 deletions.
4 changes: 2 additions & 2 deletions GWASsteps-concordance.tex
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
\Sconcordance{concordance:GWASsteps.tex:GWASsteps.Rnw:%
1 35 1 50 0 4 1 5 0 7 1 6 0 6 1 6 0 7 1 5 0 7 1 5 0 5 1 5 0 6 1 4 %
0 29 1 7 0 7 1 5 0 12 1 5 0 10 1 3 0 4 1 5 0 10 1 5 0 7 1 5 0 55 1 %
1 35 1 50 0 4 1 5 0 9 1 6 0 6 1 6 0 7 1 5 0 7 1 5 0 5 1 5 0 6 1 4 %
0 28 1 6 0 7 1 5 0 12 1 5 0 10 1 3 0 4 1 5 0 10 1 5 0 7 1 5 0 55 1 %
6 0 9 1 5 0 10 1 5 0 11 1 5 0 14 1 4 0 13 1 5 0 29 1 10 0 18 1 7 0 %
14 1 7 0 10 1 5 0 7 1 8 0 5 1}
25 changes: 13 additions & 12 deletions GWASsteps.Rnw
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ This document provides the steps necessary to run a Genome Wide Association Stud
\section{Introduction}
\section{Data Files and Pre-Processing}

To analyize the Plink files ($.bim$, $.fam$, $.bed$) in R, the "snpStats" package is required.


<<eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
Expand All @@ -40,19 +40,21 @@ library(snpStats)
@
%set working directory?
\subsection{Data Files}
\subsection*{The .fam File}
The .fam file is a matrix that identifies each study participant. It contains 6 columns which identify each individual by ``Family ID Number", ``Sample ID Number",``Paternal ID Number", ``Maternal ID Number", ``Sex", and ``Phenotype". There are $n$ rows, one for each individual. Note that not all of these columns may contain information depending on the nature of the data collected and the strategies for analysis. In the example to follow, a separate phenotype file will be read in, containing quantified phenotype data.
In order to perform a genome wide association study, it is essential to obtain genotype data, properly formatted for analysis. In this example, the "snpStats" package is utlized. The ``snpStats" package, is an update of the ``snpMatrix" package, and creates a ``snpMatrix" object. It is freely available in R, and contains functions for reading in ``.fam", ``.bed", and ``.bim" files. These file types, are created from ``.ped" and ``.map" files using, PLINK, an open source software package written in C/C++. ``.ped" files contain participant identification information, as well as the genotype data for each participant, coded as the first letter of the nucleotide (A,C,T or G).``.map" files contain the rsNumber and information regarding the location of each SNP. Upon conversion, the ``.fam" files will contain the participant identification information,``.bed" files will contain a binary version of the genotype data, and the ``.bim" files will contain the same information as the ``.map" files, as well as the nucleotide identifiers (A, C, T, or G) from the ``.ped", files.

\subsection*{.fam File}
The .fam file is a 6 column matrix which identifies each participant by ``Family ID Number", ``Sample ID Number",``Paternal ID Number", ``Maternal ID Number", ``Sex", and ``Phenotype". There is a row for each individual. Note that not all of these columns may contain information depending on the nature of the data collected and the type for analysis. As is often the case when when analyzing quanitfied traits, in the example to follow a separate phenotype file will be read in.
<<eval=FALSE>>=
genoFam<-read.table("genodata.fam");colnames(genoFam) = c("FamID","IndID","PatID","MatID","sex","phenotype")
@
\subsection*{The .bim File}
The .bim file is a matrix that provides information about each SNP in our study. There are 6 colums that contain information for the ``Chromosome Number", ``rsNumber", ``Genetic Distance", ``Position ID", ``Allele 1", ``Allele 2". There are $n$ rows, one for each SNP in the analysis. By default, the rsNumbers are read in as factor variables, and will be changed to character variables.
\subsection*{.bim File}
The ``.bim" file is a matrix that provides information about each SNP in our study. There are 6 colums that contain information for the ``Chromosome Number", ``rsNumber", ``Genetic Distance", ``Position ID", ``Allele 1", ``Allele 2". There is a row for each SNP in the analysis. By default, the rsNumbers are read in as factor variables, and will be changed to character variables.
<<eval=FALSE>>=
genoBim <- read.table("genodata.bim");colnames(genoBim) = c("chr","SNP","GenDist","BPPos","g1","g2")
genoBim$SNP <- as.character(genoBim$SNP)
@
\subsection*{The .bed File}
The .bed file is a matrix that contains the genotypic data in binary format. It's the result of the conversion of .ped and .map files by the PLINK software. This is the largest of the three files because it contains every SNP in the study, as well as the genotype at this snp for each individual. In order to interpret the binary data, we need to employ the ``read.plink()" function, from the ``snpStats" package, to read in the .fam, .bim, and .bed functions together, and to interpret the binary data in the .bed file and put it into the proper format. The ``snpStats" package is an update of the ``snpMatrix" package, and creates a ``snpMatrix" object. We will pull out the ``genotype" slot of this object, which contains the genotype data, stored as a matrix of $p$ columns, one for each SNP by ``rsNumber", and $n$ rows, one for each study participant by ``Family ID Number", containing the genotype of each study participant at each individual SNP, as either ``01", ``02", or ``03".
\subsection*{.bed File}
As previously mentioned, the ``.bed" file contains the genotypic data in binary format. This is the largest of the three files because it contains every SNP in the study, as well as the genotype at this snp for each individual. In order to interpret the binary data, we need to employ the ``read.plink()" function from the ``snpStats" package, to read in the ``.fam", ``.bim", and ``.bed" functions together. We will pull out the ``genotype" slot of this object, which contains the genotype data, stored as a matrix with a column for each SNP, and a row for each study participant by ``Family ID Number", containing the genotype of each study participant at each individual SNP, as either ``01", ``02", or ``03".


<<eval=FALSE>>=
Expand Down Expand Up @@ -90,17 +92,16 @@ The Hardy-Weingberg Equilibrium is based off the principle that if two alleles a
$$p+q=1$$
where $p$ and $q$ represent the probabilities of either allele. Therefore, given the frequency of a certain allele, we should not only be able predict the frequency of the other allele, but also the proportion of homozygous major, heterozygous, and homozygous dominant individuals in a given population using the formula:
$$p^2 +2pq+q^2=1$$
Here, the proportion of the homozygous major genotype is predicted as $p^2$, the hetrozygous genotype is predicted as $2pq$, and the proportion of the homozygous recessive genotype is predicted as $q^2$. Since we cannot expect the observed values of each genotype to precisely match the exected values in our sample, a $\chi^2$-test statistic is used to measure the departure between the expected number of a given genotype and the observed number of a given genotype to obtain a p-value. If the p-value is above a certain cut-off, we assume that the allele frequencies comply with Hardy-Weinberg Equilibrium.
Here, the proportion of the homozygous major genotype is predicted as $p^2$, the hetrozygous genotype is predicted as $2pq$, and the proportion of the homozygous recessive genotype is predicted as $q^2$. Since we cannot expect the observed values of each genotype to precisely match the expected values in our sample, a $\chi^2$-test statistic is used to measure the departure between the expected number of a given genotype and the observed number of a given genotype to obtain a p-value. If the p-value is greater than a certain cut-off, we assume that the allele frequencies comply with Hardy-Weinberg Equilibrium, and therefore the stringency of Hardy-Weinberg filtering is enhanced by larger p-value cutoffs.

Though Hardy-Weinberg Equilibrium can be violated via population admixture and stratification, and in such cases we would still consider the associations between genotype and phentoype to be informative, they may also occur as a result of genotyping errors. Therefore, it is common prectice to remove the data for SNPS that violate Hardy-Weinberg Equilibrium. In the following example we have chosen a cutoff p-value of 0.001. Since we are dealing with multiple comparisons, it is usefule to then divide the p-value by a certain value. A conservative value here would be the total number of SNPs, however in this example we will use a somewhat generous value of 1000.
Though Hardy-Weinberg Equilibrium can be violated via population admixture and stratification, and in such cases we would still consider the associations between genotype and phenotype to be informative, they may also occur as a result of genotyping errors. Therefore, it is common practice to remove the data for SNPS that violate Hardy-Weinberg Equilibrium. In this example we will use a commonly used cutoff of $p<0.001$.

<<eval=FALSE>>=
call<-0.9
minor<-0.01
hardy<-0.001
comp<-1000
snpsum.col <- col.summary(geno)
use <- (!is.na(snpsum.col$MAF)&snpsum.col$MAF > minor) & (!is.na(snpsum.col$z.HWE)&snpsum.col$z.HWE^2 < qnorm((hardy/2)/comp)^2) & snpsum.col$Call.rate >= call #This creates a Boolean Vector of "TRUE" and "FALSE"
use <- (!is.na(snpsum.col$MAF)&snpsum.col$MAF > minor) & (!is.na(snpsum.col$z.HWE)&snpsum.col$z.HWE^2 < qnorm((hardy/2))^2) & snpsum.col$Call.rate >= call #This creates a Boolean Vector of "TRUE" and "FALSE"
use[is.na(use)] <- FALSE
genoBim <- genoBim[use,] #This will filter the .bim file
geno <- geno[,use] #This will filter our genotype file.
Expand Down Expand Up @@ -161,7 +162,7 @@ phenoSub<-merge(phenoSub,pcs,by.x="FamID",by.y="FamID",all.x=TRUE)

\subsection{Fitting the Model}

Multiple methods are used to analyse GWAS data: Linkage analysis, Admixture mapping and Assosciation anaylysis. since we have a high number of SNPs and these subjects are assumed not to be correlated, we will be conducting an Association analysis. Genetic models are used to incorporate the interaction between alleles on homolougous chromosomes. Genetic models commonly used are Additive Models, Dominant Models and Recessive models. \\
Multiple methods are used to analyse GWAS data: Linkage analysis, Admixture mapping and Association analysis. since we have a high number of SNPs and these subjects are assumed not to be correlated, we will be conducting an Association analysis. Genetic models are used to incorporate the interaction between alleles on homolougous chromosomes. Genetic models commonly used are Additive Models, Dominant Models and Recessive models. \\

\textsc{Additive}\\
Additive models are used to evaluate additive structure and reveal associations that depend additively based on the allele classification. These models assume that if having a single minor allele will increase the quantitative trait we are interested in, $y$, by $\beta$, then having two minor alleles in the homologs will increase $y$ by $2\beta$. \\
Expand Down
68 changes: 35 additions & 33 deletions GWASsteps.log
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (MiKTeX 2.9) (preloaded format=pdflatex 2013.10.17) 2 APR 2014 07:42
This is pdfTeX, Version 3.1415926-2.4-1.40.13 (MiKTeX 2.9) (preloaded format=pdflatex 2013.10.17) 16 APR 2014 08:28
entering extended mode
**GWASsteps.tex
(C:\Users\Eric\Desktop\RA\GWASPaper\GWASsteps.tex
Expand Down Expand Up @@ -278,27 +278,18 @@ LaTeX Font Info: ... okay on input line 78.
* \@reversemarginfalse
* (1in=72.27pt=25.4mm, 1cm=28.453pt)

LaTeX Font Info: Try loading font information for U+msa on input line 84.
("C:\Program Files (x86)\MiKTeX 2.9\tex\latex\amsfonts\umsa.fd"
File: umsa.fd 2013/01/14 v3.01 AMS symbols A
)
LaTeX Font Info: Try loading font information for U+msb on input line 84.

("C:\Program Files (x86)\MiKTeX 2.9\tex\latex\amsfonts\umsb.fd"
File: umsb.fd 2013/01/14 v3.01 AMS symbols B
)
Package color Info: Redefining color shadecolor on input line 87.
LaTeX Font Info: Font shape `OT1/cmtt/bx/n' in size <10.95> not available
(Font) Font shape `OT1/cmtt/m/n' tried instead on input line 89.
Package color Info: Redefining color shadecolor on input line 101.
Package color Info: Redefining color shadecolor on input line 103.

Overfull \hbox (12.64433pt too wide) in paragraph at lines 104--104
Overfull \hbox (12.64433pt too wide) in paragraph at lines 106--106
[][]\OT1/cmtt/m/n/10.95 colnames[][](genoFam)[] []=[] []c[][]([][]"FamID"[][],[
] []"IndID"[][],[] []"PatID"[][],[] []"MatID"[][],[] []"sex"[][],[] []"phenotyp
e"[][])[][]
[]

Package color Info: Redefining color shadecolor on input line 112.
Package color Info: Redefining color shadecolor on input line 114.

Package Fancyhdr Warning: \headheight is too small (12.0pt):
Make it at least 25.2842pt.
Expand All @@ -308,30 +299,40 @@ Package Fancyhdr Warning: \headheight is too small (12.0pt):
[1{C:/ProgramData/MiKTeX/2.9/pdftex/config/pdftex.map}

]
Package color Info: Redefining color shadecolor on input line 126.
Package color Info: Redefining color shadecolor on input line 128.

Overfull \hbox (1.14696pt too wide) in paragraph at lines 128--128
Overfull \hbox (1.14696pt too wide) in paragraph at lines 130--130
[][]\OT1/cmtt/m/n/10.95 geno[] []<-[] []read.plink[][]([][]"genodata.bed"[][],[
] []"genodata.bim"[][],[] []"genodata.fam"[][])[][]$[][]genotype[][]
[]

Package color Info: Redefining color shadecolor on input line 138.
Package color Info: Redefining color shadecolor on input line 148.
Package color Info: Redefining color shadecolor on input line 157.
[2] [3]
Package color Info: Redefining color shadecolor on input line 184.
Package color Info: Redefining color shadecolor on input line 140.
Package color Info: Redefining color shadecolor on input line 150.
Package color Info: Redefining color shadecolor on input line 159.
[2]
LaTeX Font Info: Try loading font information for U+msa on input line 178.
("C:\Program Files (x86)\MiKTeX 2.9\tex\latex\amsfonts\umsa.fd"
File: umsa.fd 2013/01/14 v3.01 AMS symbols A
)
LaTeX Font Info: Try loading font information for U+msb on input line 178.

Overfull \hbox (64.38252pt too wide) in paragraph at lines 191--191
("C:\Program Files (x86)\MiKTeX 2.9\tex\latex\amsfonts\umsb.fd"
File: umsb.fd 2013/01/14 v3.01 AMS symbols B
) [3]
Package color Info: Redefining color shadecolor on input line 186.

Overfull \hbox (64.38252pt too wide) in paragraph at lines 192--192
[][]\OT1/cmtt/m/n/10.95 use[] []<-[] []([][]![][]is.na[][](snpsum.col[][]$[][]M
AF)[] []&[] []snpsum.col[][]$[][]MAF[] []>[] []minor)[] []&[] []([][]![][]is.na
[][](snpsum.col[][]$[][]z.HWE)[] []&[][]
[]


Overfull \hbox (1.14696pt too wide) in paragraph at lines 192--192
Overfull \hbox (313.47714pt too wide) in paragraph at lines 193--193
[] []\OT1/cmtt/m/n/10.95 snpsum.col[][]$[][]z.HWE[][]^[][]2[] []<[] []qnorm[
][]((hardy[][]/[][]2[][])[][]/[][]comp)[][]^[][]2[][])[] []&[] []snpsum.col[][]
$[][]Call.rate[] []>=[][]
][]((hardy[][]/[][]2[][]))[][]^[][]2[][])[] []&[] []snpsum.col[][]$[][]Call.rat
e[] []>=[] []call[] []\OT1/cmtt/m/it/10.95 #This creates a Boolean Vector of
TRUE and FALSE[][]
[]

Package color Info: Redefining color shadecolor on input line 206.
Expand Down Expand Up @@ -407,24 +408,24 @@ Overfull \hbox (35.63908pt too wide) in paragraph at lines 332--332

Package color Info: Redefining color shadecolor on input line 347.
Package color Info: Redefining color shadecolor on input line 358.
Package color Info: Redefining color shadecolor on input line 375.
[7]
Package color Info: Redefining color shadecolor on input line 375.
Package color Info: Redefining color shadecolor on input line 389.
Package color Info: Redefining color shadecolor on input line 407.

[8]
Underfull \hbox (badness 10000) in paragraph at lines 422--423

[]

Package color Info: Redefining color shadecolor on input line 425.
[8]

Overfull \hbox (1.14696pt too wide) in paragraph at lines 444--444
[] []\OT1/cmtt/m/n/10.95 pos[(TotalPrev[] []+[] []1[][])[][]:[][](TotalPrev[
] []+[] []nsnp[chr])][] []<-[] []start[] []+[] []position[chr[] []==[][]
[]

Package color Info: Redefining color shadecolor on input line 464.

[9]
Overfull \hbox (35.63908pt too wide) in paragraph at lines 466--466
[][]\OT1/cmtt/m/n/10.95 bitmap[][]([][]file[] []=[] []"manhattan.jpeg"[][],[] [
]type[] []=[] []"jpeg"[][],[] []width[] []=[] []10[][],[] []height[] []=[] []6[
Expand All @@ -438,7 +439,7 @@ Overfull \hbox (35.63908pt too wide) in paragraph at lines 469--469
og10(p-value)"[][],[][]
[]

[9]

Underfull \hbox (badness 10000) in paragraph at lines 490--491

[]
Expand All @@ -452,20 +453,21 @@ Overfull \hbox (104.62334pt too wide) in paragraph at lines 498--498
[]

Package color Info: Redefining color shadecolor on input line 510.
[10]
Package color Info: Redefining color shadecolor on input line 526.
[10] [11] (C:\Users\Eric\Desktop\RA\GWASPaper\GWASsteps.aux)
[11] (C:\Users\Eric\Desktop\RA\GWASPaper\GWASsteps.aux)

LaTeX Warning: There were undefined references.

)
Here is how much of TeX's memory you used:
2942 strings out of 493921
38278 string characters out of 3144878
106640 words of memory out of 3000000
103526 words of memory out of 3000000
6187 multiletter control sequences out of 15000+200000
11064 words of font info for 42 fonts, out of 3000000 for 9000
841 hyphenation exceptions out of 8191
27i,11n,43p,1007b,277s stack positions out of 5000i,500n,10000p,200000b,50000s
27i,11n,43p,1105b,276s stack positions out of 5000i,500n,10000p,200000b,50000s
<C:/Program Files (x86)/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmbx10.pfb>
<C:/Program Files (x86)/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmbx12.pfb><C
:/Program Files (x86)/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmbx8.pfb><C:/P
Expand All @@ -479,7 +481,7 @@ les (x86)/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmr10.pfb><C:/Program Files
6)/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmsy10.pfb><C:/Program Files (x86)
/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmsy8.pfb><C:/Program Files (x86)/Mi
KTeX 2.9/fonts/type1/public/amsfonts/cm/cmtt10.pfb>
Output written on GWASsteps.pdf (11 pages, 193617 bytes).
Output written on GWASsteps.pdf (11 pages, 193319 bytes).
PDF statistics:
90 PDF objects out of 1000 (max. 8388607)
0 named destinations out of 1000 (max. 500000)
Expand Down
Binary file modified GWASsteps.pdf
Binary file not shown.
Binary file modified GWASsteps.synctex.gz
Binary file not shown.
Loading

0 comments on commit db1d0b8

Please sign in to comment.