---
title: 'Bioinformatics for Big Omics Data: Microarray normalization'
author: "Raphael Gottardo"
date: "January 21, 2014"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    keep_md: yes
    smaller: yes
---
## Setting up some options
Let's first turn on the cache for faster re-knitting and tidy the displayed code for readability
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
# cache=TRUE reuses chunk results; message=FALSE hides package messages
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, message=FALSE)
```
## What is normalization
- Normalization is needed to ensure that observed differences in intensities are indeed biological and not due to some technical artifact (e.g. array batch, technician, etc.)
- Normalization is necessary before any analysis that involves between-slide comparisons of intensities (i.e. almost all analyses)
- Normalization techniques differ between spotted/two-color (cDNA) and high-density oligonucleotide technologies
## cDNA microarray: An example
We have two colors: red (R) and green (G)
$M=\log_2(R/G)$ (and $A=\frac{1}{2}\log_2(R\cdot G)$)
Smyth, G. K., & Speed, T. (2003). Normalization of cDNA microarray data. Methods, 31(4), 265–273. doi:10.1016/s1046-2023(03)00155-5
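
As a minimal sketch of these definitions, the chunk below simulates red and green intensities for one array and computes M and A. The simulated data and variable names are assumptions made for illustration only, and A is taken here as the average log-intensity, $\frac{1}{2}\log_2(R\cdot G)$.

```{r}
# Simulated two-color intensities (illustrative only, not real array data)
set.seed(1)
n <- 1000
G <- 2^runif(n, 6, 14)                    # green channel, raw scale
R <- G * 2^rnorm(n, mean = 0, sd = 0.3)   # red channel: green times a random fold change
M <- log2(R / G)                          # log-ratio
A <- 0.5 * log2(R * G)                    # average log-intensity
plot(A, M, pch = ".", main = "MA plot (simulated)")
abline(h = 0, col = "red")
```

With no dye bias in the simulation, the cloud should be centered on the horizontal line at zero.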
## What is normalization
<img src="Images/Printip.png" width=600>
## Analysis of variance (ANOVA)
- A statistical procedure, due to Fisher, used to attribute observed variability to one or more potential sources ("treatments" or "factors").
- This is what we should do with microarray data
Kerr, M. K., Martin, M., & Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology : a Journal of Computational Molecular Cell Biology, 7(6), 819–837. doi:10.1089/10665270050514954
## ANOVA - Design
<img src="Images/Design.png" width=500>
## ANOVA - Design
<img src="Images/Design-dyeswap.png" width=500>
## ANOVA - Model
<img src="Images/Anova.png" width=600>
- Are some effects confounded?
- Can we estimate all the effects?
- Need replicates
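
A rough sketch of how a model of this kind might be fit by least squares in base R. It assumes main effects for array, dye, variety and gene plus array-by-gene and variety-by-gene interactions (our reading of Kerr et al.; the exact model is the one in the figure), and it uses a simulated two-array dye-swap design. All data and variable names are hypothetical.

```{r}
# Simulated dye-swap experiment: 2 arrays x 2 dyes, 20 genes (illustrative only)
set.seed(1)
n_genes <- 20
d <- data.frame(
  gene    = rep(seq_len(n_genes), each = 4),
  array   = rep(c(1, 1, 2, 2), times = n_genes),
  dye     = rep(c(1, 2, 1, 2), times = n_genes),
  variety = rep(c(1, 2, 2, 1), times = n_genes)  # varieties swap dyes on array 2
)
gene_level <- rnorm(n_genes)                     # gene main effects
d$y <- 8 + 0.3 * (d$array == 2) + 0.5 * (d$dye == 2) + gene_level[d$gene] +
  2 * (d$gene == 5 & d$variety == 2) +           # gene 5 is up in variety 2
  rnorm(nrow(d), sd = 0.1)
factors <- c("gene", "array", "dye", "variety")
d[factors] <- lapply(d[factors], factor)
# Least-squares fit; the variety:gene terms carry the differential expression
fit <- lm(y ~ array + dye + variety + gene + array:gene + variety:gene, data = d)
anyNA(coef(fit))              # FALSE when every coefficient is estimable
df.residual(fit)              # degrees of freedom left for the error
coef(fit)["variety2:gene5"]   # interaction estimate for gene 5 (relative to gene 1)
```

In this design every coefficient is estimable, but few degrees of freedom remain for the error, which is one way to see why replicates are needed.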
## ANOVA normalization
- Kerr and Churchill use least squares to estimate the effects
- The variety-by-gene (VG) effect is automatically normalized
- They use the bootstrap to compute error bars
- A nice statistical approach, but it does not account for non-linear effects (see the MA plot)!
It's even worse than that!
## ANOVA normalization, enough?
<img src="Images/Lowess.png" width=400>
- $M=\log_2(R/G)$ (and $A=\frac{1}{2}\log_2(R\cdot G)$)
- Strong non-linear relationship
## Lowess Normalization
- Locally weighted scatterplot smoothing technique (Cleveland, 1979)
- Local linear polynomial fit, robust to outliers
- Each smoothed value is computed using neighboring values in a given window (span)
- Span value $0< f <1$ (proportion of data to use)
**Lowess normalization normalizes the data as follows:**
$$M \leftarrow M - c(A)$$ where $c(A)$ is the estimated lowess fit.
This can be done for each print-tip, and the M values could even be scaled if necessary (again could be done by print-tip).
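
The update $M \leftarrow M - c(A)$ can be sketched with the base R `lowess()` function. The simulated bias curve, the span `f = 0.3` and the variable names below are assumptions chosen for illustration, not recommended settings.

```{r}
# Simulated MA data with an intensity-dependent dye bias (illustrative only)
set.seed(1)
n <- 2000
A <- runif(n, 6, 14)
M <- 1.5 * exp(-(A - 6) / 2) - 0.3 + rnorm(n, sd = 0.3)
fit <- lowess(A, M, f = 0.3)               # f is the span
c_A <- approx(fit$x, fit$y, xout = A)$y    # fitted curve c(A) at each spot
M_norm <- M - c_A                          # M <- M - c(A)
par(mfrow = c(1, 2))
plot(A, M, pch = ".", main = "Before")
lines(fit, col = "red")
plot(A, M_norm, pch = ".", main = "After")
abline(h = 0, col = "red")
```

A print-tip version would repeat the same fit within each print-tip group of spots.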
## Lowess + scale
<img src="Images/Lowess-scale.png" width=600>
**Note:** Lowess can also be applied to one-color arrays (see cyclic lowess in R)
## Normalization of oligo-based arrays
**Quantile-quantile plot (qqplot):**
- Used to determine whether two samples come from populations with the same distribution.
- Plot the quantiles of the first sample against the quantiles of the second.
- If the points fall on a straight line with slope 1 and intercept 0, the distributions are the same.
- The rationale of quantile normalization is to force the line to be $y=x$ when comparing any two arrays.
- This can be extended to $n$ dimensions, i.e. $n$ data vectors.
- If the distributions are the same, the points should align on the line passing through the origin and $(1,1,\ldots,1)$.
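
A quick illustration with the base R `qqplot()` function, using two simulated "arrays" whose log-intensities differ in location and scale. The data are made up for the example.

```{r}
# Two simulated arrays with different distributions (illustrative only)
set.seed(1)
x <- rnorm(5000, mean = 8, sd = 1.0)
y <- rnorm(5000, mean = 8.5, sd = 1.3)
qq <- qqplot(x, y, pch = ".",
             xlab = "Quantiles of array 1", ylab = "Quantiles of array 2")
abline(0, 1, col = "red")   # y = x: where points would fall if the distributions matched
```

Here the points drift away from the red line, which is the kind of discrepancy quantile normalization is meant to remove.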
## Quantile normalization
Given $n$ arrays of length $p$, form $X$ of dimension $p \times n$ where each array is a column:
1. sort each column of $X$ to give $X_{sort}$
2. take the means across rows of $X_{sort}$ and assign this mean to each element in the row to get $X'_{sort}$
3. get $X_{normalized}$ by rearranging each column of $X'_{sort}$ to have the same order as the original $X$
Bolstad, B. M., Irizarry, R. A., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185–193. doi:10.1093/bioinformatics/19.2.185
See http://en.wikipedia.org/wiki/Quantile_normalization for a quick illustration
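
The three steps above can be sketched in a few lines of base R. This is a teaching sketch, not the implementation used by any package: the function name `quantile_normalize` is hypothetical, ties are broken by order of appearance rather than averaged, and missing values are not handled. In practice one would more likely call a packaged routine such as `normalize.quantiles` from preprocessCore or `normalizeQuantiles` from limma.

```{r}
# Base-R sketch of the three steps; ties are broken by order of appearance
quantile_normalize <- function(X) {
  X_sort <- apply(X, 2, sort)                           # 1. sort each column
  row_means <- rowMeans(X_sort)                         # 2. mean across each row of X_sort
  ranks <- apply(X, 2, rank, ties.method = "first")     # position of each value in its column
  X_norm <- apply(ranks, 2, function(r) row_means[r])   # 3. restore the original order
  dimnames(X_norm) <- dimnames(X)
  X_norm
}
# Three simulated arrays with different distributions (illustrative only)
set.seed(1)
X <- cbind(a1 = rnorm(1000, 8, 1), a2 = rnorm(1000, 8.5, 1.3), a3 = rnorm(1000, 7.5, 0.8))
X_norm <- quantile_normalize(X)
apply(X_norm, 2, quantile)   # the quantiles now agree across columns
```

After normalization every column contains the same set of values, and only their order differs from array to array.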
## Quantile illustration
<img src="Images/quantile-density.png" width=500>
## MA plots before quantile normalization
<img src="Images/before-quantile.png" width=500>
## MA plots after quantile normalization
<img src="Images/after-quantile.png" width=500>
## Summary
- You should always normalize your data before any analysis unless you have a very good reason not to!
- In most cases, quantile normalization will do unless you're working with cDNA arrays
- Quantile normalization is readily available in R and is part of the standard pipelines for Affymetrix and Illumina data analyses
**Note that all these techniques assume that most genes are not changing across conditions. This could be a problem in some contexts.**
## Rank invariant methods
Li, C., & Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol.
- Find a set of genes, which is believed not to change
- Use these to find the non-linear relationship in the MA plot
- How do we find such genes?
- Rank based methods.
- Look at the intensity ranks: if a gene is not differentially expressed, its rank should be about the same in the different samples.
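
A simplified sketch of the idea in base R, on simulated data. It is not the Li and Wong procedure: the invariant set is chosen in a single pass with an arbitrary 5% rank-change cutoff, and the data, the cutoff and the variable names are all assumptions for illustration.

```{r}
# Simulated log2 intensities for a reference and a sample array (illustrative only)
set.seed(1)
n <- 1000
x <- runif(n, 6, 14)                          # reference array
de <- seq_len(n) %in% sample(n, 50)           # 50 truly changing genes
bias <- 1.5 * exp(-(x - 6) / 3)               # smooth intensity-dependent bias
y <- x + bias + 2 * de + rnorm(n, sd = 0.1)   # sample array
# Call a gene rank-invariant if its rank moves by less than 5% of n (arbitrary cutoff)
invariant <- abs(rank(x) - rank(y)) / n < 0.05
M <- y - x
A <- (x + y) / 2
fit <- lowess(A[invariant], M[invariant], f = 0.3)   # curve from invariant genes only
c_A <- approx(fit$x, fit$y, xout = A, rule = 2)$y    # evaluate the curve for all genes
M_norm <- M - c_A
table(invariant, de)
```

Because the curve is estimated from the invariant genes only, the changing genes should have little influence on the correction that is then applied to every gene.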
## Rank based lowess normalization
<img src="Images/Lowess.png" width=600>