-
Notifications
You must be signed in to change notification settings - Fork 25
/
affymetrix.Rmd
487 lines (328 loc) · 20.6 KB
/
affymetrix.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
---
title: "The Affymetrix Analysis Workflow"
author: Mark Dunning, based on materials by Benilton Carvalho
output:
html_notebook:
toc: yes
toc_float: yes
---
![](https://upload.wikimedia.org/wikipedia/commons/2/22/Affymetrix-microarray.jpg)
# Technology overview
Affymetrix microarrays are a popular commercial platform available for a wide range of
genomics applications (gene expression profiling, SNP genotyping, ChIP-chip analysis etc.)
in different species.
The main distinction between Affymetrix and other array technologies is the use of many
short (25mer) probes to measure hybridisation.
In this practical, we explore the basic analysis issues for Affymetrix GeneChips which measure gene expression using multiple (11-20) perfect match (PM) and mismatch (MM) probes
concentrated in the 3’ region of each transcript.
![](images/pm.png)
Despite our focus on expression arrays, many
of the processing steps, such as quality assessment and normalisation are equally important
for other applications of Affymetrix technology to ensure that the signal is comparable between arrays.
In this practical we use several Bioconductor packages ([affy](http://bioconductor.org/packages/release/bioc/html/affy.html), [affyPLM](http://bioconductor.org/packages/release/bioc/html/affyPLM.html), [limma](http://bioconductor.org/packages/release/bioc/html/limma.html), etc.) to read in
the raw data, assess data quality and normalise an Affymetrix experiment. Raw data for Affymetrix arrays are usually stored in [`.cel`](http://dept.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/cel.html) format. We wouldn't normally wish to view these files directly. These files contain the intensities of each "cell" on chip surface after image analysis.
- Each hybridisation has a unique *cel* file
- It is know in advance which probe / probe set is located within each cell
- Each chip type (e.g.hgu95av2) has a *cdf* (chip description) file which is required to decode which probe and probe set is within each cell.
# Example analysis in R
## The estrogen dataset
The experiment we will analyse is made-up of eight Affymetrix *HGU95Av2*
GeneChips. The aim of the experiment is briefly described below (excerpt taken from the
factDesign package vignette).
> “The investigators in this experiment were interested in the effect of estrogen on the genes in ER+ breast cancer cells over time. After serum starvation of all eight samples, they exposed
four samples to estrogen, and then measured mRNA transcript abundance after 10 hours for
two samples and 48 hours for the other two. They left the remaining four samples untreated,
and measured mRNA transcript abundance at 10 hours for two samples, and 48 hours for
the other two. Since there are two factors in this experiment (estrogen and time), each at
two levels (present or absent, 10 hours or 48 hours), this experiment is said to have a 2 × 2
factorial design.”
The data for this section are described in the [estrogen data package](http://www.bioconductor.org/packages/release/data/experiment/html/estrogen.html) in Bioconductor
The cel files for the example experiement are stored in the `estrogen` directory
The first stage is to read the *targets* file. In Bioconductor we often use a targets file to define the files corresponding to the raw data for the experiment and which sample groups they belong to. Such a file can be created in a spreadsheet, or text editor, and usually saved as a tab-delimited file. We can have as many columns in the file as we see fit, and one row for each hybridisation. The sample group information is propogated through the analysis and incorporated in the quality assessment and eventually differential expression. We refer to this as the "phenotype data" for the experiment, or sometimes the *"metadata"*.
### Packages and data
If you did not install the `affy` Bioconductor package, you will need to do so now:-
```{r eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")
```
The data for the practical can be found in the [course zip file](https://rawgit.com/bioinformatics-core-shared-training/microarray-analysis/master/Course_Data.zip)
```{r}
library(affy)
targetsFile <- "estrogen/estrogen.txt"
targetsFile
pd <- read.AnnotatedDataFrame(targetsFile,header=TRUE,sep="",row.names=1)
pData(pd)
```
As `.cel` files are not a typical file format, we cannot use the standard R functions such as `read.delim`, `read.csv` and `read.table`. The function to import the data is `ReadAffy` from the [affy](http://bioconductor.org/packages/release/bioc/html/affy.html) package.
```{r}
raw <-ReadAffy(celfile.path = "estrogen", filenames=rownames(pData(pd)),phenoData = pd)
raw
```
******
What type of Affy array is this
How many features?
How many samples?
******
## Diagnostic plots
As with other high-throughput technologies, quality assessment is an important part of the
analysis process. Data quality can be checked using various diagnostic plots.
The first diagnostic plot we will meet is the "[boxplot](https://en.wikipedia.org/wiki/Box_plot)", which is a commonly-used plot in data analysis for *comparing data distributions*. The exact definition of how is boxplot is drawn can vary, the definitions always hold;
- The bottom of the box is the *first quartile*; the 25th percentile
- The top of the box is the *third quartile*; the 75th percentile
- The median is represented by a horizontal line
- The *inter-quartile range* (IQR) is the difference between the 75th and 25th percentiles
Here, we draw a boxplot to compare two distributions; `x` with a mean of 0 and standard deviation of 1, and `y` with a mean of 2 and standard deviation of 4.
```{r}
x <- rnorm(100)
y <- rnorm(100, mean = 2,sd=4)
boxplot(x,y)
```
It is also common to draw whiskers that extend to 1.5 X the IQR from the lower and upper quartile. Any points outside this range can be represented by a dot.
In Bioconductor, you will usually find that package authors have created shortcuts to allow complicated data types to be visualised with common functions. For instance, if we want to construct a boxplot from our raw Affymetrix data it would be quite a daunting task for the novice to extract the revelant information from the object. Instead, we can use functions that we are familiar with such as `boxplot`. The plot can also be customised in ways that we are familiar with such as changing the color (the `col` argument) and label orientation (`las=2` to make labels perpendicular to the axis)
```{r}
boxplot(raw,col="red",las=2)
```
******
### What do you notice from this plot?
******
## Perfect-Match and Mismatch probes
With the short DNA sequences used for microarray hybridisation, there is the possibility for the sequence designed to not be specific-enough and hybridise to many regions of the genome. Affymetrix attempt to resolve this issue by having a mismatch sequence corresponding to each probe. Ideally, the subtracting the mismatch from the corresponding perfect match( $PM - MM$) should yield a background-corrected, and more-reliable, signal.
******
Generate histograms of the PM and MM intensities from the first array. Do you
notice any difference in the signal distribution of the PMs and MMs?
******
```{r}
par(mfrow=c(2,1))
hist(log2(pm(raw[,1])),breaks=100,col="steelblue",main="PM",xlim=c(4,14))
hist(log2(mm(raw[,1])),breaks=100,col="steelblue",main="MM",xlim=c(4,14))
```
***M-A*** plots are a useful way of comparing the red and green channels from a two-colour
microarray. For Affymetrix data, which is a single-channel technology, M and A values can
be calculated using the intensities from a pair of chips. We would typically produce a plot using samples from the same biological group, where we would not expect to observe much difference in intensity for a given gene.
i.e. if X$_i$ is the intensity of a given
probe from chip $i$ and X$_j$ is the intensity for the same probe on chip $j$, then
$A = 1/2 (log_2(X_i) + log_2(X_j))$ and $M = log_2(X_i) − log_2(X_j)$ . In this experiment, there are 8 GeneChips, which gives
28 distinct pair-wise comparisons between arrays. We will focus on a subset of these below.
```{r}
mva.pairs(pm(raw)[,1:4],plot.method="smoothScatter")
mva.pairs(pm(raw)[,5:8],plot.method="smoothScatter")
```
******
Based on all the plots you have generated, what would you conclude about the overall quality of this experiment?
******
## Probe-level Linear Models
Probe-level Linear Models (PLMs) can be used as an additional tool to assess relative
data quality within an experiment. Many different model specifications are possible, with
the simplest fitting chip, and probe effects to the log$_2$ intensities within each probeset across
an experiment in a robust way.
The output is a matrix of residuals, or weights for each chip which can be used as an additional diagnostic; systematically high residuals across an entire array, or a large fraction of an array is indicative of an outlier array. The main use of this tool is in deciding whether or
not to keep an array in the down-stream data analysis.
### Relative Log Expression (RLE)
The Relative Log Expression (RLE) values are computed by calculating for each probe-set the ratio between the expression of a probe-set and the median expression of this probe-set across all arrays of the experiment. It is assumed that most probe-sets are not changed across the arrays, so it is expected that these ratios are around 0 on a log scale. The boxplots presenting the distribution of these log-ratios should then be centered near 0 and have similar spread. Other behavior would be a sign of low quality.
### Normalized Unscaled Standard Error (NUSE)
The Normalized Unscaled Standard Error (NUSE) is the individual probe error fitting the Probe-Level Model (the PLM models expression measures using a M-estimator robust regression). The NUSE values are standardized at the probe-set level across the arrays: median values for each probe-set are set to 1. The boxplots allow checking (1) if all distributions are centered near 1 (typically an array with a boxplot centered around 1.1 shows bad quality) and (2) if one array has globally higher spread of NUSE distribution than others, which may also be a sign of low quality.
```{r}
library(affyPLM)
plmset <- fitPLM(raw)
NUSE(plmset,las=2)
RLE(plmset,las=2)
```
We can look at the images of the array surface. Ocassionally this can reveal problems on the array. See this [gallery](http://plmimagegallery.bmbolstad.com/) of interesting examples. Affymetrix, and other older arrays are vunerable to spatial artefacts as the same probe set is found in the same location on each chip.
If you want to look at an example of a bad array image, we can use the following code;
```{r}
bad <- ReadAffy(celfile.path = "estrogen/",filenames="bad.cel")
image(bad)
```
Try out some images for this dataset
```{r}
par(mfrow=c(2,4))
image(raw[,1])
image(raw[,2])
image(raw[,3])
image(raw[,4])
image(raw[,5])
image(raw[,6])
image(raw[,7])
image(raw[,8])
```
- Do you see any problems?
If we are happy with the quality of the raw data we can proceed to the next step. So far in the data, we have multiple probes within a probeset, and a perfect and mismatch measurement for each probe. These are not particular covenient values for statistical analysis as we would like a *single measurement* for each gene for each sample (/hybridisation).
## Summarising and Normalising the Estrogen dataset
Many normalisation and summarisation methods have been developed for Affymetrix
data. These include MAS5.0 and PLIER which were developed by Affymetrix, and RMA,
GC-RMA, dChip and vsn (to name but a few) which have been developed by academic researchers.
Many of these methods are available in the [affy](http://bioconductor.org/packages/release/bioc/html/affy.html) package. For a comparison of some of these
methods and assessment of their performance on different control data sets, see [Bolstad et al](http://bioinformatics.oxfordjournals.org/content/19/2/185.long) or [Millenaar et al](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-137). In
this practical we will use the ***RMA*** (Robust Multichip Averaging) method described in [Irizarry et al](http://www.ncbi.nlm.nih.gov/pubmed/12925520).
Procedure;
- model based background adjustment
- followed by quantile normalisation and a
- robust summary method (median polish) on the log$_2$ PM intensities
- to obtain *probeset summary values*.
## Normalisation
- We want to be observing *biological* and not *technical* variation
- We wouldn't expect such wholesale changes on a per-sample basis
- Easy option would to scale values for each array to median level
- **What would happen if we had hybridised all the *estrogen-treated* samples on the first four arrays, and *un-treated* on the last four** - Could you analyse the data?
![](images/linear-effects.png)
- Genes on array 2 are on average 2.6 lower than the global median, so add 2.6 to each gene
- Genes on array 8 are on average 0.7 higher than the global median, so subtract 0.7 from each gene
- etc
## Non-linear effects
As we saw before The *MA-plot* is commonly-used in microarray analysis to commonly-used to visualise differences between pairs of arrays. These plots can reveal non-linear effects and motivate more-sophisticated methods. On older array technologies, it was typical to see a banana-shaped trend on these plots.
(***not our dataset***)
![](images/non-linear.png)
## Quantile normalisation
This is arguably the most-popular normalisation method available. Consider the following matrix of values to be normalised
```{r echo=FALSE}
# Set a seed to make sure we get the same matrix
set.seed("15022016")
raw.values <- round(data.frame(Array1 = rnorm(5,mean = 2), Array2 = rnorm(5,mean=3), Array3 = rnorm(5,mean=4)),2)
rownames(raw.values) <- LETTERS[1:nrow(raw.values)]
raw.values
```
```{r echo=FALSE}
ranked.values <- apply(raw.values, 2, function(x) paste("Rank",rank(x,ties.method="min"),sep=""))
ranked.values
```
Sort each column Smallest...Largest
Sorted data
```{r}
apply(raw.values, 2,sort)
```
Then calculate target distribution by averaging the sorted rows
```{r echo=FALSE}
target <- round(rowMeans(apply(raw.values, 2,sort)),3)
names(target) <- paste("Rank", 1:length(target),sep="")
target
```
Go back to the rank matrix
```{r echo=FALSE}
ranked.values
```
Substitute with values from the target distribution
```{r echo=FALSE}
target
```
```{r}
ranked.values[,1] <- gsub("Rank1",target["Rank1"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank2",target["Rank2"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank3",target["Rank3"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank4",target["Rank4"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank5",target["Rank5"],ranked.values[,1])
ranked.values
```
```{r}
for(i in 1:3){
ranked.values[,i] <- gsub("Rank1",target["Rank1"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank2",target["Rank2"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank3",target["Rank3"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank4",target["Rank4"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank5",target["Rank5"],ranked.values[,i])
}
ranked.values <- as.data.frame(ranked.values)
rownames(ranked.values) <- rownames(raw.values)
```
```{r}
raw.values
ranked.values
```
##Final Code
The procedure can be encapsulated in relatively few lines of code
```{r eval=FALSE}
ranked.values <- apply(raw.values, 2, function(x) paste("Rank",
rank(x,ties.method="min"),sep=""))
target <- round(rowMeans(apply(df, 2,sort)),3)
names(target) <- paste("Rank", 1:length(target),sep="")
for(i in 1:ncol(raw.values)){
for(nm in names(target)){
ranked.values[,i] <- gsub(nm,target[nm],ranked.values[,i])
}
}
norm <- as.data.frame(ranked.values)
```
We can use the `normalizeQuantiles` function within the `limma` package to do the normalisation.
```{r message=FALSE}
library(limma)
?normalizeQuantiles
```
### Caveats
- Distributions of samples are expected to be the same
- Majority of genes do not change between groups
+ might be violated if you have samples that are dramatically altered
+ e.g. low number of genes in a screening panel that are expected to change
## Other normalisation procedures
- loess or splines
+ need a reference or "average" array to compare
+ estimate the curve and use to correct each point
```{r}
library(affy)
?normalize.loess
```
## Running RMA on our dataset
The affy package provides lots of different ways to process raw Affymetricx data, one of which is an implementation of RMA. For other methods, see the `expresso` function or package *Vignette*
```{r eval=FALSE}
?expresso
vignette("affy")
```
The result is an `ExpressionSet`, which is a ubiquitous object in Biooconductor for storing high-throughput data. Later-on in the course will we import data from public repositories, and these data will often be in a normalised form.
```{r}
eset <- rma(raw)
eset
```
The `ExpressionSet` is a convenient object-type for storing multiple high-dimensional data frames. The structure of the object is much more complicated than other types of data we have seen before. In fact we do not need to know need exactly *how* the object is constructed internally. The package authors have provided convenient functions that will allow us to access the data.
For example, if we want to extract the expression measurements themselves the function to use is called `exprs`. The result will be a matrix with one row for each gene, and one column for each sample.
```{r}
head(exprs(eset))
summary(exprs(eset))
```
You will notice the *rma* also incorporates a log$_2$ transformation. The values present in the *cel* files are derived from processing the *TIFF* images of the chip surface, which are on the scale 0 - 2^16. However, this is not a very convenient scale for visualisation and analysis as most of the observations are found at the lower end of the scale and there is a much greater variance at the higer values. We say that the data exhibit ['heteroscedasticity'](https://en.wikipedia.org/wiki/Heteroscedasticity). To alleviate this we can use a data transformation such as log$_2$ and afterwards our data will be in the range 0 to 16. Another method you may come across is [vsn](http://bioinformatics.oxfordjournals.org/content/18/suppl_1/S96.long).
```{r}
boxplot(exprs(eset),las=2)
```
******
### How can you tell from the boxplot that these data have been normalised?
******
An MA-plot can be repeated on the normalised data to verify that the normalisation procedure has been sucessful.
```{r}
mva.pairs(exprs(eset)[,1:4],log.it = FALSE,plot.method="smoothScatter")
```
If we have used a *targets* file to read the raw data, this information will stored in the summarised object. Commonly-referred to as "pheno data", this can retrieved using the `pData` function. A nice property of this data frame is that the *rows* are arranged in the same order as the *columns* of the expression matrix. i.e. The first column of the expression matrix is described in the first row of the pheno data matrix, and so on.
```{r}
head(pData(eset))
colnames(exprs(eset))
rownames(pData(eset))
```
Clustering techniques and PCA that we will meet later in the course can also inform decisions about experiment quality by providing a way of visualing relationships between sample groups.
![](images/affy-clustering.png)
![](images/affy-pca.png)
## Automated production of QC plots
The `arrayQualityMetrics` package is able to produce QA plots from set of cel files, or a normalised dataset.
```{r eval=FALSE}
library(arrayQualityMetrics)
arrayQualityMetrics(eset)
```
### Thoughts on Quality Assessment
- It is difficult to come up with definitive rules about when to reject a sample from the analysis
- Try and keep as much information from the lab
+ RIN (RNA Integrity Numbers), tumour composition
- If we try and reject samples for looking different, we might bias our results and miss interesting findings
- If the majority of our samples are poor quality, the better quality ones might stand-out as outliers!
- Try and collect as many samples / replicates as possible to be tolerant to variation
## Summary
- Affy data come in `.cel` files that can be imported with the `affy` package
- QC checks on the raw data include;
+ check the chip image
+ probe-level models
+ boxplots
- Affy data need to be summarised before further analysis
+ `rma` (and variants thereof) is most-popular
- Affy data can be summarised into a common Bioconductor object-type
+ The "`ExpressionSet`"
- The `ExpressionSet`, or derivaties thereof, is used throughout Bioconductor to represent all types of high-throughput data
+ methylation, ChIP, RNA-seq
+ basically anything where you can have some kind of genomic measure as rows, and samples as columns