forked from raphg/Biostat-578
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathProbe_summary.Rmd
180 lines (100 loc) · 5.61 KB
/
Probe_summary.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
---
title: 'Bioinformatics for Big Omics Data: Probe summary for oligonucleotide arrays'
author: "Raphael Gottardo"
date: "January 21, 2014"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```
## Probe summary
- Remember that Affymetrix and Illumina use multiple probes for a given gene
- So it is important to summarize these "replicated" measurements before any down stream analysis
- There exist many different ways to do this for both Affymetrix and Illumina
## Affymetrix arrays - probes
<img src="Images/Affy-probe.png" width=400>
- ~20 probes that “perfectly” represent the gene (Perfect Match)
- ~20 probes that do not match the gene sequence (Mismatch)
- Probeset
## Affymetrix arrays - probes (suite)
<img src="Images/PM-MM-probes.png" width=400>
For a valid gene expression measurement
the Perfect Match sticks and the Mistach does not!
## Affymetrix probesets
<img src="Images/probe-set.png" width=400>
## Probe summary - A Naive approach
<img src="Images/Naive-summary.png" width=400>
## Probe summary - A Naive approach (suite)
<img src="Images/PM-MM-probes.png" width=500>
## Basis of the MAS 4.0 model (Affymetrix)
Probe summary - MAS 4.0
<img src="Images/MAS-4.0.png" width=600>
Is this an appropriate model?
## Probe summary - Dchip
Li, C., & Wong, W. H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology.
<img src="Images/Dchip.png" width=600>
## Probe summary - Dchip vs MAS4.0
<img src="Images/Lack-of-fit.png" width=600>
## Probe summary - Dchip (suite)
<img src="Images/dChip-fit.png" width=600>
The same can be done with the probes
by looking at the variance of $\phi$
## Probe summary - MAS 5.0
<img src="Images/MAS-5.0.png" width=600>
## Alternative approaches
Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., & Speed, T. P. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4), e15.
Irizarry et al. (2003) argued that MM is a poor measure of non specific hybridization
- Modeling $PM-MM$ results in large variance!
- Model only the PM intensities
- Before do a "gentle" background correction
$$\log_2(PM^*) =\theta_i+\phi_j+\epsilon_{ij}$$
Where $PM^*$ is a normalized background corrected intensity
This the basic model implemeted in the robust multiarray average _RMA_
## RMA background estimation
<img src="Images/RMA-background.png" width=600>
This can also be used in new Affymetrix arrays that have no MM probes
## Probe sequence affinity
- Does the sequence composition have an effect?
- Yes
- Should we account for it?
- How?
- The physical system producing probe intensity is a complicated one!
- Non specific background hybridization is related to the base compositions.
- Many factors (nucleotide composition, base position, neighbors?, etc)
- **G/C content**
## GC bias
<img src="Images/GC-bias.png" width=600>
## GC bias (suite)
<img src="Images/GC-effect.png" width=600>
## Probe sequence affinity models
Naef, F., & Magnasco, M. O. (2003). Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics, 68(1 Pt 1), 011906.
<img src="Images/Positional-model.png" width=600>
## Sequence effect estimates
<img src="Images/Wu-Naef.png" width=600>
Wu, Z., & Irizarry, R. A. (2004). Stochastic models inspired by hybridization theory for short oligonucleotide arrays. the eighth annual international conference (pp. 98–106). New York, New York, USA: ACM. doi:10.1145/974614.974628
## RMA vs MAS5 vs gcRMA
<img src="Images/gcRMA-MAS-RMA.png" width=800>
## Summary
- There exist many different algorithms for computing probe summaries from Affymetrix arrays
- RMA is a popular package for probe summary
- Other good packages available in Bioconductor including gcRMA, PUMA, BGX, PLIER, XPS, etc
- Some of the packages are optimized for large datasets
## Illumina beadarrays
Bead level data from Illumina BeadArrays involve an analysis similar to Affymetrix arrays to obtain probe summary data.
Ritchie, M. E., Dunning, M. J., Smith, M. L., Shi, W., & Lynch, A. G. (2011). BeadArray expression analysis using bioconductor. PLoS Computational Biology, 7(12), e1002276. doi:10.1371/journal.pcbi.1002276
<img src="Images/journal.pcbi.1002276.g001.png" width=600>
## Bead summary from Illumina beadarrays
<small>Dunning, M. J., Barbosa-Morais, N. L., Lynch, A. G., Tavaré, S., & Ritchie, M. E. (2008). Statistical issues in the analysis of Illumina data. BMC Bioinformatics, 9(1), 85. doi:10.1186/1471-2105-9-85</small>
**Conclusions from the paper:**
- Access to bead level data promotes more detailed quality assessment and more flexible analyses. Bead level data can be summarised on a relevant scale. We were able to use the means and variances of the log2 data in the DE analysis to improve our ability to detect known changes in the spikes.
- The background correction and summarisation methods used in BeadStudio reduce bias and produce robust gene expression measurements. However, we find that back- ground normalisation introduces a significant number of negative values and much increased variability.
- Base composition of probes has an effect on intensity and further investigation is required to remove this effect.