> “The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey
37
37
38
-
- Discover biases, systematic errors and unexpected variability in data
39
-
- Graphical approach to detecting these issues
40
-
- Represents a first step in data analysis and guides hypothesis testing
41
-
- Opportunities for discovery in the outliers
38
+
- Discover biases, systematic errors and unexpected variability in data.
39
+
- Graphical approach to detecting these issues. Represents a first step in data analysis and guides hypothesis testing.
40
+
- EDA helps us check the assumptions of our statistical tests.
41
+
- Opportunities for discovery are often in the outliers.
42
42
43
43
## Quantile Quantile Plots
44
44
45
-
- Quantiles divide a distribution into equally sized bins
46
-
- Division into 100 bins gives percentiles
47
-
- Quantiles of a theoretical distribution are plotted against an experimental distribution
48
-
- alternatively, quantiles of two experimental distributions
49
-
- Given a perfect fit, $x=y$
50
-
- Useful in determining data distribution (normal, t, etc.)
45
+
- **Why use them?** A primary tool for checking whether our data follow a theoretical distribution.
46
+
- Quantiles divide a distribution into equally sized bins (e.g., 100 bins for percentiles).
47
+
- We plot the quantiles from our data against the theoretical quantiles of a distribution (e.g., the normal distribution).
48
+
- If our data perfectly matches the theoretical distribution, the points will form a straight line ($y=x$). Deviations from the line indicate our data does not fit that distribution.
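A minimal sketch of what a Q-Q plot computes, built by hand on simulated data (the data, seed, and variable names are illustrative assumptions, not part of the slides). The theoretical quantiles use the sample mean and standard deviation so that a good fit falls on the line $y=x$:

```r
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)        # simulated data, assumed for illustration

p <- seq(0.01, 0.99, by = 0.01)          # the 1st to 99th percentiles
sample_q <- quantile(x, probs = p)       # quantiles of our data
theory_q <- qnorm(p, mean = mean(x), sd = sd(x))  # matching normal quantiles

plot(theory_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)                             # the identity line y = x
```

`qqnorm()` and `qqline()` do the equivalent work in one call each, but plot against standard normal quantiles, so the reference line is generally not $y=x$.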
51
49
52
50
## Example: Quantile Quantile Plots
53
51
@@ -84,10 +82,9 @@ qqline(x)
84
82
85
83
## Boxplots: About
86
84
87
-
- Provide a graph that is easy to interpret where data is not normally distributed
88
-
- Would be an appropriate choice to explore income data, as distribution is highly skewed
89
-
- Particularly informative in relation to outliers and range
90
-
- Possible to compare multiple distributions side by side
85
+
- **Why use them?** Boxplots excel at showing the distribution of data, especially when it is not normally distributed (e.g., highly skewed data like income).
86
+
- They provide a simple, easy-to-interpret summary of the data's range, center, and spread, while clearly highlighting outliers.
87
+
- Their greatest advantage comes from placing them side-by-side to compare distributions across multiple groups at once.
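As a rough illustration of the side-by-side comparison, the sketch below draws boxplots for three simulated, right-skewed groups. The group labels and log-normal parameters are made up for the example:

```r
set.seed(2)
pay <- data.frame(
  group = rep(c("A", "B", "C"), each = 100),
  # log-normal draws give skewed, income-like values (illustrative only)
  value = c(rlnorm(100, 3.0, 0.5),
            rlnorm(100, 3.3, 0.8),
            rlnorm(100, 2.8, 0.4))
)

# One box per group; points beyond the whiskers are drawn as outliers
bp <- boxplot(value ~ group, data = pay, ylab = "Simulated value")

bp$stats[3, ]   # the medians, one per group
```

`boxplot()` invisibly returns the summary statistics it draws, which is handy for checking the numbers behind the picture.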
91
88
92
89
## Boxplots: Example
93
90
@@ -98,14 +95,13 @@ qqnorm(exec.pay, main = "CEO Compensation")
98
95
boxplot(exec.pay, ylab="10,000s of dollars", ylim=c(0,400), main = "CEO Compensation")
99
96
```
100
97
<center>
101
-
Three different views of a continuous variable
98
+
Three different views of a continuous variable. The boxplot clearly shows the skew and outliers.
102
99
</center>
103
100
104
101
## Scatterplots And Correlation: About
105
102
106
-
- For two continuous variables, scatter plot and calculation of correlation is useful
107
-
- Provides a graphical and numeric estimation of relationships
108
-
- Quick and easy with `plot()` and `cor()`
103
+
- For two continuous variables, a scatter plot graphically shows the relationship, while correlation provides a single number to summarize its strength and direction.
104
+
- Quick and easy with `plot()` and `cor()`.
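A small sketch of the two calls together, on simulated data (the variables and the strength of the relationship are assumptions chosen for illustration):

```r
set.seed(3)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.5)    # linear relationship plus noise

plot(x, y)                              # the graphical view

r  <- cor(x, y)                         # Pearson correlation (the default)
rs <- cor(x, y, method = "spearman")    # rank-based, less sensitive to outliers
```

Looking at the plot before trusting `r` is worthwhile, since a single number can hide outliers or a non-linear pattern.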
109
105
110
106
## Scatterplots And Correlation: Example
111
107
@@ -182,13 +178,15 @@ legend("bottomright",
182
178
183
179
## Volcano plots: Summary
184
180
185
-
- Many small p-values with small effect size indicate low within-group variability
186
-
- Inspect for asymmetry
187
-
- Can color points by significance threshold
181
+
- A volcano plot lets us visualize both the **statistical significance** (p-value) and **biological significance** (effect size or fold change) at the same time for thousands of genes.
182
+
- **Top-right/left corners:** Genes with large effect sizes and high statistical significance. These are often the most interesting candidates.
183
+
- **Top-center:** Genes that are statistically significant but have a small effect size.
184
+
- **Bottom:** Genes that are not statistically significant, regardless of their effect size.
185
+
- Can color points by significance threshold. Check for asymmetry, which might indicate biases.
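To make the regions of the plot concrete, here is a hedged sketch on simulated expression data. The matrix, the 50 spiked-in genes, and the thresholds (p < 0.01, |effect| > 1) are all assumptions for illustration, not values from the slides:

```r
set.seed(4)
n_genes <- 1000
n_per_group <- 5
group <- rep(c("ctrl", "trt"), each = n_per_group)

expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
# Spike a true shift into the first 50 genes of the treatment group
expr[1:50, group == "trt"] <- expr[1:50, group == "trt"] + 3

# Effect size: difference in group means (log scale assumed)
effect <- rowMeans(expr[, group == "trt"]) - rowMeans(expr[, group == "ctrl"])
# One t-test per gene
pval <- apply(expr, 1, function(g) {
  t.test(g[group == "trt"], g[group == "ctrl"])$p.value
})

sig <- pval < 0.01 & abs(effect) > 1
plot(effect, -log10(pval), pch = 16,
     col = ifelse(sig, "red", "grey50"),
     xlab = "Effect size (difference in means)", ylab = "-log10(p-value)")
abline(h = -log10(0.01), v = c(-1, 1), lty = 2)   # the chosen thresholds
```

Because only positive shifts were spiked in, the highlighted points cluster in the top-right corner, which is the kind of asymmetry worth checking for in real data.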
188
186
189
187
## P-value histograms: Setup
190
188
191
-
- If all null hypotheses are true, expect a flat histogram of p-values:
189
+
- If all null hypotheses are true (i.e., no genes are truly differentially expressed), we expect a **flat histogram** of p-values, where every p-value from 0 to 1 is equally likely.
192
190
193
191
```{r pvalhist1}
194
192
m <- nrow(geneExpression)
@@ -218,46 +216,37 @@ hist(permresults$p.value)
218
216
219
217
## P-value histograms: Summary
220
218
221
-
- Give a quick look at how many significant p-values there may be
222
-
- When using permuted labels, can expose non-independence among the samples
223
-
+ can be due to batch effects or family structure
224
-
- Most common approaches for correcting batch effects are:
225
-
+ `ComBat`: corrects for known batch effects (by linear model), and
226
-
+ `sva`: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
227
-
+ correction using control (housekeeping) genes
228
-
+ `batchelor` for single-cell analysis
229
-
230
-
`ComBat` and `sva` are available from the [sva](https://www.bioconductor.org/packages/sva) Bioconductor package
219
+
- Give a quick look at the overall results of a high-throughput experiment. A spike near zero suggests the presence of differentially expressed genes.
220
+
- A non-uniform histogram for permuted data is a red flag, suggesting non-independence between samples, often due to hidden batch effects.
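The flat shape under the null can be checked with a short simulation. This is a sketch under stated assumptions: independent samples, no true differences, and arbitrary choices for the number of tests and the group size:

```r
set.seed(7)
n_tests <- 2000   # number of simulated "genes" (illustrative)
n <- 6            # samples per group (illustrative)

# Both groups come from the same distribution, so every null is true
null_p <- replicate(n_tests, t.test(rnorm(n), rnorm(n))$p.value)

hist(null_p, breaks = 20,
     main = "P-values when all nulls are true", xlab = "p-value")

mean(null_p < 0.05)   # fraction below 0.05, expected to be close to 0.05
```

A clear departure from this flat shape after permuting labels on real data is what points toward non-independence between samples.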
231
221
232
222
## MA plot
233
223
234
-
- just a scatterplot rotated 45$^\circ$
224
+
- **Why use it?** An MA plot is a clever transformation of a scatter plot, designed to better visualize differences between two samples (or one sample and a reference). It's just a scatterplot rotated 45$^\circ$.
225
+
- The rotation helps us see systematic biases. The 'A' (Average) on the x-axis represents overall signal intensity, and the 'M' (Minus, or log-ratio) on the y-axis represents the fold change. This makes it much easier to see if the fold change is dependent on gene intensity.
- useful for quality control of high-dimensional data
246
-
- plot all data values for a sample against another sample or a median "pseudosample"
247
-
- `affyPLM::MAplots` better MA plots
248
-
- adds a smoothing line to highlight departures from horizontal line
249
-
- plots a "cloud" rather than many data points
236
+
- Useful for quality control of high-dimensional data.
237
+
- In an ideal MA plot, the cloud of points is centered on y=0 with no trend.
238
+
- `affyPLM::MAplots` creates better MA plots, adding a smoothing line to highlight departures from the horizontal.
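A base-R sketch of the M and A computation for two samples, using simulated intensities (the data and the noise level are assumptions; this is not how `affyPLM::MAplots` is implemented):

```r
set.seed(5)
s1 <- 2^rnorm(2000, mean = 8, sd = 2)     # simulated intensities, sample 1
s2 <- s1 * 2^rnorm(2000, sd = 0.3)        # sample 2: same signal plus noise

M <- log2(s2) - log2(s1)                  # "Minus": the log-ratio (fold change)
A <- (log2(s1) + log2(s2)) / 2            # "Average": overall log intensity

plot(A, M, pch = ".")
abline(h = 0, col = "red")                # the ideal: centered on M = 0
lines(lowess(A, M), col = "blue")         # smoother to reveal any trend
```

If the smoothed line bends away from zero at low or high `A`, the fold change depends on intensity, which usually calls for normalization.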
250
239
251
240
## Heatmaps
252
241
253
-
* Detailed representation of high-dimensional dataset.
254
-
- `ComplexHeatmap` package is the best as of 2024: large datasets, interactive heatmaps, simple defaults but many customizations possible
242
+
* Detailed representation of a high-dimensional dataset. The `ComplexHeatmap` package is the best as of 2025.
243
+
* **Important Note:** Before plotting, we usually **scale** the data for each gene. This ensures the color pattern is driven by relative expression changes, not by a few highly expressed genes dominating the color scale.
- Clustering becomes slow and memory-intensive for thousands of rows
277
-
- probably too detailed for thousands of rows
278
-
- can show co-expressed genes, groups of samples
265
+
- Clustering becomes slow for thousands of rows but is great for visualizing co-expressed genes and sample groups.
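The per-gene scaling step can be sketched in base R as row-wise z-scores. The toy matrix below is an assumption for illustration, and base `heatmap()` stands in for `ComplexHeatmap` to keep the example dependency-free:

```r
set.seed(6)
# 20 genes x 6 samples: rows 1-10 are lowly expressed, rows 11-20 highly
mat <- matrix(rnorm(20 * 6, mean = rep(c(2, 10), each = 10)),
              nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("s", 1:6)))

# scale() works on columns, so transpose, scale, and transpose back
scaled <- t(scale(t(mat)))

# scale = "none" because the rows are already z-scores
heatmap(scaled, scale = "none")
```

After scaling, every gene has mean 0 and standard deviation 1 across samples, so the highly expressed half of the matrix no longer dominates the colors.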
279
266
280
267
## Colors
281
268
282
-
- Types of color palettes:
283
-
- **sequential**: shows a gradient
284
-
- **diverging**: goes in two directions from a center
285
-
- **qualitative**: for categorical variables
286
-
- Keep color blindness in mind (10% of all men)
287
-
288
-
## Colors (cont'd)
289
-
290
-
Combination of `RColorBrewer` package and `colorRampPalette()` can create anything you want
291
-
292
-
```{r brewer, fig.height=5, echo=FALSE}
293
-
rafalib::mypar(1, 1)
294
-
RColorBrewer::display.brewer.all(n = 7)
295
-
```
269
+
- Palettes: **sequential** (gradient), **diverging** (two directions from a center), **qualitative** (categorical).
270
+
- Keep color blindness in mind (it affects roughly 8% of men). `RColorBrewer` has colorblind-friendly options.
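A short sketch of how the colorblind-friendly palettes might be listed and extended, assuming the `RColorBrewer` package is installed. The choice of the "Blues" palette and of 50 colors is arbitrary:

```r
library(RColorBrewer)

# brewer.pal.info flags which palettes are considered colorblind-friendly
info <- brewer.pal.info
cb_friendly <- rownames(info)[info$colorblind]
cb_friendly

# Stretch a 9-color sequential palette into a smooth 50-color gradient
pal <- colorRampPalette(brewer.pal(9, "Blues"))(50)
image(matrix(1:50), col = pal, axes = FALSE)
```

`display.brewer.all(colorblindFriendly = TRUE)` shows the same subset graphically.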
296
271
297
272
## Plots To Avoid
298
273
299
274
> "Pie charts are a very bad way of displaying information." - R Help
300
275
301
-
- Avoid pie charts
302
-
- Avoid doughnut charts too
303
-
- Avoid pseudo 3D
304
-
- Use color judiciously
276
+
- **Avoid pie charts and doughnut charts.** Humans are much better at judging length and position than angles and areas. A simple bar chart is almost always a better, clearer alternative.
277
+
- **Avoid pseudo 3D plots.** They distort the data and make comparisons difficult.
278
+
- Use color judiciously to highlight, not to decorate.
- affect DNA and RNA sequencing, proteomics, imaging, microarray...
312
-
- have caused high-profile problems and retractions
313
-
- you can't get rid of them
314
-
- but you can make sure they are not confounded with your experiment
315
-
316
-
## Batch Effects - an example
284
+
- Pervasive in genomics and have caused high-profile retractions.
285
+
- You can't get rid of them, but you can design your experiment to manage them.
286
+
- **The Golden Rule:** Make sure batch is not confounded with your variable of interest.
287
+
- Consider this **nightmare scenario**:
317
288
318
-
- Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
319
-
- Title: *Common genetic variants account for differences in gene expression among ethnic groups.*
320
-
- "The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes."
289
+
> * **Batch 1:** All your "Control" samples, processed on Monday.
290
+
> * **Batch 2:** All your "Treatment" samples, processed on Tuesday.
291
+
>
292
+
> If you find a difference, is it due to the treatment or the processing day? It's **impossible to know**.
293
+
294
+
Prevent such "confounding" through **blocking** and **randomization** during experimental design.
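One quick way to spot this kind of confounding is to cross-tabulate batch against the variable of interest. The sketch below encodes the nightmare scenario with made-up sample counts:

```r
# Hypothetical sample sheet: batch is perfectly aligned with treatment
design <- data.frame(
  batch     = rep(c("Monday", "Tuesday"), each = 6),
  treatment = rep(c("Control", "Treatment"), each = 6)
)

tab <- table(design$batch, design$treatment)
tab
# Empty cells mean some batch/treatment combinations were never observed;
# here each treatment appears in only one batch, so the two are inseparable.
any(tab == 0)
```

Running the same cross-tabulation on real sample metadata before analysis is a cheap sanity check.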
295
+
296
+
## Blocking and Randomization
297
+
298
+
* _Randomization_ is the process of randomly assigning participants or experimental units to different treatments or batches.
299
+
- Randomization is the only way to protect against a systematic relationship between treatment and batch, or between study subject and treatment, including relationships arising from sources of variability you do not know about.
300
+
* _Blocking_ is the process of grouping similar experimental units together to control for known sources of variability (e.g., time, technician, reagent lot).
301
+
- For example, if you have multiple technicians, you can block by technician to ensure each treatment is applied by each technician.
302
+
- This helps reduce variability and improve the accuracy of your results.
303
+
- Only works for known sources of variability
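The two ideas combine naturally: block on the known factor, then randomize within each block. A minimal sketch, assuming three hypothetical technicians with eight samples each:

```r
set.seed(8)
samples <- data.frame(
  id         = 1:24,
  technician = rep(c("T1", "T2", "T3"), each = 8)   # the known blocking factor
)

# Within each technician (block), randomly assign half of the samples
# to each treatment
samples$treatment <- NA_character_
for (tech in unique(samples$technician)) {
  idx <- which(samples$technician == tech)
  labels <- rep(c("control", "treatment"), length.out = length(idx))
  samples$treatment[idx] <- sample(labels)          # random order in the block
}

tab <- table(samples$technician, samples$treatment)
tab   # every technician handles 4 control and 4 treatment samples
```

Because each technician processes both treatments in equal numbers, a technician effect cannot masquerade as a treatment effect.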
304
+
305
+
## Batch effects impact clustering
321
306
322
-
```{r ge, message=FALSE}
307
+
```{r clust1}
308
+
# Data from a real study where date of processing was confounded with ethnicity
323
309
library(Biobase)
324
310
library(genefilter)
325
311
library(GSE5859)
326
312
data(GSE5859)
327
313
geneExpression = exprs(e)
328
314
sampleInfo = pData(e)
315
+
year <- as.integer(format(sampleInfo$date, "%y")) - min(as.integer(format(sampleInfo$date, "%y")))
316
+
hcclass <- cutree(hclust(as.dist(1 - cor(geneExpression))), k = 5)
317
+
table(hcclass, year) # Clustering is driven by year of processing, not biology
329
318
```
330
319
331
-
* Note: the `ExpressionSet` object is obsolete, we use `SummarizedExperiment` now
332
-
333
-
## Date of processing as a proxy for batch
334
-
335
-
- Sample metadata included *date of processing*:
336
-
337
-
```{r ge2}
338
-
head(table(sampleInfo$date))
339
-
```
340
-
341
-
```{r ge3}
342
-
year <- as.integer(format(sampleInfo$date, "%y"))
343
-
year <- year - min(year)
344
-
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
* Rescales batches to have the same mean/variance.
326
+
* Good for single-cell data because it maintains sparsity (zeros stay zeros).
388
327
389
-
```{r clust1}
390
-
hcclass <- cutree(hclust(d), k = 5)
391
-
table(hcclass, year)
392
-
```
328
+
* **For Known Batches (Bulk RNA-seq):**
329
+
* **`limma::removeBatchEffect()`** or **`sva::ComBat()`**: Use a linear model to regress out the effect of known batch variables (e.g., processing date, sequencing machine). Assumes cell type composition is similar across batches.
393
330
394
-
## Approaches to correcting for batch effects
331
+
* **For Unknown Batches (Bulk RNA-seq):**
332
+
* **`sva::sva()`**: Identifies and creates "surrogate variables" that capture hidden sources of variation. Excellent when you don't know the exact source of the batch effect but the p-value histogram looks problematic.
395
333
396
-
* _No correction_
397
-
- in my experience, the best choice for machine learning applications
* _Linear modeling to achieve a flat p-value histogram when permuting labels_
407
-
- can be used when batches are unknown
408
-
- "Surrogate Variables Analysis" implemented by the [sva](https://bioconductor.org/packages/sva/) package
409
-
* _Mutual Nearest Neighbors_
410
-
- developed specifically for single-cell RNA-seq
411
-
- no assumption of the same composition of cells across batches, but still
412
-
assumes no meaningful biological differences exist between batches
413
-
- `batchelor::fastMNN()`
334
+
* **For Single-Cell Data Integration:**
335
+
* **`batchelor::fastMNN()`**: Finds Mutual Nearest Neighbors (MNNs) between batches to align them. Powerful for single-cell RNA-seq because it does **not** assume the same composition of cells across batches.
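To give a feel for what "regressing out" a known batch means, here is a deliberately simplified base-R sketch: it removes a per-gene, per-batch mean shift from simulated data. It is not the implementation used by `limma::removeBatchEffect()` or `sva::ComBat()`, and it is only reasonable here because the simulated conditions are balanced across batches:

```r
set.seed(9)
n_genes <- 100
batch     <- factor(rep(c("b1", "b2"), each = 6))
condition <- factor(rep(c("ctrl", "trt"), times = 6))  # balanced within batch

expr <- matrix(rnorm(n_genes * 12), nrow = n_genes)
expr[, batch == "b2"] <- expr[, batch == "b2"] + 1.5   # additive batch shift

# Per-gene mean within each batch (genes x batches)
batch_means <- sapply(levels(batch), function(b) rowMeans(expr[, batch == b]))

# Subtract each sample's batch mean, then add back the overall gene mean
corrected <- expr - batch_means[, as.integer(batch)] + rowMeans(expr)

# After correction the two batches share the same per-gene mean
max(abs(rowMeans(corrected[, batch == "b1"]) -
        rowMeans(corrected[, batch == "b2"])))
```

With an unbalanced design this naive centering would also remove part of the biological signal. `limma::removeBatchEffect()` takes a `design` argument for that reason, so the condition effect is preserved while the batch term is removed.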
414
336
415
337
## Batch Effects - summary
416
338
417
-
- batches can be corrected for **only** if not overlapping with conditions of interest
418
-
- if confounded with treatment or outcome of interest, nothing can help you
419
-
- randomization of samples to batches in study design protects against this
420
-
- during experimental design:
421
-
- keep track of anything that might cause a batch effect for post-hoc analysis
422
-
- include control samples in each batch
423
-
- tend to affect many or all measurements by a little bit