Skip to content

Commit 4e546bf

Browse files
committed
Updates to day 4
1 parent 861f0bd commit 4e546bf

File tree

1 file changed

+81
-165
lines changed

1 file changed

+81
-165
lines changed

vignettes/day4_batcheffects-vis.Rmd

Lines changed: 81 additions & 165 deletions
Original file line numberDiff line numberDiff line change
@@ -35,19 +35,17 @@ library(AppStatBio)
3535

3636
> “The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey
3737
38-
- Discover biases, systematic errors and unexpected variability in data
39-
- Graphical approach to detecting these issues
40-
- Represents a first step in data analysis and guides hypothesis testing
41-
- Opportunities for discovery in the outliers
38+
- Discover biases, systematic errors and unexpected variability in data.
39+
- Graphical approach to detecting these issues. Represents a first step in data analysis and guides hypothesis testing.
40+
- EDA helps us check the assumptions of our statistical tests.
41+
- Opportunities for discovery are often in the outliers.
4242

4343
## Quantile Quantile Plots
4444

45-
- Quantiles divide a distribution into equally sized bins
46-
- Division into 100 bins gives percentiles
47-
- Quantiles of a theoretical distribution are plotted against an experimental distribution
48-
- alternatively, quantiles of two experimental distributions
49-
- Given a perfect fit, $x=y$
50-
- Useful in determining data distribution (normal, t, etc.)
45+
- **Why use them?** A primary tool for checking if our data follows a theoretical distribution.`
46+
- Quantiles divide a distribution into equally sized bins (e.g., 100 bins for percentiles).
47+
- We plot the quantiles from our data against the theoretical quantiles of a distribution (e.g., the normal distribution).
48+
- If our data perfectly matches the theoretical distribution, the points will form a straight line ($y=x$). Deviations from the line indicate our data does not fit that distribution.
5149

5250
## Example: Quantile Quantile Plots
5351

@@ -84,10 +82,9 @@ qqline(x)
8482

8583
## Boxplots: About
8684

87-
- Provide a graph that is easy to interpret where data is not normally distributed
88-
- Would be an appropriate choice to explore income data, as distribution is highly skewed
89-
- Particularly informative in relation to outliers and range
90-
- Possible to compare multiple distributions side by side
85+
- **Why use them?** Boxplots excel at showing the distribution of data, especially when it is not normally distributed (e.g., highly skewed data like income).
86+
- They provide a simple, easy-to-interpret summary of the data's range, center, and spread, while clearly highlighting outliers.
87+
- Their greatest advantage comes from placing them side-by-side to compare distributions across multiple groups at once.
9188

9289
## Boxplots: Example
9390

@@ -98,14 +95,13 @@ qqnorm(exec.pay, main = "CEO Compensation")
9895
boxplot(exec.pay, ylab="10,000s of dollars", ylim=c(0,400), main = "CEO Compensation")
9996
```
10097
<center>
101-
Three different views of a continuous variable
98+
Three different views of a continuous variable. The boxplot clearly shows the skew and outliers.
10299
</center>
103100

104101
## Scatterplots And Correlation: About
105102

106-
- For two continuous variables, scatter plot and calculation of correlation is useful
107-
- Provides a graphical and numeric estimation of relationships
108-
- Quick and easy with `plot()` and `cor()`
103+
- For two continuous variables, a scatter plot graphically shows the relationship, while correlation provides a single number to summarize its strength and direction.
104+
- Quick and easy with `plot()` and `cor()`.
109105

110106
## Scatterplots And Correlation: Example
111107

@@ -182,13 +178,15 @@ legend("bottomright",
182178

183179
## Volcano plots: Summary
184180

185-
- Many small p-values with small effect size indicate low within-group variability
186-
- Inspect for asymmetry
187-
- Can color points by significance threshold
181+
- A volcano plot lets us visualize both the **statistical significance** (p-value) and **biological significance** (effect size or fold change) at the same time for thousands of genes.
182+
- **Top-right/left corners:** Genes with large effect sizes and high statistical significance. These are often the most interesting candidates.
183+
- **Top-center:** Genes that are statistically significant but have a small effect size.
184+
- **Bottom:** Genes that are not statistically significant, regardless of their effect size.
185+
- Can color points by significance threshold. Check for asymmetry, which might indicate biases.
188186

189187
## P-value histograms: Setup
190188

191-
- If all null hypotheses are true, expect a flat histogram of p-values:
189+
- If all null hypotheses are true (i.e., no genes are truly differentially expressed), we expect a **flat histogram** of p-values, where every p-value from 0 to 1 is equally likely.
192190

193191
```{r pvalhist1}
194192
m <- nrow(geneExpression)
@@ -218,46 +216,37 @@ hist(permresults$p.value)
218216

219217
## P-value histograms: Summary
220218

221-
- Give a quick look at how many significant p-values there may be
222-
- When using permuted labels, can expose non-independence among the samples
223-
+ can be due to batch effects or family structure
224-
- Most common approaches for correcting batch effects are:
225-
+ `ComBat`: corrects for known batch effects by linear model), and
226-
+ `sva`: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
227-
+ correction using control (housekeeping) genes
228-
+ `batchelor` for single-cell analysis
229-
230-
`ComBat` and `sva` are available from the [sva](https://www.bioconductor.org/packages/sva) Bioconductor package
219+
- Give a quick look at the overall results of a high-throughput experiment. A spike near zero suggests the presence of differentially expressed genes.
220+
- A non-uniform histogram for permuted data is a red flag, suggesting non-independence between samples, often due to hidden batch effects.
231221

232222
## MA plot
233223

234-
- just a scatterplot rotated 45$^o$
224+
- **Why use it?** An MA plot is a clever transformation of a scatter plot, designed to better visualize differences between two samples (or one sample and a reference). It's just a scatterplot rotated 45$^o$.
225+
- The rotation helps us see systematic biases. The 'A' (Average) on the x-axis represents overall signal intensity, and the 'M' (Minus, or log-ratio) on the y-axis represents the fold change. This makes it much easier to see if the fold change is dependent on gene intensity.
235226

236227
```{r pvalhist4, fig.height=3}
237228
rafalib::mypar(1, 2)
238229
pseudo <- apply(geneExpression, 1, median)
239-
plot(geneExpression[, 1], pseudo)
240-
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))
230+
plot(geneExpression[, 1], pseudo) # Standard scatter plot
231+
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo)) # MA plot
241232
```
242233

243234
## MA plot: Summary
244235

245-
- useful for quality control of high-dimensional data
246-
- plot all data values for a sample against another sample or a median "pseudosample"
247-
- `affyPLM::MAplots` better MA plots
248-
- adds a smoothing line to highlight departures from horizontal line
249-
- plots a "cloud" rather than many data points
236+
- Useful for quality control of high-dimensional data.
237+
- In an ideal MA plot, the cloud of points is centered on y=0 with no trend.
238+
- `affyPLM::MAplots` creates better MA plots, adding a smoothing line to highlight departures from the horizontal.
250239

251240
## Heatmaps
252241

253-
* Detailed representation of high-dimensional dataset.
254-
- `ComplexHeatmap` package is the best as of 2024: large datasets, interactive heatmaps, simple defaults but many customizations possible
242+
* Detailed representation of a high-dimensional dataset. The `ComplexHeatmap` package is the best as of 2025.
243+
* **Important Note:** Before plotting, we usually **scale** the data for each gene. This ensures the color pattern is driven by relative expression changes, not by a few highly expressed genes dominating the color scale.
255244

256245
```{r ma1, fig.width=12, echo=FALSE}
257246
suppressPackageStartupMessages(library(ComplexHeatmap))
258-
keep <- rank(apply(geneExpression, 1, var)) <= 100 # 500 most variable genes
247+
keep <- rank(apply(geneExpression, 1, var)) <= 100
259248
ge <- geneExpression[keep, ]
260-
ge <- t(scale(t(ge))) #scale
249+
ge <- t(scale(t(ge))) # Scale genes across samples
261250
rownames(ge) <- NULL; colnames(ge) <- NULL
262251
chr <- sub("chr", "", geneAnnotation$CHR)
263252
chr[is.na(chr)] <- "Un"
@@ -273,162 +262,89 @@ Heatmap(ge, use_raster = FALSE, top_annotation = column_ha, right_annotation = r
273262

274263
## Heatmaps: Summary
275264

276-
- Clustering becomes slow and memory-intensive for thousands of rows
277-
- probably too detailed for thousands of rows
278-
- can show co-expressed genes, groups of samples
265+
- Clustering becomes slow for thousands of rows but is great for visualizing co-expressed genes and sample groups.
279266

280267
## Colors
281268

282-
- Types of color pallettes:
283-
- **sequential**: shows a gradient
284-
- **diverging**: goes in two directions from a center
285-
- **qualitative**: for categorical variables
286-
- Keep color blindness in mind (10% of all men)
287-
288-
## Colors (cont'd)
289-
290-
Combination of `RColorBrewer` package and `colorRampPalette()` can create anything you want
291-
292-
```{r brewer, fig.height=5, echo=FALSE}
293-
rafalib::mypar(1, 1)
294-
RColorBrewer::display.brewer.all(n = 7)
295-
```
269+
- Palettes: **sequential** (gradient), **diverging** (two directions from a center), **qualitative** (categorical).
270+
- Keep color blindness in mind (10% of men). `RColorBrewer` has colorblind-friendly options.
296271

297272
## Plots To Avoid
298273

299274
> "Pie charts are a very bad way of displaying information." - R Help
300275
301-
- Avoid pie charts
302-
- Avoid doughnut charts too
303-
- Avoid pseudo 3D
304-
- Use color judiciously
276+
- **Avoid pie charts and doughnut charts.** Humans are much better at judging length and position than angles and areas. A simple bar chart is almost always a better, clearer alternative.
277+
- **Avoid pseudo 3D plots.** They distort the data and make comparisons difficult.
278+
- Use color judiciously to highlight, not to decorate.
305279

306280
# Batch effects
307281

308282
## Pervasiveness of batch Effects
309283

310-
- pervasive in genomics (e.g. [Leek *et al.* Nat Rev Genet. 2010 Oct;11(10):733-9.](https://www.ncbi.nlm.nih.gov/pubmed/?term=20838408))
311-
- affect DNA and RNA sequencing, proteomics, imaging, microarray...
312-
- have caused high-profile problems and retractions
313-
- you can't get rid of them
314-
- but you can make sure they are not confounded with your experiment
315-
316-
## Batch Effects - an example
284+
- Pervasive in genomics and have caused high-profile retractions.
285+
- You can't get rid of them, but you can design your experiment to manage them.
286+
- **The Golden Rule:** Make sure batch is not confounded with your variable of interest.
287+
- Consider this **nightmare scenario**:
317288

318-
- Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
319-
- Title: *Common genetic variants account for differences in gene expression among ethnic groups.*
320-
- "The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes."
289+
> * **Batch 1:** All your "Control" samples, processed on Monday.
290+
> * **Batch 2:** All your "Treatment" samples, processed on Tuesday.
291+
>
292+
> If you find a difference, is it due to the treatment or the processing day? It's **impossible to know**.
293+
294+
Prevent such "confounding" through **blocking** and **randomization** during experimental design.
295+
296+
## Blocking and Randomization
297+
298+
* _Randomization_ is the process of randomly assigning participants or experimental units to different treatments or batches.
299+
- Randomization is the only way to guarantee there can be no systematic relationship between treatment and batch, or between study subject and treatment
300+
* _Blocking_ is the process of grouping similar experimental units together to control for known sources of variability (e.g., time, technician, reagent lot).
301+
- For example, if you have multiple technicians, you can block by technician to ensure each treatment is applied by each technician.
302+
- This helps reduce variability and improve the accuracy of your results.
303+
- Only works for known sources of variability
304+
305+
## The batch effects impact clustering
321306

322-
```{r ge, message=FALSE}
307+
```{r clust1}
308+
# Data from a real study where date of processing was confounded with ethnicity
323309
library(Biobase)
324310
library(genefilter)
325311
library(GSE5859)
326312
data(GSE5859)
327313
geneExpression = exprs(e)
328314
sampleInfo = pData(e)
315+
year <- as.integer(format(sampleInfo$date, "%y")) - min(as.integer(format(sampleInfo$date, "%y")))
316+
hcclass <- cutree(hclust(as.dist(1 - cor(geneExpression))), k = 5)
317+
table(hcclass, year) # Clustering is driven by year of processing, not biology
329318
```
330319

331-
* Note: the `ExpressionSet` object is obsolete, we use `SummarizedExperiment` now
332-
333-
## Date of processing as a proxy for batch
334-
335-
- Sample metadata included *date of processing*:
336-
337-
```{r ge2}
338-
head(table(sampleInfo$date))
339-
```
340-
341-
```{r ge3}
342-
year <- as.integer(format(sampleInfo$date, "%y"))
343-
year <- year - min(year)
344-
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
345-
table(year, sampleInfo$ethnicity)
346-
```
347-
348-
## Visualizing batch effects by PCA
349-
350-
```{r ge4, cache=TRUE, warning=FALSE}
351-
pc <- prcomp(t(geneExpression), scale. = TRUE)
352-
```
353-
354-
```{r, echo=FALSE, warning=FALSE}
355-
boxplot(
356-
pc$x[, 1] ~ month,
357-
varwidth = TRUE,
358-
notch = TRUE,
359-
main = "PC1 scores vs. month",
360-
xlab = "Month",
361-
ylab = "PC1"
362-
)
363-
```
364-
365-
## Visualizing batch effects by MDS
366-
367-
A starting point for a color palette:
368-
```{r ge5, eval=TRUE}
369-
RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)
370-
```
371-
372-
Interpolate one color per month on a quantitative palette:
373-
```{r rcb2}
374-
col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
375-
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
376-
```
377-
378-
## Visualizing batch effects by MDS
320+
## Approaches to correcting for batch effects
379321

380-
```{r mds1, fig.height=3.5, fig.align='center'}
381-
d <- as.dist(1 - cor(geneExpression))
382-
mds <- cmdscale(d)
383-
plot(mds, col = MYcols[as.integer(factor(month))],
384-
main = "MDS shaded by month")
385-
```
322+
Methods can be categorized by their approach and data type:
386323

387-
## The batch effects impact clustering
324+
* **Simple Rescaling (e.g., `batchelor::rescaleBatches()`):**
325+
* Rescales batches to have the same mean/variance.
326+
* Good for single-cell data because it maintains sparsity (zeros stay zeros).
388327

389-
```{r clust1}
390-
hcclass <- cutree(hclust(d), k = 5)
391-
table(hcclass, year)
392-
```
328+
* **For Known Batches (Bulk RNA-seq):**
329+
* **`limma::removeBatchEffect()`** or **`sva::ComBat()`**: Use a linear model to regress out the effect of known batch variables (e.g., processing date, sequencing machine). Assumes cell type composition is similar across batches.
393330

394-
## Approaches to correcting for batch effects
331+
* **For Unknown Batches (Bulk RNA-seq):**
332+
* **`sva::sva()`**: Identifies and creates "surrogate variables" that capture hidden sources of variation. Excellent when you don't know the exact source of the batch effect but the p-value histogram looks problematic.
395333

396-
* _No correction_
397-
- in my experience, the best choice for machine learning applications
398-
* [Simple rescaling](http://bioconductor.org/books/3.17/OSCA.multisample/integrating-datasets.html#by-rescaling-the-counts)
399-
- Rescale observations (cells) to the same mean, variance in each batch
400-
- maintains sparsity (ie zeros remain zeros)
401-
- `batchelor::rescaleBatches()`
402-
* _Linear modeling to regress out batch effects_
403-
- use when batches are known. Fits a linear model and use residuals
404-
- assumes the same composition of cells across batches
405-
- `limma::removeBatchEffect()`, `sva::comBat()`, `batchelor::regressBatches()`
406-
* _Linear modeling to achieve a flat p-value histogram when permuting labels_
407-
- can be used when batches are unknown
408-
- "Surrogate Variables Analysis" implemented by the [sva](https://bioconductor.org/packages/sva/) package
409-
* _Mutual Nearest Neighbors_
410-
- developed specifically for single-cell RNA-seq
411-
- no assumption of the same composition of cells across batches, but still
412-
assumes no meaningful biological differences exist between batches
413-
- `batchelor::fastMNN()`
334+
* **For Single-Cell Data Integration:**
335+
* **`batchelor::fastMNN()`**: Finds Mutual Nearest Neighbors (MNNs) between batches to align them. Powerful for single-cell RNA-seq because it does **not** assume the same composition of cells across batches.
414336

415337
## Batch Effects - summary
416338

417-
- batches can be corrected for **only** if not overlapping with conditions of interest
418-
- if confounded with treatment or outcome of interest, nothing can help you
419-
- randomization of samples to batches in study design protects against this
420-
- during experimental design:
421-
- keep track of anything that might cause a batch effect for post-hoc analysis
422-
- include control samples in each batch
423-
- tend to affect many or all measurements by a little bit
424-
425-
## Exercises
426-
427-
* OSCA Multi-sample [Chapter 1: Correcting batch effects](http://bioconductor.org/books/release/OSCA.multisample/integrating-datasets.html)
339+
- Batch effects can ONLY be corrected if they are not confounded with your variable of interest.
340+
- **Randomization is your best defense.**
341+
- Always record info that might become a batch effect (date, technician, reagent lot, etc.).
342+
- Include control samples in each batch.
428343

429344
## Links
430345

431346
- A built [html](https://waldronlab.io/AppStatBio/articles/day4_batcheffects-vis.html)
432347
version of this lecture is available.
433348
- The [source](https://github.com/waldronlab/AppStatBio/blob/main/vignettes/day4_batcheffects-vis.Rmd) R Markdown is
434349
available from Github.
350+

0 commit comments

Comments
 (0)