Updates to day 4

lwaldron · lwaldron · commit 4e546bfa14b8 · 2025-07-18T09:37:14.000+02:00
diff --git a/vignettes/day4_batcheffects-vis.Rmd b/vignettes/day4_batcheffects-vis.Rmd
@@ -35,19 +35,17 @@ library(AppStatBio)
 
 > “The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey
 
-- Discover biases, systematic errors and unexpected variability in data
-- Graphical approach to detecting these issues
-- Represents a first step in data analysis and guides hypothesis testing
-- Opportunities for discovery in the outliers
+- Discover biases, systematic errors and unexpected variability in data.
+- Graphical approach to detecting these issues. Represents a first step in data analysis and guides hypothesis testing.
+- EDA helps us check the assumptions of our statistical tests.
+- Opportunities for discovery are often in the outliers.
 
 ## Quantile Quantile Plots
 
-- Quantiles divide a distribution into equally sized bins
-- Division into 100 bins gives percentiles
-- Quantiles of a theoretical distribution are plotted against an experimental distribution
-    - alternatively, quantiles of two experimental distributions
-- Given a perfect fit, $x=y$
-- Useful in determining data distribution (normal, t, etc.)
+- **Why use them?** A primary tool for checking whether our data follows a theoretical distribution.
+- Quantiles divide a distribution into equally sized bins (e.g., 100 bins for percentiles).
+- We plot the quantiles from our data against the theoretical quantiles of a distribution (e.g., the normal distribution).
+- If our data perfectly matches the theoretical distribution, the points will form a straight line ($y=x$). Deviations from the line indicate our data does not fit that distribution.
 
 ## Example: Quantile Quantile Plots
 
@@ -84,10 +82,9 @@ qqline(x)
 
 ## Boxplots: About
 
-- Provide a graph that is easy to interpret where data is not normally distributed
-- Would be an appropriate choice to explore income data, as distribution is highly skewed
-- Particularly informative in relation to outliers and range
-- Possible to compare multiple distributions side by side
+- **Why use them?** Boxplots excel at showing the distribution of data, especially when it is not normally distributed (e.g., highly skewed data like income).
+- They provide a simple, easy-to-interpret summary of the data's range, center, and spread, while clearly highlighting outliers.
+- Their greatest advantage comes from placing them side-by-side to compare distributions across multiple groups at once.
 
 ## Boxplots: Example
 
@@ -98,14 +95,13 @@ qqnorm(exec.pay, main = "CEO Compensation")
 boxplot(exec.pay, ylab="10,000s of dollars", ylim=c(0,400), main = "CEO Compensation")
 ```
 <center>
-Three different views of a continuous variable
+Three different views of a continuous variable. The boxplot clearly shows the skew and outliers.
 </center>
 
 ## Scatterplots And Correlation: About
 
-- For two continuous variables, scatter plot and calculation of correlation is useful
-- Provides a graphical and numeric estimation of relationships
-- Quick and easy with `plot()` and `cor()`
+- For two continuous variables, a scatter plot graphically shows the relationship, while correlation provides a single number to summarize its strength and direction.
+- Quick and easy with `plot()` and `cor()`.
 
 ## Scatterplots And Correlation: Example
 
@@ -182,13 +178,15 @@ legend("bottomright",
 
 ## Volcano plots: Summary
 
-- Many small p-values with small effect size indicate low within-group variability
-- Inspect for asymmetry
-- Can color points by significance threshold
+- A volcano plot lets us visualize both the **statistical significance** (p-value) and **biological significance** (effect size or fold change) at the same time for thousands of genes.
+- **Top-right/left corners:** Genes with large effect sizes and high statistical significance. These are often the most interesting candidates.
+- **Top-center:** Genes that are statistically significant but have a small effect size.
+- **Bottom:** Genes that are not statistically significant, regardless of their effect size.
+- Can color points by significance threshold. Check for asymmetry, which might indicate biases.
 
 ## P-value histograms: Setup
 
-- If all null hypotheses are true, expect a flat histogram of p-values:
+- If all null hypotheses are true (i.e., no genes are truly differentially expressed), we expect a **flat histogram** of p-values, where every p-value from 0 to 1 is equally likely.
 
 ```{r pvalhist1}
 m <- nrow(geneExpression)
@@ -218,46 +216,37 @@ hist(permresults$p.value)
 
 ## P-value histograms: Summary
 
-- Give a quick look at how many significant p-values there may be
-- When using permuted labels, can expose non-independence among the samples
-    + can be due to batch effects or family structure
-- Most common approaches for correcting batch effects are:
-    + `ComBat`: corrects for known batch effects by linear model), and 
-    + `sva`: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
-    + correction using control (housekeeping) genes
-    + `batchelor` for single-cell analysis
-
-`ComBat` and `sva` are available from the [sva](https://www.bioconductor.org/packages/sva) Bioconductor package
+- Give a quick look at the overall results of a high-throughput experiment. A spike near zero suggests the presence of differentially expressed genes.
+- A non-uniform histogram for permuted data is a red flag, suggesting non-independence between samples, often due to hidden batch effects.
 
 ## MA plot
 
-- just a scatterplot rotated 45$^o$
+- **Why use it?** An MA plot is a clever transformation of a scatter plot, designed to better visualize differences between two samples (or one sample and a reference). It's just a scatter plot rotated 45$^\circ$.
+- The rotation helps us see systematic biases. The 'A' (Average) on the x-axis represents overall signal intensity, and the 'M' (Minus) on the y-axis is the difference between the two samples, which is the log fold change (log-ratio) when the data are on a log scale. This makes it much easier to see whether the fold change depends on gene intensity.
 
 ```{r pvalhist4, fig.height=3}
 rafalib::mypar(1, 2)
 pseudo <- apply(geneExpression, 1, median)
-plot(geneExpression[, 1], pseudo)
-plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))
+plot(geneExpression[, 1], pseudo) # Standard scatter plot
+plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo)) # MA plot
 ```
 
 ## MA plot: Summary
 
-- useful for quality control of high-dimensional data
-- plot all data values for a sample against another sample or a median "pseudosample"
-- `affyPLM::MAplots` better MA plots
-    - adds a smoothing line to highlight departures from horizontal line
-    - plots a "cloud" rather than many data points
+- Useful for quality control of high-dimensional data.
+- In an ideal MA plot, the cloud of points is centered on y=0 with no trend.
+- `affyPLM::MAplots` creates better MA plots, adding a smoothing line to highlight departures from the horizontal.
 
 ## Heatmaps
 
-* Detailed representation of high-dimensional dataset.
-    - `ComplexHeatmap` package is the best as of 2024: large datasets, interactive heatmaps, simple defaults but many customizations possible
+* Detailed representation of a high-dimensional dataset. The `ComplexHeatmap` package is the best as of 2025.
+* **Important Note:** Before plotting, we usually **scale** the data for each gene. This ensures the color pattern is driven by relative expression changes, not by a few highly expressed genes dominating the color scale.
 
 ```{r ma1, fig.width=12, echo=FALSE}
 suppressPackageStartupMessages(library(ComplexHeatmap))
-keep <- rank(apply(geneExpression, 1, var)) <= 100  # 500 most variable genes
+keep <- rank(-apply(geneExpression, 1, var)) <= 100  # 100 most variable genes (rank() is ascending, so negate)
 ge <- geneExpression[keep, ]
-ge <- t(scale(t(ge))) #scale
+ge <- t(scale(t(ge))) # Scale genes across samples
 rownames(ge) <- NULL; colnames(ge) <- NULL
 chr <- sub("chr", "", geneAnnotation$CHR)
 chr[is.na(chr)] <- "Un"
@@ -273,162 +262,89 @@ Heatmap(ge, use_raster = FALSE, top_annotation = column_ha, right_annotation = r
 
 ## Heatmaps: Summary
 
-- Clustering becomes slow and memory-intensive for thousands of rows
-    - probably too detailed for thousands of rows
-- can show co-expressed genes, groups of samples
+- Heatmaps are great for visualizing co-expressed genes and sample groups, but clustering becomes slow and memory-intensive for thousands of rows.
 
 ## Colors
 
-- Types of color pallettes: 
-    - **sequential**: shows a gradient
-    - **diverging**: goes in two directions from a center
-    - **qualitative**: for categorical variables
-- Keep color blindness in mind (10% of all men)
-
-## Colors (cont'd)
-
-Combination of `RColorBrewer` package and `colorRampPalette()` can create anything you want
-
-```{r brewer, fig.height=5, echo=FALSE}
-rafalib::mypar(1, 1)
-RColorBrewer::display.brewer.all(n = 7)
-```
+- Palettes: **sequential** (gradient), **diverging** (two directions from a center), **qualitative** (categorical).
+- Keep color blindness in mind (it affects roughly 8% of men). `RColorBrewer::display.brewer.all(colorblindFriendly = TRUE)` lists the colorblind-friendly palettes.
 
 ## Plots To Avoid
 
 > "Pie charts are a very bad way of displaying information." - R Help
 
-- Avoid pie charts
-- Avoid doughnut charts too
-- Avoid pseudo 3D
-- Use color judiciously
+- **Avoid pie charts and doughnut charts.** Humans are much better at judging length and position than angles and areas. A simple bar chart is almost always a better, clearer alternative.
+- **Avoid pseudo 3D plots.** They distort the data and make comparisons difficult.
+- Use color judiciously to highlight, not to decorate.
 
 # Batch effects
 
 ## Pervasiveness of batch Effects
 
-- pervasive in genomics (e.g. [Leek *et al.* Nat Rev Genet. 2010 Oct;11(10):733-9.](https://www.ncbi.nlm.nih.gov/pubmed/?term=20838408))
-- affect DNA and RNA sequencing, proteomics, imaging, microarray...
-- have caused high-profile problems and retractions
-    - you can't get rid of them
-    - but you can make sure they are not confounded with your experiment
-    
-## Batch Effects - an example
+- Pervasive in genomics and have caused high-profile retractions.
+- You can't get rid of them, but you can design your experiment to manage them.
+- **The Golden Rule:** Make sure batch is not confounded with your variable of interest.
+- Consider this **nightmare scenario**:
 
-- Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
-- Title: *Common genetic variants account for differences in gene expression among ethnic groups.*
-    - "The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes."
+> * **Batch 1:** All your "Control" samples, processed on Monday.
+> * **Batch 2:** All your "Treatment" samples, processed on Tuesday.
+>
+> If you find a difference, is it due to the treatment or the processing day? It's **impossible to know**.
+
+Prevent such "confounding" through **blocking** and **randomization** during experimental design.
+
+## Blocking and Randomization
+
+* _Randomization_ is the process of randomly assigning participants or experimental units to different treatments or batches.
+   - Randomization protects against systematic relationships between treatment and batch, or between study subject and treatment, even for sources of variability you have not anticipated.
+* _Blocking_ is the process of grouping similar experimental units together to control for known sources of variability (e.g., time, technician, reagent lot).
+   - For example, if you have multiple technicians, you can block by technician to ensure each treatment is applied by each technician.
+   - This helps reduce variability and improve the accuracy of your results.
+   - Blocking only works for known sources of variability.
+
+## Batch effects impact clustering
 
-```{r ge, message=FALSE}
+```{r clust1}
+# Data from a real study where date of processing was confounded with ethnicity
 library(Biobase)
 library(genefilter)
 library(GSE5859)
 data(GSE5859)
 geneExpression = exprs(e)
 sampleInfo = pData(e)
+year <-  as.integer(format(sampleInfo$date, "%y")) - min(as.integer(format(sampleInfo$date, "%y")))
+hcclass <- cutree(hclust(as.dist(1 - cor(geneExpression))), k = 5)
+table(hcclass, year) # Clustering is driven by year of processing, not biology
 ```
 
-* Note: the `ExpressionSet` object is obsolete, we use `SummarizedExperiment` now
-
-## Date of processing as a proxy for batch
-
-- Sample metadata included *date of processing*:
-
-```{r ge2}
-head(table(sampleInfo$date))
-```
-
-```{r ge3}
-year <-  as.integer(format(sampleInfo$date, "%y"))
-year <- year - min(year)
-month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
-table(year, sampleInfo$ethnicity)
-```
-
-## Visualizing batch effects by PCA
-
-```{r ge4, cache=TRUE, warning=FALSE}
-pc <- prcomp(t(geneExpression), scale. = TRUE)
-```
-
-```{r, echo=FALSE, warning=FALSE}
-boxplot(
-    pc$x[, 1] ~ month,
-    varwidth = TRUE,
-    notch = TRUE,
-    main = "PC1 scores vs. month",
-    xlab = "Month",
-    ylab = "PC1"
-)
-```
-
-## Visualizing batch effects by MDS
-
-A starting point for a color palette:
-```{r ge5, eval=TRUE}
-RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)
-```
-
-Interpolate one color per month on a quantitative palette:
-```{r rcb2}
-col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
-MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
-```
-
-## Visualizing batch effects by MDS
+## Approaches to correcting for batch effects
 
-```{r mds1, fig.height=3.5, fig.align='center'}
-d <- as.dist(1 - cor(geneExpression))
-mds <- cmdscale(d)
-plot(mds, col = MYcols[as.integer(factor(month))],
-     main = "MDS shaded by month")
-```
+Methods can be categorized by their approach and data type:
 
-## The batch effects impact clustering
+* **Simple Rescaling (e.g., `batchelor::rescaleBatches()`):**
+    * Rescales batches to have the same mean/variance.
+    * Good for single-cell data because it maintains sparsity (zeros stay zeros).
 
-```{r clust1}
-hcclass <- cutree(hclust(d), k = 5)
-table(hcclass, year)
-```
+* **For Known Batches (Bulk RNA-seq):**
+    * **`limma::removeBatchEffect()`** or **`sva::ComBat()`**: Use a linear model to regress out the effect of known batch variables (e.g., processing date, sequencing machine). Assumes cell type composition is similar across batches.
 
-## Approaches to correcting for batch effects
+* **For Unknown Batches (Bulk RNA-seq):**
+    * **`sva::sva()`**: Identifies and creates "surrogate variables" that capture hidden sources of variation. Excellent when you don't know the exact source of the batch effect but the p-value histogram looks problematic.
 
-* _No correction_
-    - in my experience, the best choice for machine learning applications
-* [Simple rescaling](http://bioconductor.org/books/3.17/OSCA.multisample/integrating-datasets.html#by-rescaling-the-counts)
-    - Rescale observations (cells) to the same mean, variance in each batch
-    - maintains sparsity (ie zeros remain zeros)
-    - `batchelor::rescaleBatches()`
-* _Linear modeling to regress out batch effects_
-    - use when batches are known. Fits a linear model and use residuals
-    - assumes the same composition of cells across batches
-    - `limma::removeBatchEffect()`, `sva::comBat()`, `batchelor::regressBatches()`
-* _Linear modeling to achieve a flat p-value histogram when permuting labels_
-    - can be used when batches are unknown
-    - "Surrogate Variables Analysis" implemented by the [sva](https://bioconductor.org/packages/sva/) package
-* _Mutual Nearest Neighbors_
-    - developed specifically for single-cell RNA-seq
-    - no assumption of the same composition of cells across batches, but still 
-        assumes no meaningful biological differences exist between batches
-    - `batchelor::fastMNN()`
+* **For Single-Cell Data Integration:**
+    * **`batchelor::fastMNN()`**: Finds Mutual Nearest Neighbors (MNNs) between batches to align them. Powerful for single-cell RNA-seq because it does **not** assume the same composition of cells across batches.
 
 ## Batch Effects - summary
 
-- batches can be corrected for **only** if not overlapping with conditions of interest
-    - if confounded with treatment or outcome of interest, nothing can help you
-    - randomization of samples to batches in study design protects against this
-- during experimental design:
-    - keep track of anything that might cause a batch effect for post-hoc analysis
-    - include control samples in each batch
-- tend to affect many or all measurements by a little bit
-
-## Exercises
-
-* OSCA Multi-sample [Chapter 1: Correcting batch effects](http://bioconductor.org/books/release/OSCA.multisample/integrating-datasets.html)
+- Batch effects can ONLY be corrected if they are not confounded with your variable of interest.
+- **Randomization is your best defense.**
+- Always record info that might become a batch effect (date, technician, reagent lot, etc.).
+- Include control samples in each batch.
 
 ## Links
 
 -   A built [html](https://waldronlab.io/AppStatBio/articles/day4_batcheffects-vis.html)
     version of this lecture is available.
 -   The [source](https://github.com/waldronlab/AppStatBio/blob/main/vignettes/day4_batcheffects-vis.Rmd) R Markdown is
      available from Github.
+
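
The new "Blocking and Randomization" slide describes the idea in words only. A minimal base R sketch of one way to do it follows: samples are randomized to batches separately within each treatment group, so every batch contains every treatment. The sample IDs, group sizes, and number of batches are hypothetical and not taken from the lecture data.

```r
# Hypothetical design: 12 samples, 2 treatments, processed in 3 batches.
set.seed(42)  # fixed seed so the assignment can be reproduced
samples <- data.frame(
  id = sprintf("S%02d", 1:12),
  treatment = rep(c("Control", "Treatment"), each = 6)
)

# Randomize within each treatment group: shuffle a balanced vector of
# batch labels so each batch receives the same number of samples per group.
n_batches <- 3
samples$batch <- NA_integer_
for (trt in unique(samples$treatment)) {
  idx <- which(samples$treatment == trt)
  labels <- rep(seq_len(n_batches), length.out = length(idx))
  samples$batch[idx] <- sample(labels)  # random permutation of the labels
}

# Cross-tabulate to confirm that treatment is balanced across batches.
design <- table(samples$treatment, samples$batch)
print(design)
```

Which sample lands in which batch depends on the seed, but the cross-tabulation is balanced by construction. This contrasts with the "nightmare scenario" where each batch holds a single treatment.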