05_overview_enrichment_analysis.Rmd

# (PART\*) Part II: Enrichment analysis {-}

# Overview of enrichment analysis {#enrichment-overview}


## Terminology

### Gene sets and pathway

A gene set is an unordered collection of genes that are functionally related. A pathway can be interpreted as a gene set by ignoring functional relationships among genes.

### Gene Ontology (GO)


[Gene Ontology](http://www.geneontology.org/) defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:

+ MF: Molecular Function
  - molecular activities of gene products
+ CC: Cellular Component
  - where gene products are active
+ BP: Biological Process
  - pathways and larger processes made up of the activities of multiple gene products

GO terms are organized in a directed acyclic graph, where edges between terms represent parent-child relationship.


### Kyoto Encyclopedia of Genes and Genomes (KEGG)

[KEGG](https://www.genome.jp/kegg/) is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided into [7 broad categories](https://www.genome.jp/kegg/pathway.html): 

1. Metabolism
2. Genetic information processing
3. Environmental information processing
4. Cellular processes
5. Organismal systems
6. Human diseases
7. Drug development.

<!-- https://pathview.uncc.edu/data/khier.tsv -->

### Other gene sets

GO and KEGG are the most frequently used for functional analysis. They are typically the first choice because of their long-standing curation and availability for a wide range of species.


Other gene sets include but are not limited to Disease Ontology ([DO](http://disease-ontology.org/)), Disease Gene Network ([DisGeNET](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4397996/)), [wikiPathways](https://www.wikipathways.org), Molecular Signatures Database ([MSigDb](http://software.broadinstitute.org/gsea/msigdb)). 


## Over Representation Analysis {#ora-algorithm}

Over Representation Analysis (ORA) [@boyle2004] is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, *e.g.* a list of differentially expressed genes (DEGs). 


The _p_-value can be calculated by hypergeometric distribution.


$p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}$

In this equation, `N` is the total number of genes in the background distribution,
`M` is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest,
`n` is the size of the list of genes of interest and `k` is the number of genes within that list which are annotated to the gene set. The background distribution by default is all the genes that have annotation. _P_-values should be adjusted for [multiple comparison](https://en.wikipedia.org/wiki/Multiple_comparisons_problem).


**Example:** Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differentially expressed genes, 28 are annotated to a gene set^[example adopted from <https://guangchuangyu.github.io/2012/04/enrichment-analysis/>]. 


```{r}
d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d
```                        


Whether the overlap(s) of 25 genes are significantly over represented in the gene set can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher's exact test.

```{r}
fisher.test(d, alternative = "greater")
```


## Gene Set Enrichment Analysis {#gsea-algorithm}


A common approach to analyzing gene expression profiles is identifying differentially expressed genes that are deemed interesting. The [ORA enrichment analysis](#ora-algorithm) is based on these differentially expressed genes. This approach will find genes where the difference is large and will fail where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)[@subramanian_gene_2005] directly addresses this limitation.
All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This is important since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

Genes are ranked based on their phenotypes. Given apriori defined set of gene _S_ (e.g., genes sharing the same _DO_ category), the goal of GSEA is to determine whether the members of _S_ are randomly distributed throughout the ranked gene list (_L_) or primarily found at the top or bottom.

There are three key elements of the GSEA method:

* Calculation of an Enrichment Score.
  + The enrichment score (_ES_) represents the degree to which a set _S_ is over-represented at the top or bottom of the ranked list _L_. The score is calculated by walking down the list _L_, increasing a running-sum statistic when we encounter a gene in _S_ and decreasing when it is not encountered. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The _ES_ is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov(KS)-like statistic [@subramanian_gene_2005].
* Esimation of Significance Level of _ES_.
  + The _p_-value of the _ES_ is calculated using a permutation test. Specifically, we permute the gene labels of the gene list _L_ and recompute the _ES_ of the gene set for the permutated data, which generate a null distribution for the _ES_. The _p_-value of the observed _ES_ is then calculated relative to this null distribution.
* Adjustment for Multiple Hypothesis Testing.
  + When the entire gene sets are evaluated, the estimated significance level is adjusted to account for multiple hypothesis testing and also _q_-values are calculated for FDR control.

We implemented the GSEA algorithm proposed by Subramanian [@subramanian_gene_2005]. Alexey Sergushichev implemented an algorithm for fast GSEA calculation in the `r Biocpkg("fgsea")` [@korotkevich_fast_2019] package. In our packages (`r Biocpkg("clusterProfiler")`, `r Biocpkg("DOSE")`, `r Biocpkg("meshes")` and `r Biocpkg("ReactomePA")`), users can use the GSEA algorithm implemented in `DOSE` or `fgsea` by specifying the parameter `by="DOSE"` or `by="fgsea"`. By default, the `fgsea` method will be used since it is much more faster.


## Leading edge analysis and core enriched genes

Leading edge analysis reports `Tags` to indicate the percentage of genes contributing to the enrichment score, `List` to indicate where in the list the enrichment score is attained and `Signal` for enrichment signal strength.

It would also be very interesting to get the core enriched genes that contribute to the enrichment. Our packages (`r Biocpkg("clusterProfiler")`, `r Biocpkg("DOSE")`, `r Biocpkg("meshes")` and `r Biocpkg("ReactomePA")`) support leading edge analysis and report core enriched genes in GSEA analysis.