-
Notifications
You must be signed in to change notification settings - Fork 29
/
05_overview_enrichment_analysis.Rmd
108 lines (59 loc) · 7.17 KB
/
05_overview_enrichment_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
# (PART\*) Part II: Enrichment analysis {-}
# Overview of enrichment analysis {#enrichment-overview}
## Terminology
### Gene sets and pathway
A gene set is an unordered collection of genes that are functionally related. A pathway can be interpreted as a gene set by ignoring functional relationships among genes.
### Gene Ontology (GO)
[Gene Ontology](http://www.geneontology.org/) defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:
+ MF: Molecular Function
- molecular activities of gene products
+ CC: Cellular Component
- where gene products are active
+ BP: Biological Process
- pathways and larger processes made up of the activities of multiple gene products
GO terms are organized in a directed acyclic graph, where edges between terms represent parent-child relationship.
### Kyoto Encyclopedia of Genes and Genomes (KEGG)
[KEGG](https://www.genome.jp/kegg/) is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided into [7 broad categories](https://www.genome.jp/kegg/pathway.html):
1. Metabolism
2. Genetic information processing
3. Environmental information processing
4. Cellular processes
5. Organismal systems
6. Human diseases
7. Drug development.
<!-- https://pathview.uncc.edu/data/khier.tsv -->
### Other gene sets
GO and KEGG are the most frequently used for functional analysis. They are typically the first choice because of their long-standing curation and availability for a wide range of species.
Other gene sets include but are not limited to Disease Ontology ([DO](http://disease-ontology.org/)), Disease Gene Network ([DisGeNET](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4397996/)), [wikiPathways](https://www.wikipathways.org), Molecular Signatures Database ([MSigDb](http://software.broadinstitute.org/gsea/msigdb)).
## Over Representation Analysis {#ora-algorithm}
Over Representation Analysis (ORA) [@boyle2004] is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, *e.g.* a list of differentially expressed genes (DEGs).
The _p_-value can be calculated by hypergeometric distribution.
$p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}$
In this equation, `N` is the total number of genes in the background distribution,
`M` is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest,
`n` is the size of the list of genes of interest and `k` is the number of genes within that list which are annotated to the gene set. The background distribution by default is all the genes that have annotation. _P_-values should be adjusted for [multiple comparison](https://en.wikipedia.org/wiki/Multiple_comparisons_problem).
**Example:** Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differentially expressed genes, 28 are annotated to a gene set^[example adopted from <https://guangchuangyu.github.io/2012/04/enrichment-analysis/>].
```{r}
d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d
```
Whether the overlap(s) of 25 genes are significantly over represented in the gene set can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher's exact test.
```{r}
fisher.test(d, alternative = "greater")
```
## Gene Set Enrichment Analysis {#gsea-algorithm}
A common approach to analyzing gene expression profiles is identifying differentially expressed genes that are deemed interesting. The [ORA enrichment analysis](#ora-algorithm) is based on these differentially expressed genes. This approach will find genes where the difference is large and will fail where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)[@subramanian_gene_2005] directly addresses this limitation.
All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This is important since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
Genes are ranked based on their phenotypes. Given apriori defined set of gene _S_ (e.g., genes sharing the same _DO_ category), the goal of GSEA is to determine whether the members of _S_ are randomly distributed throughout the ranked gene list (_L_) or primarily found at the top or bottom.
There are three key elements of the GSEA method:
* Calculation of an Enrichment Score.
+ The enrichment score (_ES_) represents the degree to which a set _S_ is over-represented at the top or bottom of the ranked list _L_. The score is calculated by walking down the list _L_, increasing a running-sum statistic when we encounter a gene in _S_ and decreasing when it is not encountered. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The _ES_ is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov(KS)-like statistic [@subramanian_gene_2005].
* Esimation of Significance Level of _ES_.
+ The _p_-value of the _ES_ is calculated using a permutation test. Specifically, we permute the gene labels of the gene list _L_ and recompute the _ES_ of the gene set for the permutated data, which generate a null distribution for the _ES_. The _p_-value of the observed _ES_ is then calculated relative to this null distribution.
* Adjustment for Multiple Hypothesis Testing.
+ When the entire gene sets are evaluated, the estimated significance level is adjusted to account for multiple hypothesis testing and also _q_-values are calculated for FDR control.
We implemented the GSEA algorithm proposed by Subramanian [@subramanian_gene_2005]. Alexey Sergushichev implemented an algorithm for fast GSEA calculation in the `r Biocpkg("fgsea")` [@korotkevich_fast_2019] package. In our packages (`r Biocpkg("clusterProfiler")`, `r Biocpkg("DOSE")`, `r Biocpkg("meshes")` and `r Biocpkg("ReactomePA")`), users can use the GSEA algorithm implemented in `DOSE` or `fgsea` by specifying the parameter `by="DOSE"` or `by="fgsea"`. By default, the `fgsea` method will be used since it is much more faster.
## Leading edge analysis and core enriched genes
Leading edge analysis reports `Tags` to indicate the percentage of genes contributing to the enrichment score, `List` to indicate where in the list the enrichment score is attained and `Signal` for enrichment signal strength.
It would also be very interesting to get the core enriched genes that contribute to the enrichment. Our packages (`r Biocpkg("clusterProfiler")`, `r Biocpkg("DOSE")`, `r Biocpkg("meshes")` and `r Biocpkg("ReactomePA")`) support leading edge analysis and report core enriched genes in GSEA analysis.