Analysis and pipeline prepared by: Ha-Na Shim
Contact: hshim1@outlook.com for questions, feedback, or suggestions regarding this workflow.
A lightweight and simple workflow/pipeline in R for extracting gene lists of interest from the gene ontology (GO) and MSigDB collections, as well as visualization of aggregate gene expression of a group of genes between conditions/treatments over time.
This workflow uses a small set of CRAN and Bioconductor packages.
| Package | Source | Package | Source |
|---|---|---|---|
pheatmap | CRAN | RColorBrewer | CRAN |
dplyr | CRAN | ggplot2 | CRAN |
tidyverse | CRAN | patchwork | CRAN |
matrixStats | CRAN | biomaRt | Bioconductor |
clusterProfiler | Bioconductor | org.Hs.eg.db | Bioconductor |
org.Mm.eg.db | Bioconductor | BiocManager | CRAN |
BiocManager::install(c(
"pheatmap",
"RColorBrewer",
"dplyr",
"ggplot2",
"tidyverse",
"patchwork",
"matrixStats",
"biomaRt",
"clusterProfiler",
"org.Hs.eg.db",
"org.Mm.eg.db"
))The structure of this repository and location of R scripts for functions used in this workflow are shown below.
.
├─ functions/
│ ├─ extract_GO_genes.R
│ ├─ extract_msigdb_genes.R
│ ├─ compile_gene_lists.R
│ ├─ create_aggregate_lineplot.R
│ ├─ create_clustered_heatmap.R
│ └─ compiled_functions_hshim1_v3.R
│
├─ example_workflow/
│ ├─ example_workflow.Rmd
│ └─ example_workflow.html (knitted output)
│
└─ extracted_gene_lists/
└─ combined_muscle_gene_lists.csv
-
functions/— self-contained functions used across the workflow.extract_GO_genes.R— retrieve gene sets for Gene Ontology (GO) terms (supports descendants via GO hierarchy; Ensembl BioMart backend).extract_msigdb_genes.R— fetch gene sets from MSigDB (Hallmark, KEGG, Reactome, GO, etc.) with flexible ID types and optional cross-species mapping.compile_gene_lists.R— consolidate/merge curated gene lists (e.g., combine muscle-related sets into a single reference CSV or list object).create_aggregate_lineplot.R— generate aggregate time-course/condition line plots from gene sets (e.g., mean/median expression with confidence bands).create_clustered_heatmap.R— build clustered heatmaps for selected gene panels with publication-grade settings.compiled_functions_hshim1_v3.R— convenience script bundling all functions into one file for quick sourcing in fresh environments.
-
example_workflow/— minimal, runnable example of the pipeline.example_workflow.Rmd— end-to-end vignette demonstrating how to load functions, fetch gene sets, and produce figures.example_workflow.html— knitted HTML output for quick preview of results without running the code.
-
extracted_gene_lists/— precomputed/reference gene lists.combined_muscle_gene_lists.csv— compiled muscle-related gene set used throughout the figures/examples.
Quick start (source functions & run the example)
- Install dependencies (CRAN + Bioconductor), see the “Required packages / dependencies” section.
- In a fresh R session, source either individual functions or the bundled script:
# Option A: source only what you need source("functions/extract_GO_genes.R") source("functions/extract_msigdb_genes.R") source("functions/create_clustered_heatmap.R")source("functions/compiled_functions_hshim1_v3.R") - Knit or open the example:
rmarkdown::render("example_workflow/example_workflow.Rmd") # produces example_workflow.html
Notes. Figures and intermediate outputs are not committed by default if your .gitignore excludes them; adjust as needed.
Consider using a project-local environment manager (e.g., renv) for reproducible package versions.
extract_GO_genes() provides functionality to retrieve gene lists associated with one or more Gene Ontology (GO) terms via Ensembl BioMart.
It can optionally expand each term to include descendant terms in the GO hierarchy, supports BP / MF / CC ontologies,
and returns results as a list, vector, or data.frame.
# Example: Biological process "muscle structure development" (GO:0061061)
muscle_development <- extract_GO_genes(
"GO:0061061",
include_descendants = TRUE,
organism = "human",
ontology = "BP",
output_format = "list",
export_csv = FALSE,
output_dir = "GO_gene_lists/"
)
Function overview
- What it does: Connects to Ensembl BioMart and fetches genes annotated to the specified GO term(s).
If
include_descendants = TRUE, it expands the query to all descendant terms usingGO.dband deduplicates gene symbols. - Why it’s useful: Produces comprehensive, up-to-date gene sets for biological concepts (e.g., “cell cycle”, “muscle development”).
- Performance note: Requires an internet connection. Large GO terms may take longer depending on Ensembl server availability.
| Parameter | Type | Default | Description |
|---|---|---|---|
go_terms | character | — | One or more GO IDs (e.g., "GO:0061061"). |
organism | character | "human" | "human" or "mouse". Sets Ensembl dataset. |
include_descendants | logical | TRUE | If TRUE, expands to include all descendant terms. |
ontology | character | "BP" | GO domain: "BP" (Biological Process), "MF", or "CC". |
output_format | character | "list" | "list" (default), "vector" (single GO), or "data.frame". |
export_csv | logical | FALSE | If TRUE, exports results as CSVs in output_dir. |
output_dir | character | "GO_gene_lists/" | Directory for CSV exports (created if missing). |
- Single term + "vector" → character vector of gene symbols
- Single/multiple + "list" → named list of gene vectors (with metadata in
attr(result, "metadata")) - "data.frame" → table with columns: gene_symbol, go_label, go_id, go_name
References:
Ashburner et al., Nat Genet, 2000.
The Gene Ontology Consortium, Nucleic Acids Res, 2019.
The Molecular Signatures Database (MSigDB) is a widely used collection of expert-curated and community-contributed gene sets designed for pathway and enrichment analyses (e.g., GSEA, fgsea, clusterProfiler). It organizes thousands of gene sets into thematic collections—Hallmark biology, curated pathways (KEGG, Reactome, WikiPathways), regulatory targets, ontology terms, and more—so you can move from a list of differentially expressed genes to interpretable biology with confidence. Explore the human collections here: gsea-msigdb.org › msigdb › human › genesets .
In practice, MSigDB shines when you need clean, reproducible definitions of pathways or biological programs—whether that’s a compact Hallmark signature for a figure, or a comprehensive GO category for a hypothesis test. The structure below shows how the database is organized; each collection groups gene sets with a shared purpose or source.
H — Hallmark gene sets (50 gene sets)
Compact, non-redundant “consensus” signatures that capture major biological states and processes.
C1 — Positional gene sets (302 gene sets)
Genes grouped by chromosomal location.
Chromosomes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT
C2 — Curated gene sets (7,561 gene sets)
- CGP — Chemical & genetic perturbations (3,538)
- CP — Canonical pathways (4,023), including:
- CP:BIOCARTA (292)
- CP:KEGG_MEDICUS (658)
- CP:PID (196)
- CP:REACTOME (1,787)
- CP:WIKIPATHWAYS (885)
- CP:KEGG_LEGACY (186)
C3 — Regulatory gene sets (3,713 gene sets)
- MIR — microRNA targets (2,598)
- MIR:MIRDB (2,377)
- MIR:MIR_LEGACY (221)
- TFT — Transcription factor targets (1,115)
- TFT:GTRD (505)
- TFT:TFT_LEGACY (610)
C4 — Computational gene sets (1,006 gene sets)
- 3CA — Curated Cancer Cell Atlas (148)
- CGN — Cancer gene neighborhoods (427)
- CM — Cancer modules (431)
C5 — Ontology gene sets (16,228 gene sets)
- GO — Gene Ontology (10,480)
- GO:BP — Biological Process (7,583)
- GO:CC — Cellular Component (1,042)
- GO:MF — Molecular Function (1,855)
- HPO — Human Phenotype Ontology (5,748)
C6 — Oncogenic signatures (189 gene sets)
Gene sets capturing oncogene activation and cancer-relevant programs.
C7 — Immunologic signatures (5,219 gene sets)
- IMMUNESIGDB — ImmuneSigDB (4,872)
- VAX — Vaccine response signatures (347)
C8 — Cell type signatures (866 gene sets)
Reference gene sets for cell identity and composition analyses.
Note. Counts shown above reflect the breakdown provided for the human MSigDB collections. For the latest tallies and updates, see the official MSigDB site: MSigDB Human Gene Sets.
Purpose. extract_msigdb_genes() pulls gene sets from the
Molecular Signatures Database (MSigDB)
and returns clean vectors or named lists ready for enrichment analyses (GSEA/fgsea, clusterProfiler) and figure workflows.
You can search by exact set names (e.g., HALLMARK_APOPTOSIS) or by a broad pattern (e.g., "apoptosis"),
restrict to specific collections (Hallmark, KEGG, Reactome, GO), choose your preferred ID type (SYMBOL/ENTREZID/ENSEMBL),
and optionally convert across organisms (human ↔ mouse).
# Hallmark (human, SYMBOLs)
apoptosis_genes <- extract_msigdb_genes(
gene_sets = "HALLMARK_APOPTOSIS",
organism = "human",
gene_id_type = "SYMBOL"
)
# KEGG glycolysis, restricting to canonical pathways (C2 > CP:KEGG)
kegg_glycolysis <- extract_msigdb_genes(
gene_sets = "GLYCOLYSIS",
search_all_collections = FALSE,
collection = "C2",
subcollection = "CP:KEGG"
)
# Reactome WNT, ENTREZ IDs
reactome_wnt <- extract_msigdb_genes(
gene_sets = "SIGNALING_BY_WNT",
gene_id_type = "ENTREZID",
search_all_collections = FALSE,
collection = "C2",
subcollection = "CP:REACTOME"
)
# Mouse Hallmark inflammatory response
mouse_inflammation <- extract_msigdb_genes(
gene_sets = "HALLMARK_INFLAMMATORY_RESPONSE",
organism = "mouse"
)
# Cross-species conversion (human -> mouse)
mouse_emt <- extract_msigdb_genes(
gene_sets = "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION",
organism = "human",
convert_organism = TRUE,
target_organism = "mouse"
)
# Multiple sets + metadata inspection
sets <- c("HALLMARK_P53_PATHWAY", "HALLMARK_HYPOXIA", "KEGG_CELL_CYCLE")
results <- extract_msigdb_genes(sets)
meta <- attr(results, "metadata")
# Summarize how many genes per set (requires dplyr installed)
dplyr::bind_rows(lapply(meta, as.data.frame), .id = "set_label")
| Parameter | Type | Default | Description |
|---|---|---|---|
gene_sets |
character | — | Exact names (e.g., "HALLMARK_APOPTOSIS") or case-insensitive patterns (e.g., "APOPTOSIS"). Single or vector. |
organism |
character | "human" |
"human" or "mouse" (maps to Homo sapiens/Mus musculus in msigdbr). |
gene_id_type |
character | "SYMBOL" |
Output identifiers: "SYMBOL", "ENTREZID", or "ENSEMBL". |
convert_organism |
logical | FALSE |
If TRUE, converts the result to target_organism via BioMart homology. |
target_organism |
character | NULL | Required when convert_organism = TRUE; choose "human" or "mouse". |
search_all_collections |
logical | TRUE |
Search across all MSigDB collections. Set to FALSE to restrict using collection/subcollection. |
collection |
character | NULL | One of H, C1, C2, C3, C4, C5, C6, C7, C8 (when restricting). |
subcollection |
character | NULL | Optional subcategory (e.g., CP:KEGG, CP:REACTOME, GO:BP). |
export_csv |
logical | FALSE |
If TRUE, writes each set to CSV in output_dir. |
output_dir |
character | "MSigDB_gene_lists/" |
Destination for CSV export (created if missing). |
- Single match: a character vector of gene identifiers (format per
gene_id_type). - Multiple matches: a named list of vectors. Rich metadata is attached at
attr(result, "metadata"):original_query,gene_set_name,collection,subcollection,n_genes,organism,gene_id_type,query_date.
# Example: collect metadata into a data frame
res <- extract_msigdb_genes(c("HALLMARK_APOPTOSIS", "HALLMARK_HYPOXIA", "REACTOME_TCA_CYCLE"))
meta <- attr(res, "metadata")
summary_df <- dplyr::bind_rows(lapply(meta, as.data.frame), .id = "set_label")
summary_df
- Pattern matching. If your query is a broad pattern (e.g.,
"CELL_CYCLE"), the function lists all matches and prefers exact set names when available. Otherwise, it selects the first match and reports alternatives. - ID formats. Choose the ID type that fits your downstream tools: SYMBOL for readability, ENTREZ for many enrichment functions, ENSEMBL for genomic work.
- Cross-species. Homology mapping (human ↔ mouse) uses BioMart; not all genes map 1:1, so expect some dropouts or merges.
- Exports. Set
export_csv = TRUEto archive per-set CSVs alongside the analysis. - Versions & licensing. Uses MSigDB via
msigdbr(v2023.2+). Academic use is generally free; some curated sources (e.g., KEGG) may require a commercial license—check the MSigDB site.
- msigdbr (CRAN) — MSigDB access from R
- MSigDB Human Gene Sets — browse collections
- biomaRt — organism conversion (homology)
- Liberzon A. et al. The MSigDB hallmark gene set collection. Cell Systems. 2015;1(6):417–425.
- Subramanian A. et al. Gene set enrichment analysis. PNAS. 2005;102(43):15545–15550.
Implementation note. Cross-species conversion in examples assumes a helper like convert_gene_list() is available in your codebase (BioMart-backed). Adjust to your preferred orthology strategy as needed.