Skip to content

A lightweight and simple pipeline for extracting gene lists of interest and generating aggregate gene expression lineplots between conditions/treatments over time

Notifications You must be signed in to change notification settings

hshim1/aggregate-gene-expression

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 

Repository files navigation

aggregate-gene-expression-hshim1

Analysis and pipeline prepared by: Ha-Na Shim

Contact: hshim1@outlook.com for questions, feedback, or suggestions regarding this workflow.

Tools / Packages Used

R RStudio tidyverse ggplot2 biomaRt clusterProfiler MSigDB KEGG Reactome

RStudio hex ggplot2 hex tidyverse hex biomaRt hex clusterProfiler hex

About this repository

A lightweight and simple workflow/pipeline in R for extracting gene lists of interest from the gene ontology (GO) and MSigDB collections, as well as visualization of aggregate gene expression of a group of genes between conditions/treatments over time.

Required packages / dependencies

This workflow uses a small set of CRAN and Bioconductor packages.

Package Source Package Source
pheatmapCRAN RColorBrewerCRAN
dplyrCRAN ggplot2CRAN
tidyverseCRAN patchworkCRAN
matrixStatsCRAN biomaRtBioconductor
clusterProfilerBioconductor org.Hs.eg.dbBioconductor
org.Mm.eg.dbBioconductor BiocManagerCRAN
BiocManager::install(c(
  "pheatmap",
  "RColorBrewer",
  "dplyr",
  "ggplot2",
  "tidyverse",
  "patchwork",
  "matrixStats",
  "biomaRt",
  "clusterProfiler",
  "org.Hs.eg.db",
  "org.Mm.eg.db"
))

Repository structure

The structure of this repository and location of R scripts for functions used in this workflow are shown below.


.
├─ functions/
│  ├─ extract_GO_genes.R
│  ├─ extract_msigdb_genes.R
│  ├─ compile_gene_lists.R
│  ├─ create_aggregate_lineplot.R
│  ├─ create_clustered_heatmap.R
│  └─ compiled_functions_hshim1_v3.R
│
├─ example_workflow/
│  ├─ example_workflow.Rmd
│  └─ example_workflow.html   (knitted output)
│
└─ extracted_gene_lists/
   └─ combined_muscle_gene_lists.csv

Folders & key files

  • functions/ — self-contained functions used across the workflow.

    • extract_GO_genes.R — retrieve gene sets for Gene Ontology (GO) terms (supports descendants via GO hierarchy; Ensembl BioMart backend).
    • extract_msigdb_genes.R — fetch gene sets from MSigDB (Hallmark, KEGG, Reactome, GO, etc.) with flexible ID types and optional cross-species mapping.
    • compile_gene_lists.R — consolidate/merge curated gene lists (e.g., combine muscle-related sets into a single reference CSV or list object).
    • create_aggregate_lineplot.R — generate aggregate time-course/condition line plots from gene sets (e.g., mean/median expression with confidence bands).
    • create_clustered_heatmap.R — build clustered heatmaps for selected gene panels with publication-grade settings.
    • compiled_functions_hshim1_v3.R — convenience script bundling all functions into one file for quick sourcing in fresh environments.
  • example_workflow/ — minimal, runnable example of the pipeline.

    • example_workflow.Rmd — end-to-end vignette demonstrating how to load functions, fetch gene sets, and produce figures.
    • example_workflow.html — knitted HTML output for quick preview of results without running the code.
  • extracted_gene_lists/ — precomputed/reference gene lists.

    • combined_muscle_gene_lists.csv — compiled muscle-related gene set used throughout the figures/examples.
Quick start (source functions & run the example)
  1. Install dependencies (CRAN + Bioconductor), see the “Required packages / dependencies” section.
  2. In a fresh R session, source either individual functions or the bundled script:
    
    # Option A: source only what you need
    source("functions/extract_GO_genes.R")
    source("functions/extract_msigdb_genes.R")
    source("functions/create_clustered_heatmap.R")
    

    Option B: source all at once

    source("functions/compiled_functions_hshim1_v3.R")

  3. Knit or open the example:
    
    rmarkdown::render("example_workflow/example_workflow.Rmd")  # produces example_workflow.html
    

Notes. Figures and intermediate outputs are not committed by default if your .gitignore excludes them; adjust as needed. Consider using a project-local environment manager (e.g., renv) for reproducible package versions.

Part 1: Creating lists of genes of interest

Extracting GO Gene Sets with extract_GO_genes

extract_GO_genes() provides functionality to retrieve gene lists associated with one or more Gene Ontology (GO) terms via Ensembl BioMart. It can optionally expand each term to include descendant terms in the GO hierarchy, supports BP / MF / CC ontologies, and returns results as a list, vector, or data.frame.


# Example: Biological process "muscle structure development" (GO:0061061)
muscle_development <- extract_GO_genes(
  "GO:0061061",
  include_descendants = TRUE,
  organism = "human",
  ontology = "BP",
  output_format = "list",
  export_csv = FALSE,
  output_dir = "GO_gene_lists/"
)

Function overview
  • What it does: Connects to Ensembl BioMart and fetches genes annotated to the specified GO term(s). If include_descendants = TRUE, it expands the query to all descendant terms using GO.db and deduplicates gene symbols.
  • Why it’s useful: Produces comprehensive, up-to-date gene sets for biological concepts (e.g., “cell cycle”, “muscle development”).
  • Performance note: Requires an internet connection. Large GO terms may take longer depending on Ensembl server availability.

Parameters

Parameter Type Default Description
go_termscharacterOne or more GO IDs (e.g., "GO:0061061").
organismcharacter"human""human" or "mouse". Sets Ensembl dataset.
include_descendantslogicalTRUEIf TRUE, expands to include all descendant terms.
ontologycharacter"BP"GO domain: "BP" (Biological Process), "MF", or "CC".
output_formatcharacter"list""list" (default), "vector" (single GO), or "data.frame".
export_csvlogicalFALSEIf TRUE, exports results as CSVs in output_dir.
output_dircharacter"GO_gene_lists/"Directory for CSV exports (created if missing).

Return Value

  • Single term + "vector" → character vector of gene symbols
  • Single/multiple + "list" → named list of gene vectors (with metadata in attr(result, "metadata"))
  • "data.frame" → table with columns: gene_symbol, go_label, go_id, go_name

References:
Ashburner et al., Nat Genet, 2000.
The Gene Ontology Consortium, Nucleic Acids Res, 2019.

MSigDB: Curated Gene Sets for Enrichment Analyses

The Molecular Signatures Database (MSigDB) is a widely used collection of expert-curated and community-contributed gene sets designed for pathway and enrichment analyses (e.g., GSEA, fgsea, clusterProfiler). It organizes thousands of gene sets into thematic collections—Hallmark biology, curated pathways (KEGG, Reactome, WikiPathways), regulatory targets, ontology terms, and more—so you can move from a list of differentially expressed genes to interpretable biology with confidence. Explore the human collections here: gsea-msigdb.org › msigdb › human › genesets .

In practice, MSigDB shines when you need clean, reproducible definitions of pathways or biological programs—whether that’s a compact Hallmark signature for a figure, or a comprehensive GO category for a hypothesis test. The structure below shows how the database is organized; each collection groups gene sets with a shared purpose or source.


MSigDB Collections (Human)

H — Hallmark gene sets (50 gene sets)

Compact, non-redundant “consensus” signatures that capture major biological states and processes.

C1 — Positional gene sets (302 gene sets)

Genes grouped by chromosomal location.

Chromosomes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT

C2 — Curated gene sets (7,561 gene sets)
  • CGP — Chemical & genetic perturbations (3,538)
  • CP — Canonical pathways (4,023), including:
    • CP:BIOCARTA (292)
    • CP:KEGG_MEDICUS (658)
    • CP:PID (196)
    • CP:REACTOME (1,787)
    • CP:WIKIPATHWAYS (885)
    • CP:KEGG_LEGACY (186)
C3 — Regulatory gene sets (3,713 gene sets)
  • MIR — microRNA targets (2,598)
    • MIR:MIRDB (2,377)
    • MIR:MIR_LEGACY (221)
  • TFT — Transcription factor targets (1,115)
    • TFT:GTRD (505)
    • TFT:TFT_LEGACY (610)
C4 — Computational gene sets (1,006 gene sets)
  • 3CA — Curated Cancer Cell Atlas (148)
  • CGN — Cancer gene neighborhoods (427)
  • CM — Cancer modules (431)
C5 — Ontology gene sets (16,228 gene sets)
  • GO — Gene Ontology (10,480)
    • GO:BP — Biological Process (7,583)
    • GO:CC — Cellular Component (1,042)
    • GO:MF — Molecular Function (1,855)
  • HPO — Human Phenotype Ontology (5,748)
C6 — Oncogenic signatures (189 gene sets)

Gene sets capturing oncogene activation and cancer-relevant programs.

C7 — Immunologic signatures (5,219 gene sets)
  • IMMUNESIGDB — ImmuneSigDB (4,872)
  • VAX — Vaccine response signatures (347)
C8 — Cell type signatures (866 gene sets)

Reference gene sets for cell identity and composition analyses.

Note. Counts shown above reflect the breakdown provided for the human MSigDB collections. For the latest tallies and updates, see the official MSigDB site: MSigDB Human Gene Sets.

Extracting Gene Sets with extract_msigdb_genes

Purpose. extract_msigdb_genes() pulls gene sets from the Molecular Signatures Database (MSigDB) and returns clean vectors or named lists ready for enrichment analyses (GSEA/fgsea, clusterProfiler) and figure workflows. You can search by exact set names (e.g., HALLMARK_APOPTOSIS) or by a broad pattern (e.g., "apoptosis"), restrict to specific collections (Hallmark, KEGG, Reactome, GO), choose your preferred ID type (SYMBOL/ENTREZID/ENSEMBL), and optionally convert across organisms (human ↔ mouse).

Quick usage


# Hallmark (human, SYMBOLs)
apoptosis_genes <- extract_msigdb_genes(
  gene_sets = "HALLMARK_APOPTOSIS",
  organism = "human",
  gene_id_type = "SYMBOL"
)

# KEGG glycolysis, restricting to canonical pathways (C2 > CP:KEGG)
kegg_glycolysis <- extract_msigdb_genes(
  gene_sets = "GLYCOLYSIS",
  search_all_collections = FALSE,
  collection = "C2",
  subcollection = "CP:KEGG"
)

# Reactome WNT, ENTREZ IDs
reactome_wnt <- extract_msigdb_genes(
  gene_sets = "SIGNALING_BY_WNT",
  gene_id_type = "ENTREZID",
  search_all_collections = FALSE,
  collection = "C2",
  subcollection = "CP:REACTOME"
)

# Mouse Hallmark inflammatory response
mouse_inflammation <- extract_msigdb_genes(
  gene_sets = "HALLMARK_INFLAMMATORY_RESPONSE",
  organism = "mouse"
)

# Cross-species conversion (human -> mouse)
mouse_emt <- extract_msigdb_genes(
  gene_sets = "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION",
  organism = "human",
  convert_organism = TRUE,
  target_organism = "mouse"
)

# Multiple sets + metadata inspection
sets    <- c("HALLMARK_P53_PATHWAY", "HALLMARK_HYPOXIA", "KEGG_CELL_CYCLE")
results <- extract_msigdb_genes(sets)
meta    <- attr(results, "metadata")

# Summarize how many genes per set (requires dplyr installed)
dplyr::bind_rows(lapply(meta, as.data.frame), .id = "set_label")

Parameters

Parameter Type Default Description
gene_sets character Exact names (e.g., "HALLMARK_APOPTOSIS") or case-insensitive patterns (e.g., "APOPTOSIS"). Single or vector.
organism character "human" "human" or "mouse" (maps to Homo sapiens/Mus musculus in msigdbr).
gene_id_type character "SYMBOL" Output identifiers: "SYMBOL", "ENTREZID", or "ENSEMBL".
convert_organism logical FALSE If TRUE, converts the result to target_organism via BioMart homology.
target_organism character NULL Required when convert_organism = TRUE; choose "human" or "mouse".
search_all_collections logical TRUE Search across all MSigDB collections. Set to FALSE to restrict using collection/subcollection.
collection character NULL One of H, C1, C2, C3, C4, C5, C6, C7, C8 (when restricting).
subcollection character NULL Optional subcategory (e.g., CP:KEGG, CP:REACTOME, GO:BP).
export_csv logical FALSE If TRUE, writes each set to CSV in output_dir.
output_dir character "MSigDB_gene_lists/" Destination for CSV export (created if missing).

Return value

  • Single match: a character vector of gene identifiers (format per gene_id_type).
  • Multiple matches: a named list of vectors. Rich metadata is attached at attr(result, "metadata"): original_query, gene_set_name, collection, subcollection, n_genes, organism, gene_id_type, query_date.

# Example: collect metadata into a data frame
res  <- extract_msigdb_genes(c("HALLMARK_APOPTOSIS", "HALLMARK_HYPOXIA", "REACTOME_TCA_CYCLE"))
meta <- attr(res, "metadata")
summary_df <- dplyr::bind_rows(lapply(meta, as.data.frame), .id = "set_label")
summary_df

Practical notes

  • Pattern matching. If your query is a broad pattern (e.g., "CELL_CYCLE"), the function lists all matches and prefers exact set names when available. Otherwise, it selects the first match and reports alternatives.
  • ID formats. Choose the ID type that fits your downstream tools: SYMBOL for readability, ENTREZ for many enrichment functions, ENSEMBL for genomic work.
  • Cross-species. Homology mapping (human ↔ mouse) uses BioMart; not all genes map 1:1, so expect some dropouts or merges.
  • Exports. Set export_csv = TRUE to archive per-set CSVs alongside the analysis.
  • Versions & licensing. Uses MSigDB via msigdbr (v2023.2+). Academic use is generally free; some curated sources (e.g., KEGG) may require a commercial license—check the MSigDB site.

See also

References

  • Liberzon A. et al. The MSigDB hallmark gene set collection. Cell Systems. 2015;1(6):417–425.
  • Subramanian A. et al. Gene set enrichment analysis. PNAS. 2005;102(43):15545–15550.

Implementation note. Cross-species conversion in examples assumes a helper like convert_gene_list() is available in your codebase (BioMart-backed). Adjust to your preferred orthology strategy as needed.

About

A lightweight and simple pipeline for extracting gene lists of interest and generating aggregate gene expression lineplots between conditions/treatments over time

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published