aggregate-gene-expression-hshim1

Analysis and pipeline prepared by: Ha-Na Shim

Contact: hshim1@outlook.com for questions, feedback, or suggestions regarding this workflow.

Tools / Packages Used

About this repository

A lightweight and simple workflow/pipeline in R for extracting gene lists of interest from the gene ontology (GO) and MSigDB collections, as well as visualization of aggregate gene expression of a group of genes between conditions/treatments over time.

Required packages / dependencies

This workflow uses a small set of CRAN and Bioconductor packages.

Package	Source	Package	Source
`pheatmap`	CRAN	`RColorBrewer`	CRAN
`dplyr`	CRAN	`ggplot2`	CRAN
`tidyverse`	CRAN	`patchwork`	CRAN
`matrixStats`	CRAN	`biomaRt`	Bioconductor
`clusterProfiler`	Bioconductor	`org.Hs.eg.db`	Bioconductor
`org.Mm.eg.db`	Bioconductor	`BiocManager`	CRAN

BiocManager::install(c(
  "pheatmap",
  "RColorBrewer",
  "dplyr",
  "ggplot2",
  "tidyverse",
  "patchwork",
  "matrixStats",
  "biomaRt",
  "clusterProfiler",
  "org.Hs.eg.db",
  "org.Mm.eg.db"
))

Repository structure

The structure of this repository and location of R scripts for functions used in this workflow are shown below.


.
├─ functions/
│  ├─ extract_GO_genes.R
│  ├─ extract_msigdb_genes.R
│  ├─ compile_gene_lists.R
│  ├─ create_aggregate_lineplot.R
│  ├─ create_clustered_heatmap.R
│  └─ compiled_functions_hshim1_v3.R
│
├─ example_workflow/
│  ├─ example_workflow.Rmd
│  └─ example_workflow.html   (knitted output)
│
└─ extracted_gene_lists/
   └─ combined_muscle_gene_lists.csv

Folders & key files

functions/ — self-contained functions used across the workflow.
- extract_GO_genes.R — retrieve gene sets for Gene Ontology (GO) terms (supports descendants via GO hierarchy; Ensembl BioMart backend).
- extract_msigdb_genes.R — fetch gene sets from MSigDB (Hallmark, KEGG, Reactome, GO, etc.) with flexible ID types and optional cross-species mapping.
- compile_gene_lists.R — consolidate/merge curated gene lists (e.g., combine muscle-related sets into a single reference CSV or list object).
- create_aggregate_lineplot.R — generate aggregate time-course/condition line plots from gene sets (e.g., mean/median expression with confidence bands).
- create_clustered_heatmap.R — build clustered heatmaps for selected gene panels with publication-grade settings.
- compiled_functions_hshim1_v3.R — convenience script bundling all functions into one file for quick sourcing in fresh environments.
example_workflow/ — minimal, runnable example of the pipeline.
- example_workflow.Rmd — end-to-end vignette demonstrating how to load functions, fetch gene sets, and produce figures.
- example_workflow.html — knitted HTML output for quick preview of results without running the code.
extracted_gene_lists/ — precomputed/reference gene lists.
- combined_muscle_gene_lists.csv — compiled muscle-related gene set used throughout the figures/examples.

Quick start (source functions & run the example)

Install dependencies (CRAN + Bioconductor), see the “Required packages / dependencies” section.

In a fresh R session, source either individual functions or the bundled script:


# Option A: source only what you need
source("functions/extract_GO_genes.R")
source("functions/extract_msigdb_genes.R")
source("functions/create_clustered_heatmap.R")
Option B: source all at once

source("functions/compiled_functions_hshim1_v3.R")

Knit or open the example:


rmarkdown::render("example_workflow/example_workflow.Rmd")  # produces example_workflow.html

Notes. Figures and intermediate outputs are not committed by default if your .gitignore excludes them; adjust as needed. Consider using a project-local environment manager (e.g., renv) for reproducible package versions.

Part 1: Creating lists of genes of interest

Extracting GO Gene Sets with `extract_GO_genes`

extract_GO_genes() provides functionality to retrieve gene lists associated with one or more Gene Ontology (GO) terms via Ensembl BioMart. It can optionally expand each term to include descendant terms in the GO hierarchy, supports BP / MF / CC ontologies, and returns results as a list, vector, or data.frame.


# Example: Biological process "muscle structure development" (GO:0061061)
muscle_development <- extract_GO_genes(
  "GO:0061061",
  include_descendants = TRUE,
  organism = "human",
  ontology = "BP",
  output_format = "list",
  export_csv = FALSE,
  output_dir = "GO_gene_lists/"
)

Function overview

What it does: Connects to Ensembl BioMart and fetches genes annotated to the specified GO term(s). If include_descendants = TRUE, it expands the query to all descendant terms using GO.db and deduplicates gene symbols.
Why it’s useful: Produces comprehensive, up-to-date gene sets for biological concepts (e.g., “cell cycle”, “muscle development”).
Performance note: Requires an internet connection. Large GO terms may take longer depending on Ensembl server availability.

Parameters

Parameter	Type	Default	Description
`go_terms`	character	—	One or more GO IDs (e.g., "GO:0061061").
`organism`	character	"human"	"human" or "mouse". Sets Ensembl dataset.
`include_descendants`	logical	TRUE	If TRUE, expands to include all descendant terms.
`ontology`	character	"BP"	GO domain: "BP" (Biological Process), "MF", or "CC".
`output_format`	character	"list"	"list" (default), "vector" (single GO), or "data.frame".
`export_csv`	logical	FALSE	If TRUE, exports results as CSVs in `output_dir`.
`output_dir`	character	"GO_gene_lists/"	Directory for CSV exports (created if missing).

Return Value

Single term + "vector" → character vector of gene symbols
Single/multiple + "list" → named list of gene vectors (with metadata in attr(result, "metadata"))
"data.frame" → table with columns: gene_symbol, go_label, go_id, go_name

References:
Ashburner et al., Nat Genet, 2000.
The Gene Ontology Consortium, Nucleic Acids Res, 2019.

MSigDB: Curated Gene Sets for Enrichment Analyses

The Molecular Signatures Database (MSigDB) is a widely used collection of expert-curated and community-contributed gene sets designed for pathway and enrichment analyses (e.g., GSEA, fgsea, clusterProfiler). It organizes thousands of gene sets into thematic collections—Hallmark biology, curated pathways (KEGG, Reactome, WikiPathways), regulatory targets, ontology terms, and more—so you can move from a list of differentially expressed genes to interpretable biology with confidence. Explore the human collections here: gsea-msigdb.org › msigdb › human › genesets .

In practice, MSigDB shines when you need clean, reproducible definitions of pathways or biological programs—whether that’s a compact Hallmark signature for a figure, or a comprehensive GO category for a hypothesis test. The structure below shows how the database is organized; each collection groups gene sets with a shared purpose or source.

MSigDB Collections (Human)

H — Hallmark gene sets (50 gene sets)

Compact, non-redundant “consensus” signatures that capture major biological states and processes.

C1 — Positional gene sets (302 gene sets)

Genes grouped by chromosomal location.

Chromosomes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT

C2 — Curated gene sets (7,561 gene sets)

CGP — Chemical & genetic perturbations (3,538)
CP — Canonical pathways (4,023), including:
- CP:BIOCARTA (292)
- CP:KEGG_MEDICUS (658)
- CP:PID (196)
- CP:REACTOME (1,787)
- CP:WIKIPATHWAYS (885)
- CP:KEGG_LEGACY (186)

C3 — Regulatory gene sets (3,713 gene sets)

MIR — microRNA targets (2,598)
- MIR:MIRDB (2,377)
- MIR:MIR_LEGACY (221)
TFT — Transcription factor targets (1,115)
- TFT:GTRD (505)
- TFT:TFT_LEGACY (610)

C4 — Computational gene sets (1,006 gene sets)

3CA — Curated Cancer Cell Atlas (148)
CGN — Cancer gene neighborhoods (427)
CM — Cancer modules (431)

C5 — Ontology gene sets (16,228 gene sets)

GO — Gene Ontology (10,480)
- GO:BP — Biological Process (7,583)
- GO:CC — Cellular Component (1,042)
- GO:MF — Molecular Function (1,855)
HPO — Human Phenotype Ontology (5,748)

C6 — Oncogenic signatures (189 gene sets)

Gene sets capturing oncogene activation and cancer-relevant programs.

C7 — Immunologic signatures (5,219 gene sets)

IMMUNESIGDB — ImmuneSigDB (4,872)
VAX — Vaccine response signatures (347)

C8 — Cell type signatures (866 gene sets)

Reference gene sets for cell identity and composition analyses.

Note. Counts shown above reflect the breakdown provided for the human MSigDB collections. For the latest tallies and updates, see the official MSigDB site: MSigDB Human Gene Sets.

Extracting Gene Sets with `extract_msigdb_genes`

Purpose. extract_msigdb_genes() pulls gene sets from the Molecular Signatures Database (MSigDB) and returns clean vectors or named lists ready for enrichment analyses (GSEA/fgsea, clusterProfiler) and figure workflows. You can search by exact set names (e.g., HALLMARK_APOPTOSIS) or by a broad pattern (e.g., "apoptosis"), restrict to specific collections (Hallmark, KEGG, Reactome, GO), choose your preferred ID type (SYMBOL/ENTREZID/ENSEMBL), and optionally convert across organisms (human ↔ mouse).

Quick usage


# Hallmark (human, SYMBOLs)
apoptosis_genes <- extract_msigdb_genes(
  gene_sets = "HALLMARK_APOPTOSIS",
  organism = "human",
  gene_id_type = "SYMBOL"
)

# KEGG glycolysis, restricting to canonical pathways (C2 > CP:KEGG)
kegg_glycolysis <- extract_msigdb_genes(
  gene_sets = "GLYCOLYSIS",
  search_all_collections = FALSE,
  collection = "C2",
  subcollection = "CP:KEGG"
)

# Reactome WNT, ENTREZ IDs
reactome_wnt <- extract_msigdb_genes(
  gene_sets = "SIGNALING_BY_WNT",
  gene_id_type = "ENTREZID",
  search_all_collections = FALSE,
  collection = "C2",
  subcollection = "CP:REACTOME"
)

# Mouse Hallmark inflammatory response
mouse_inflammation <- extract_msigdb_genes(
  gene_sets = "HALLMARK_INFLAMMATORY_RESPONSE",
  organism = "mouse"
)

# Cross-species conversion (human -> mouse)
mouse_emt <- extract_msigdb_genes(
  gene_sets = "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION",
  organism = "human",
  convert_organism = TRUE,
  target_organism = "mouse"
)

# Multiple sets + metadata inspection
sets    <- c("HALLMARK_P53_PATHWAY", "HALLMARK_HYPOXIA", "KEGG_CELL_CYCLE")
results <- extract_msigdb_genes(sets)
meta    <- attr(results, "metadata")

# Summarize how many genes per set (requires dplyr installed)
dplyr::bind_rows(lapply(meta, as.data.frame), .id = "set_label")

Parameters

Parameter	Type	Default	Description
`gene_sets`	character	—	Exact names (e.g., `"HALLMARK_APOPTOSIS"`) or case-insensitive patterns (e.g., `"APOPTOSIS"`). Single or vector.
`organism`	character	`"human"`	`"human"` or `"mouse"` (maps to Homo sapiens/Mus musculus in `msigdbr`).
`gene_id_type`	character	`"SYMBOL"`	Output identifiers: `"SYMBOL"`, `"ENTREZID"`, or `"ENSEMBL"`.
`convert_organism`	logical	`FALSE`	If `TRUE`, converts the result to `target_organism` via BioMart homology.
`target_organism`	character	NULL	Required when `convert_organism = TRUE`; choose `"human"` or `"mouse"`.
`search_all_collections`	logical	`TRUE`	Search across all MSigDB collections. Set to `FALSE` to restrict using `collection`/`subcollection`.
`collection`	character	NULL	One of `H`, `C1`, `C2`, `C3`, `C4`, `C5`, `C6`, `C7`, `C8` (when restricting).
`subcollection`	character	NULL	Optional subcategory (e.g., `CP:KEGG`, `CP:REACTOME`, `GO:BP`).
`export_csv`	logical	`FALSE`	If `TRUE`, writes each set to CSV in `output_dir`.
`output_dir`	character	`"MSigDB_gene_lists/"`	Destination for CSV export (created if missing).

Return value

Single match: a character vector of gene identifiers (format per gene_id_type).
Multiple matches: a named list of vectors. Rich metadata is attached at attr(result, "metadata"): original_query, gene_set_name, collection, subcollection, n_genes, organism, gene_id_type, query_date.


# Example: collect metadata into a data frame
res  <- extract_msigdb_genes(c("HALLMARK_APOPTOSIS", "HALLMARK_HYPOXIA", "REACTOME_TCA_CYCLE"))
meta <- attr(res, "metadata")
summary_df <- dplyr::bind_rows(lapply(meta, as.data.frame), .id = "set_label")
summary_df

Practical notes

Pattern matching. If your query is a broad pattern (e.g., "CELL_CYCLE"), the function lists all matches and prefers exact set names when available. Otherwise, it selects the first match and reports alternatives.
ID formats. Choose the ID type that fits your downstream tools: SYMBOL for readability, ENTREZ for many enrichment functions, ENSEMBL for genomic work.
Cross-species. Homology mapping (human ↔ mouse) uses BioMart; not all genes map 1:1, so expect some dropouts or merges.
Exports. Set export_csv = TRUE to archive per-set CSVs alongside the analysis.
Versions & licensing. Uses MSigDB via msigdbr (v2023.2+). Academic use is generally free; some curated sources (e.g., KEGG) may require a commercial license—check the MSigDB site.

References

Liberzon A. et al. The MSigDB hallmark gene set collection. Cell Systems. 2015;1(6):417–425.
Subramanian A. et al. Gene set enrichment analysis. PNAS. 2005;102(43):15545–15550.

Implementation note. Cross-species conversion in examples assumes a helper like convert_gene_list() is available in your codebase (BioMart-backed). Adjust to your preferred orthology strategy as needed.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

aggregate-gene-expression-hshim1

Tools / Packages Used

About this repository

Required packages / dependencies

Repository structure

Folders & key files

Option B: source all at once

Part 1: Creating lists of genes of interest

Extracting GO Gene Sets with `extract_GO_genes`

Parameters

Return Value

MSigDB: Curated Gene Sets for Enrichment Analyses

MSigDB Collections (Human)

Extracting Gene Sets with `extract_msigdb_genes`

Quick usage

Parameters

Return value

Practical notes

See also

References

About

Uh oh!

Releases

Packages

hshim1/aggregate-gene-expression

Folders and files

Latest commit

History

Repository files navigation

aggregate-gene-expression-hshim1

Tools / Packages Used

About this repository

Required packages / dependencies

Repository structure

Folders & key files

Option B: source all at once

Part 1: Creating lists of genes of interest

Extracting GO Gene Sets with extract_GO_genes

Parameters

Return Value

MSigDB: Curated Gene Sets for Enrichment Analyses

MSigDB Collections (Human)

Extracting Gene Sets with extract_msigdb_genes

Quick usage

Parameters

Return value

Practical notes

See also

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Extracting GO Gene Sets with `extract_GO_genes`

Extracting Gene Sets with `extract_msigdb_genes`

Packages