Releases: epigen/enrichment_analysis
v2.0.0 - Snakemake 8 compatible
Breaking change: Requires Snakemake >= v8.20.1
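A workflow can fail early when the installed Snakemake is too old. Inside a Snakefile, Snakemake offers `snakemake.utils.min_version` for this. The sketch below only illustrates the underlying comparison in plain Python, so it runs without Snakemake installed. The helper names `parse_version` and `meets_minimum` are hypothetical and not part of this workflow, and only plain numeric versions such as `8.20.1` are handled.

```python
# Minimal sketch: compare a version string against the required minimum.
# parse_version and meets_minimum are hypothetical helper names.

REQUIRED = "8.20.1"  # minimum Snakemake version stated in the release note


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v8.20.1' or '8.20.1' into (8, 20, 1); numeric parts only."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def meets_minimum(installed: str, required: str = REQUIRED) -> bool:
    """Compare versions numerically, not as strings ('8.9' is older than '8.20')."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    for candidate in ("7.32.4", "8.9.0", "8.20.1", "9.0.0"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating `8.9.0` as newer than `8.20.1`.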
Full Changelog: v1.0.1...v2.0.0
v1.0.1 - bug fixes and exception handling
Bug fixes and exception handling.
Full Changelog: v1.0.0...v1.0.1
v1.0.0 - stable version with new features, complete docs and examples
Features
- Enrichment Analysis Methods:
  - Region Set Analysis:
    - LOLA: Genomic Locus Overlap Enrichment Analysis.
    - GREAT: Genomic Regions Enrichment of Annotations Tool using rGREAT.
    - pycisTarget: Motif enrichment analysis in region sets to identify high-confidence transcription factor (TF) cistromes.
  - Gene Set Analysis:
    - Over-representation Analysis (ORA): Using GSEApy's enrich() function.
    - RcisTarget: Motif enrichment analysis in gene sets to identify high-confidence TF cistromes.
  - Region-based Gene Set Analysis:
    - Region-gene associations obtained using (r)GREAT.
    - Complementary ORA using GSEApy and TFBS motif enrichment analysis using RcisTarget.
  - Preranked Gene Set Analysis:
    - Preranked GSEA using GSEApy's prerank() function.
- Database Support:
  - Local databases for GSEApy and (r)GREAT:
    - GMT files, e.g., from MSigDB or Enrichr.
    - (custom) JSON file support.
  - LOLA databases from LOLA Region Databases or custom-created.
  - cisTarget databases for pycisTarget and RcisTarget.
- Group Aggregation:
  - Aggregation of results per method and database.
  - Filtered aggregation retaining only statistically significant terms.
- Visualization:
  - Enrichment dot plots for each query, method, and database combination.
  - Hierarchically clustered heatmaps and bubble plots for group summaries.
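The group aggregation and its filtered variant listed above can be illustrated with a small sketch. It assumes that each query yields a mapping from term to adjusted p-value, that a term is kept when it is significant in at least one query of the group (the union across queries), and that 0.05 is the cutoff. The function names, the data layout, and the threshold are assumptions made for illustration, not the workflow's actual code or configuration.

```python
from typing import Optional

# Assumed significance cutoff; the workflow's actual setting may differ.
SIGNIFICANCE_THRESHOLD = 0.05


def aggregate_group(
    results: dict[str, dict[str, float]]
) -> dict[str, dict[str, Optional[float]]]:
    """Build a term-by-query table; terms absent from a query get None."""
    terms = sorted({term for per_query in results.values() for term in per_query})
    return {
        term: {query: per_query.get(term) for query, per_query in results.items()}
        for term in terms
    }


def filter_significant(
    table: dict[str, dict[str, Optional[float]]],
    threshold: float = SIGNIFICANCE_THRESHOLD,
) -> dict[str, dict[str, Optional[float]]]:
    """Keep terms that are significant in at least one query (the union)."""
    return {
        term: row
        for term, row in table.items()
        if any(p is not None and p <= threshold for p in row.values())
    }


# Hypothetical adjusted p-values for two queries of one group.
example = {
    "queryA": {"term1": 0.001, "term2": 0.4},
    "queryB": {"term2": 0.2, "term3": 0.03},
}
aggregated = aggregate_group(example)
filtered = filter_significant(aggregated)
print(sorted(filtered))
```

In this toy input, `term2` is dropped from the filtered table because it is not significant in either query, while `term1` and `term3` are each kept on the strength of a single query.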
Documentation
- Usage Instructions:
  - Steps to download relevant databases and configure the analysis.
  - Commands for running the workflow and generating reports.
- Examples: Provided example queries and databases with instructions for running a complete analysis.
- Links and Resources:
  - GitHub repository, Zenodo repository, and Snakemake Workflow Catalog entry.
  - Recommended compatible MR.PARETO modules for upstream processing and analyses.
  - Web versions of some tools and databases for region/gene sets.
Beware: All packages were updated to their latest versions, so results might differ from those of earlier releases. Rerunning previous analyses is recommended where possible. The workflow's functionality expanded significantly, which introduced many changes, especially in the configuration.
Thanks to early adopters @dariarom94, @Rubbert, and @bednarsky for testing and providing constructive feedback.
Bug fixes and performance improvements are not listed here.
Full Changelog: v0.1.1...v1.0.0
v0.1.1 - small improvements, documentation and citation information
v0.1.0 - stable version with complete docs and examples
features
- enrichment analysis methods
  - region-sets
  - gene-sets
    - over-representation analysis (ORA) using GSEApy's enrich() function, which performs Fisher’s exact test (i.e., the hypergeometric test) and runs locally.
    - preranked gene-set enrichment analysis (preranked GSEA) using GSEApy's prerank() function, which runs locally.
Note: All genomic region sets are subjected to gene-set ORA, leveraging the region-gene associations of each query and background region-set obtained using GREAT. This provides an extended region-set enrichment perspective by querying databases that are not supported by region-based tools.
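The hypergeometric test behind ORA can be sketched with the standard library alone. The function below computes the upper-tail probability of observing at least the given overlap between a query gene set and a database term, which corresponds to a one-sided Fisher's exact test. It is an illustrative sketch, not GSEApy's implementation; the name `ora_p_value` and its parameters are hypothetical, and multiple-testing correction is left out.

```python
from math import comb


def ora_p_value(
    overlap: int, query_size: int, term_size: int, background_size: int
) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of term genes found when drawing query_size genes
    without replacement from a background of background_size genes,
    of which term_size belong to the term.
    """
    max_overlap = min(query_size, term_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, query_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, a term with 5 genes,
# a query of 4 genes, 3 of which belong to the term.
p = ora_p_value(overlap=3, query_size=4, term_size=5, background_size=20)
print(f"{p:.4f}")
```

The choice of background matters: the same overlap yields a different p-value when the background is the whole genome than when it is restricted to genes associated with the background region set.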
- resources (databases) for both gene-based analyses are either downloaded (Enrichr) or copied from local JSON or GMT files.
  - all Enrichr databases can be queried (enrichr_dbs).
  - local JSON database files can be queried (local_json_dbs).
  - local GMT database files (e.g., from MSigDB) can be queried (local_gmt_dbs).
- group aggregation of results per method and database
  - results of all queries belonging to the same group are aggregated per method and database.
  - a filtered version taking the union of all statistically significant terms per query is also saved.
- visualization
  - region/gene-set specific enrichment dot plots are generated for each query, method, and database combination. The top terms are ranked (along the y-axis) by the mean rank of statistical significance, effect-size, and overlap to make the results more balanced and interpretable.
  - group summary/overview
    - the union of the most significant terms per query, method, and database within a group is determined.
    - their effect-size and statistical significance are visualized as hierarchically clustered heatmaps.
    - a hierarchically clustered bubble plot encoding both effect-size and significance is provided.
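The mean-rank ordering used for the dot plots can be sketched as follows. The sketch assumes that smaller p-values, larger effect sizes, and larger overlaps are each "better", and that ties are broken by input order. The field names (`p_value`, `effect_size`, `overlap`) and the function names are hypothetical, chosen for illustration rather than taken from the workflow.

```python
def rank_positions(values: list[float], reverse: bool) -> list[int]:
    """Ordinal ranks (1 = best); ties keep their input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def order_terms_by_mean_rank(terms: list[dict]) -> list[tuple[str, float]]:
    """Return (term name, mean rank) pairs, best term first."""
    significance = rank_positions([t["p_value"] for t in terms], reverse=False)
    effect = rank_positions([t["effect_size"] for t in terms], reverse=True)
    overlap = rank_positions([t["overlap"] for t in terms], reverse=True)
    mean_ranks = [(s + e + o) / 3 for s, e, o in zip(significance, effect, overlap)]
    ordered = sorted(zip(mean_ranks, terms), key=lambda pair: pair[0])
    return [(term["name"], mean_rank) for mean_rank, term in ordered]


# Hypothetical enrichment results for three terms.
terms = [
    {"name": "term_a", "p_value": 0.001, "effect_size": 2.0, "overlap": 5},
    {"name": "term_b", "p_value": 0.010, "effect_size": 4.0, "overlap": 12},
    {"name": "term_c", "p_value": 0.020, "effect_size": 1.5, "overlap": 3},
]
ranking = order_terms_by_mean_rank(terms)
print(ranking)
```

Here `term_b` ends up first even though `term_a` has the smallest p-value, because it ranks best on both effect size and overlap. This is the balancing effect the mean rank is meant to provide.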
documentation
- complete documentation of used software, all features, and methods
- a minimal example to test all supported features
- external resources