This repository accompanies the preprint Dissecting regulatory syntax in human development with scalable multiomics and deep learning (Liu*, Jessa*, Kim*, Ng*, ..., Kundaje+, Farh+, Greenleaf+, bioRxiv, 2025).
* equal contribution
+ co-corresponding
- The repository is on GitHub here and you can view a rendered version here
- This repository primarily contains code; see the Data availability section for links to data, and our documentation here for download instructions and explanations of data formats
- Jump to the Code to reproduce figures section for links to code and rendered HTMLs for analysis presented in each figure
This repository is meant to enhance the Materials & Methods section by providing code for the custom analyses in the manuscript, in order to improve reproducibility for the main results. However, it is not a fully executable workflow.
- code --> pipelines, scripts, and analysis notebooks for data processing and analysis
  - utils --> contains .R files with custom functions and palettes used throughout the analysis
  - 01-preprocessing
    - 01-snakemake --> config files for processing raw bcl files into fragment files and count matrices
    - 02-archr_seurat_scripts --> per organ preprocessing scripts to create final Seurat objects and ArchR projects
    - 03-global --> creating global objects (e.g. global peak set, marker genes)
  - 02-global_analysis
    - 01 --> global QC and metadata visualizations per organ and per sample
    - 02, 03 --> construction of dendrogram on cell type similarity
    - 04 --> calculate TF expression levels
  - 03-chrombpnet
    - detailed README here
    - 00 --> prepare inputs for training ChromBPNet models
    - 01 --> train and interpret ChromBPNet models
    - 02 --> assembly of motif compendium/lexicon
    - 03 --> downstream analysis of ChromBPNet models and motif syntax/synergy
  - 04-enhancers
    - 01 --> export global accessible candidate cis-regulatory elements (acCREs)
    - 02 --> convert fragment files to tagalign for running Activity-By-Contact model (ABC)
    - 03 --> ABC workflow config files
    - 04 --> acCREs co-accessibility analysis
    - 05 --> acCREs peak-to-gene linkage analysis
    - 06 --> acCREs ABC enhancer-to-promoter linkage analysis
    - 07 --> overlap of HDMA acCREs with ENCODE v4 cCREs
    - 08, 09 --> overlap of HDMA acCREs with VISTA enhancers
  - 05-misc
    - 01 --> create global BPCells object
    - 02 --> examples for plotting tracks using BPCells
    - 04 --> examples for ChromBPNet use cases, including how to load models and make predictions
  - 06-variants
    - 00 to 03 --> analysis related to eQTLs
    - 04 to 05 --> causal variant analysis with gchromvar
    - 06 --> variant scoring using ChromBPNet models
    - 07a to 07c --> plot variant scoring results
Code to reproduce analyses is saved in code. This table contains pointers to code for the key analyses associated with each figure.
The links in the Analysis column lead to rendered HTMLs, where possible, and the links in the Path column lead to scripts or notebooks within the repository.
All data and analysis products (including fragment files, counts matrices, cell annotations, global acCRE annotations, ChromBPNet models, motif lexicon, motif instances, and genomic tracks) are deposited at https://zenodo.org/communities/hdma. A list of all data types and the corresponding URL and DOI is provided in Table S14 of the manuscript.
We provide a detailed description of the main data types deposited on Zenodo here, along with a demonstration of how to programmatically download files of interest.
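As a rough illustration of what programmatic download can look like, here is a minimal Python sketch. It is not the workflow from our documentation. It assumes that record metadata is fetched from the public Zenodo REST API and that each record's `files` list carries a `key` (file name) and a `links.self` download URL. The helper names, the example record, and its file names are hypothetical and exist only to show the filtering step.

```python
import fnmatch
import json
import urllib.request

# Public Zenodo REST endpoint for record metadata.
ZENODO_API = "https://zenodo.org/api/records"


def fetch_record(record_id):
    """Fetch the JSON metadata for one Zenodo record (needs network access)."""
    with urllib.request.urlopen(f"{ZENODO_API}/{record_id}") as resp:
        return json.load(resp)


def select_files(record, pattern):
    """Return (file name, download URL) pairs whose name matches a glob pattern.

    Assumes the record dict has a `files` list with `key` and `links.self`.
    """
    return [
        (f["key"], f["links"]["self"])
        for f in record.get("files", [])
        if fnmatch.fnmatchcase(f["key"], pattern)
    ]


def download(url, dest):
    """Save one file to disk (needs network access)."""
    urllib.request.urlretrieve(url, dest)


# Hypothetical record with made-up file names and URLs, used here so the
# filtering logic can be run offline.
example_record = {
    "files": [
        {"key": "heart_fragments.tsv.gz", "links": {"self": "https://example.org/a"}},
        {"key": "liver_fragments.tsv.gz", "links": {"self": "https://example.org/b"}},
        {"key": "heart_counts.h5", "links": {"self": "https://example.org/c"}},
    ]
}

selected = select_files(example_record, "heart_*")
for name, url in selected:
    print(name, url)
```

With a real record, `fetch_record` would be called with a record ID taken from Table S14, and `download` would then be applied to each selected URL.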
All genomic tracks are also hosted online for interactive visualization with the WashU Genome Browser at this link: https://epigenomegateway.wustl.edu/browser2022/?genome=hg38&hub=https://human-dev-multiome-atlas.s3.amazonaws.com/tracks/HDMA_trackhub.json. We demonstrate how to load tracks here.
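The browser link above is a base URL followed by a `genome` and a `hub` query parameter. The short Python sketch below rebuilds it from those parts, which may be convenient when generating links in a script. The function name `browser_url` is hypothetical, and the sketch only mirrors the URL pattern shown above rather than any documented browser API.

```python
# Base URL and track hub location, copied from the link above.
BROWSER_BASE = "https://epigenomegateway.wustl.edu/browser2022/"
HDMA_HUB = "https://human-dev-multiome-atlas.s3.amazonaws.com/tracks/HDMA_trackhub.json"


def browser_url(hub_url, genome="hg38"):
    """Compose a browser link using the same pattern as the link above."""
    return f"{BROWSER_BASE}?genome={genome}&hub={hub_url}"


url = browser_url(HDMA_HUB)
print(url)
```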
We provide a few notebooks with examples of how to interact with HDMA data, analysis outputs, and trained models:
- How to download specific files or data for specific cell types from across the Zenodo records: DATA.md (html)
- Plotting genomic tracks using BPCells: code/05-misc/02-bp_cells_plotting_examples.Rmd (html)
- Use cases for ChromBPNet models and outputs, including visualizing predicted accessibility and contribution scores at a region of interest, loading models, making new predictions, and predicting variant effect: code/05-misc/04-ChromBPNet_use_cases.ipynb (html)
If you use this data or code, please cite:
Dissecting regulatory syntax in human development with scalable multiomics and deep learning. Betty B. Liu, Selin Jessa, Samuel H. Kim, Yan Ting Ng, Soon il Higashino, Georgi K. Marinov, Derek C. Chen, Benjamin E. Parks, Li Li, Tri C. Nguyen, Sean K. Wang, Austin T. Wang, Serena Y. Tan, Michael Kosicki, Len A. Pennacchio, Eyal Ben-David, Anca M. Pasca, Anshul Kundaje, Kyle K.H. Farh, William J. Greenleaf, bioRxiv 2025.04.30.651381; doi: https://doi.org/10.1101/2025.04.30.651381