Code & documentation for Dissecting regulatory syntax in human development with scalable multiomics and deep learning (Liu*, Jessa*, Kim*, Ng*, et al, bioRxiv, 2025).

GreenleafLab/HDMA

Human Development Multiomic Atlas

This repository accompanies the preprint Dissecting regulatory syntax in human development with scalable multiomics and deep learning (Liu*, Jessa*, Kim*, Ng*, ..., Kundaje+, Farh+, Greenleaf+, bioRxiv, 2025).

* equal contribution
+ co-corresponding

  • The repository is on GitHub here and you can view a rendered version here
  • This repository contains primarily code. See the Data availability section for links to data, and our documentation here for download instructions and explanations of data formats
  • Jump to the Code to produce the figures section for links to code and rendered HTMLs for the analyses presented in each figure

Codebase

This repository supplements the Materials & Methods section by providing code for the custom analyses in the manuscript, to improve reproducibility of the main results. However, it is not a fully executable workflow.

  • code --> pipelines, scripts, and analysis notebooks for data processing and analysis
    • utils --> contains .R files with custom functions and palettes used throughout the analysis
    • 01-preprocessing
      • 01-snakemake --> config files for processing raw bcl files into fragment files and count matrices
      • 02-archr_seurat_scripts --> per organ preprocessing scripts to create final Seurat objects and ArchR projects
      • 03-global --> creating global objects (e.g. global peak set, marker genes)
    • 02-global_analysis
      • 01 --> global QC and metadata visualizations per organ and per sample
      • 02, 03 --> construction of dendrogram on cell type similarity
      • 04 --> calculate TF expression levels
    • 03-chrombpnet
      • detailed README here
      • 00 --> prepare inputs for training ChromBPNet models
      • 01 --> train and interpret ChromBPNet models
      • 02 --> assembly of motif compendium/lexicon
      • 03 --> downstream analysis of ChromBPNet models and motif syntax/synergy
    • 04-enhancers
      • 01 --> export global accessible candidate cis-regulatory elements (acCREs)
      • 02 --> convert fragment files to tagalign for running Activity-By-Contact model (ABC)
      • 03 --> ABC workflow config files
      • 04 --> acCREs co-accessibility analysis
      • 05 --> acCREs peak-to-gene linkage analysis
      • 06 --> acCREs ABC enhancer-to-promoter linkage analysis
      • 07 --> overlap of HDMA acCREs with ENCODE v4 cCREs
      • 08, 09 --> overlap of HDMA acCREs with VISTA enhancers
    • 05-misc
      • 01 --> create global BPCells object
      • 02 --> examples for plotting tracks using BPCells
      • 04 --> examples for ChromBPNet use cases, including how to load models and make predictions
    • 06-variants
      • 00 to 03 --> analysis related to eQTLs
      • 04 to 05 --> causal variant analysis with gchromvar
      • 06 --> variant scoring using ChromBPNet models
      • 07a to 07c --> plot variant scoring results

Code to produce the figures

Code to reproduce analyses is saved in code. This table contains pointers to code for the key analyses associated with each figure. The links in the Analysis column lead to rendered HTMLs, where possible, and the links in the Path column lead to scripts or notebooks within the repository.

| Figure | Analysis | Path |
| --- | --- | --- |
| Fig 1b, Fig S2b,c | Global QC and metadata | code/02-global_analysis/01-global_QC.Rmd |
| Fig 1c | Dendrogram and dotplot | code/02-global_analysis/02-dendrogram.Rmd |
| Fig 1c | ChromVAR heatmap | code/02-global_analysis/03-dendrogram_chromvar.Rmd |
| Fig 2a-e, Fig S2f | ABC linking of acCREs | code/04-enhancers/06-abc.Rmd |
| Fig 2f-g, Fig S3a, Fig S4k | Analysis of VISTA-overlapping enhancers | code/04-enhancers/09-overlap_VISTA.Rmd |
| Fig S2d-e | Overlap of acCREs with ENCODE CREs | code/04-enhancers/07-overlap_ENCODE_cCREs.Rmd |
| Fig 3b, Fig 6a, Fig S5 | Plotting tracks at select loci | code/03-chrombpnet/03-syntax/02-plot_tracks.Rmd |
| Fig 3c, Fig S4a,b,i,j | ChromBPNet QC and correlation plot | code/03-chrombpnet/01-train_models/03-model_QC.Rmd and code/03-chrombpnet/01-train_models/03b-plot_correlation.ipynb |
| Fig 3d-e, Fig 6b,d, Fig S4d-f, Fig S5b | Motif lexicon/compendium | code/03-chrombpnet/03-syntax/01-motif_compendium |
| Fig S4g-h | Visualize motif instances | code/03-chrombpnet/03-syntax/03-visualize_hits.ipynb |
| Fig 4, Fig 5a, Fig S6 | Analysis of motif cooperativity/synergy and syntax | code/03-chrombpnet/03-syntax/04c-plot_cooperativity_results.Rmd |
| Fig 5b | Context-specific motif cooperativity | code/03-chrombpnet/03-syntax/05b-context_specific_cooperativity.Rmd |
| Fig 6f, Fig S7 | eQTL enrichment analysis | code/06-variants/03-enrichment_test_collate_results.R |
| Fig 7b | g-chromVAR analysis | code/06-variants/04-gchromvar.R |
| Fig 7c-d | Plot tracks for variants of interest | code/06-variants/07b_rs12740374_muscle_endo_CAD.R and code/06-variants/07c_rs113892147_lung_macrophage_asthma.R |
| Fig S8 | Plot tracks for all fetal-only variants | code/06-variants/07a_plot_fetal_only_hits_variant_scoring_results.R |

Data availability

All data and analysis products (including fragment files, counts matrices, cell annotations, global acCRE annotations, ChromBPNet models, motif lexicon, motif instances, and genomic tracks) are deposited at https://zenodo.org/communities/hdma. A list of all data types and the corresponding URL and DOI is provided in Table S14 of the manuscript.

We provide a detailed description of the main data types deposited on Zenodo here, along with a demonstration of how to programmatically download files of interest.
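
As a rough illustration of what a programmatic download can look like, here is a minimal Python sketch that uses the public Zenodo REST API. It is not the demonstration from our documentation. It assumes that a record's JSON lists its files under `files`, each with a `key` (file name) and a `links.self` download URL, and that the record ID is the numeric suffix of a Zenodo DOI such as those listed in Table S14. The function names are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

# Public Zenodo REST endpoint for individual records.
ZENODO_API = "https://zenodo.org/api/records"


def record_url(record_id):
    """Build the API URL for one Zenodo record (ID = numeric DOI suffix)."""
    return f"{ZENODO_API}/{record_id}"


def fetch_record(record_id):
    """Fetch and parse the JSON metadata for one record (needs network access)."""
    with urllib.request.urlopen(record_url(record_id)) as resp:
        return json.load(resp)


def list_files(record):
    """Return (file name, download URL) pairs from a record's JSON.

    Assumes the layout files[i]["key"] and files[i]["links"]["self"].
    """
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def download(url, name, dest_dir="."):
    """Download one file into dest_dir and return the local path."""
    dest = Path(dest_dir) / name
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


# Example usage (placeholder record ID; substitute one from Table S14):
# record = fetch_record("1234567")
# for name, url in list_files(record):
#     if name.endswith(".tsv.gz"):
#         download(url, name, dest_dir="hdma_data")
```

Filtering on the file name before calling `download` avoids pulling every file in a record when only a subset is needed.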

All genomic tracks are also hosted online for interactive visualization with the WashU Genome Browser at https://epigenomegateway.wustl.edu/browser2022/?genome=hg38&hub=https://human-dev-multiome-atlas.s3.amazonaws.com/tracks/HDMA_trackhub.json. We demonstrate how to load tracks here.
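
The browser link above consists of a base URL plus two query parameters, `genome` and `hub`. The sketch below rebuilds it in Python, which may be handy for pointing the browser at a different hub file or genome build. The function name `browser_url` is hypothetical, and whether the browser accepts other parameter values is an assumption.

```python
from urllib.parse import urlencode

# Base URL and track hub location, copied from the link above.
BROWSER_BASE = "https://epigenomegateway.wustl.edu/browser2022/"
HDMA_HUB = "https://human-dev-multiome-atlas.s3.amazonaws.com/tracks/HDMA_trackhub.json"


def browser_url(hub=HDMA_HUB, genome="hg38"):
    """Build a WashU browser URL that loads the given track hub."""
    # safe=":/" leaves the hub URL unescaped, matching the link above.
    query = urlencode({"genome": genome, "hub": hub}, safe=":/")
    return f"{BROWSER_BASE}?{query}"


print(browser_url())
```

With the defaults, the printed URL is identical to the link given above.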

Vignettes

We provide a few notebooks with examples of how to interact with HDMA data, analysis outputs, and trained models:

Citation

If you use this data or code, please cite:

Dissecting regulatory syntax in human development with scalable multiomics and deep learning. Betty B. Liu, Selin Jessa, Samuel H. Kim, Yan Ting Ng, Soon il Higashino, Georgi K. Marinov, Derek C. Chen, Benjamin E. Parks, Li Li, Tri C. Nguyen, Sean K. Wang, Austin T. Wang, Serena Y. Tan, Michael Kosicki, Len A. Pennacchio, Eyal Ben-David, Anca M. Pasca, Anshul Kundaje, Kyle K.H. Farh, William J. Greenleaf, bioRxiv 2025.04.30.651381; doi: https://doi.org/10.1101/2025.04.30.651381
