🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 💨

pinellolab/DNA-Diffusion

DNA Diffusion

Generative modeling of regulatory DNA sequences with diffusion probabilistic models.

Documentation: https://pinellolab.github.io/DNA-Diffusion

Source Code: https://github.com/pinellolab/DNA-Diffusion


Introduction

DNA-Diffusion is a diffusion-based model for generating 200 bp cell type-specific synthetic regulatory elements.

Installation

Our preferred package / project manager is uv. Please follow their recommended instructions for installation.

To clone the repository and install the necessary packages, run:

git clone https://github.com/pinellolab/DNA-Diffusion.git
cd DNA-Diffusion
uv sync

This creates a virtual environment in .venv and installs all dependencies listed in pyproject.toml. The setup works on both CPU and GPU, but the preferred environment is Linux with a recent GPU (e.g. an A100).
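To check which kind of hardware an environment will end up using, a small helper like the one below can be run inside the virtual environment. This is only a sketch: it assumes the project's deep learning backend is PyTorch, which this README does not state, and the function name `pick_device` is hypothetical rather than part of the repository.

```python
def pick_device() -> str:
    """Return "cuda" if a CUDA GPU is usable, otherwise "cpu".

    Assumption: the backend is PyTorch. If PyTorch is not importable,
    the helper falls back to "cpu" instead of raising.
    """
    try:
        import torch  # assumed dependency, not confirmed by this README
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

Saved to a file (for example a hypothetical check_device.py), it could be invoked with uv run check_device.py.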

Usage

Training

To train the DNA-Diffusion model, we provide a basic config file for training the diffusion model on the same subset of chromatin accessible regions from the DHS Index dataset used in our main manuscript (K562, GM12878, HepG2, hESC cell lines).

To train the model, run:

uv run train.py

We also provide a base config for debugging that will use a single sequence for training. You can override the default training script to use this debugging config by calling:

uv run train.py -cn train_debug

Sequence Generation

We provide a basic config file for sequence generation that produces 1000 sequences per cell type. Base generation uses a guidance scale of 1.0; this can be tuned in sample.py via the cond_weight_to_metric parameter. To generate sequences, run:

uv run sample.py
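To give some intuition for the guidance scale mentioned above, here is a minimal sketch of one common classifier-free guidance formulation. It assumes the scale refers to classifier-free guidance and that conditional and unconditional noise predictions are blended linearly. The function `guided_noise` and its arguments are hypothetical, and the exact convention used inside sample.py may differ (some implementations use a `(1 + w)` weighting instead).

```python
def guided_noise(eps_cond, eps_uncond, guidance_scale=1.0):
    """Blend conditional and unconditional noise predictions.

    Illustrative convention (an assumption, not taken from sample.py):
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    Under this convention a scale of 1.0 returns the conditional
    prediction unchanged, and larger values push the estimate further
    towards the conditioning signal (here, the cell type).
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # toy conditional prediction
    uncond = [0.0, 0.0]  # toy unconditional prediction
    print(guided_noise(cond, uncond, guidance_scale=1.0))
    print(guided_noise(cond, uncond, guidance_scale=2.0))
```

In a real model the predictions would be tensors rather than Python lists; plain lists keep the sketch dependency-free.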

By default, sampling generates 1000 sequences per cell type. To generate a single sequence per cell type instead, pass the following command-line overrides:

uv run sample.py sampling.number_of_samples=1 sampling.sample_batch_size=1
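When launching many sampling runs with different settings, it can be convenient to assemble the command programmatically. The sketch below builds the same command line shown above. The override keys `sampling.number_of_samples` and `sampling.sample_batch_size` come from this README, while the helper `build_sample_command` is hypothetical and not part of the repository.

```python
import shlex


def build_sample_command(number_of_samples=1000, sample_batch_size=None):
    """Return the argument list for a sampling run with overrides.

    Only the two override keys shown in this README are used; the batch
    size override is added only when a value is given.
    """
    cmd = [
        "uv", "run", "sample.py",
        f"sampling.number_of_samples={number_of_samples}",
    ]
    if sample_batch_size is not None:
        cmd.append(f"sampling.sample_batch_size={sample_batch_size}")
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(shlex.join(build_sample_command(1, 1)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually start the run.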

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Lucas Ferreira da Silva 🤔 💻
Luca Pinello 🤔
Simon 🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!