This repository contains the code and instructions for MPRA-LegNet described in:
Massively parallel characterization of transcriptional regulatory elements
Vikram Agarwal, Fumitaka Inoue, Max Schubach, Dmitry Penzar, Beth K. Martin, Pyaree Mohan Dash, Pia Keukeleire, Zicong Zhang, Ajuni Sohota, Jingjing Zhao, Ilias Georgakopoulos-Soares, William S. Noble, Galip Gürkan Yardimci, Ivan V. Kulakovskiy, Martin Kircher, Jay Shendure, Nadav Ahituv, bioRxiv
MPRA-LegNet is a variant of LegNet (Paper, Repo) specifically optimized for predicting gene expression from human lentiMPRAs (lentiviral massively parallel reporter assays). The model was built and tested on data obtained with the human K562, HepG2, and WTC11 cell lines.
Please start by installing the requirements from environment.txt or environment.yml with conda (see Docs) or mamba (see Docs):
cd human_legnet
mamba env create -f envs/environment.yml
mamba activate legnet
Please refer to envs/environment.yml for technical details regarding the preinstalled packages and their version numbers.
MPRA-LegNet was successfully tested on multiple Ubuntu Linux releases (including but not limited to 20.04.3).
To train a model from the publication you will need 7 GB of GPU memory and 32 GB of RAM.
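As a quick sanity check after activating the environment, you can confirm that PyTorch sees a GPU with enough memory. This is a minimal sketch; it only assumes that the legnet environment provides torch:

import torch

# Report the PyTorch version and whether a CUDA-capable GPU is visible
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Total memory of the first GPU in GB; training needs roughly 7 GB
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GB")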
10-fold cross-validation was used for model training and testing. The respective script is vikram/human_legnet/core.py, which should be used as follows:
python core.py --model_dir <target dir to save the models checkpoint files> --data_path <experiment data tables used in study> --epoch_num <epochnum> --use_shift --reverse_augment
Please use --help for more details regarding the command-line parameters.
This script was also used to test the impact of data augmentation on model performance; e.g., adding --use_reverse_channel will add the respective channel to the input.
In the study, the shift was set to the size of the forward adapter (21 bp).
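For example, an augmentation run on the K562 data that also adds the reverse channel could look like the following (the output directory name is arbitrary, and the epoch count is taken from the demo run below; both are illustrative):

python core.py --model_dir K562_rev_channel --data_path datasets/original/K562.tsv --epoch_num 25 --use_shift --reverse_augment --use_reverse_channel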
For convenience, the data used to train the model is available in the datasets/original directory.
Files named <cell_line>.tsv were used for training, and files named <cell_line>_averaged.tsv were used for assessing the model's final performance.
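If you want to inspect a training table before launching a run, a minimal sketch with pandas (assuming only that the file is a standard tab-separated table with a header) is:

import pandas as pd

# Load the K562 training table and report its shape and column names
df = pd.read_csv("datasets/original/K562.tsv", sep="\t")
print(df.shape)
print(df.columns.tolist())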
Pre-trained models as well as the respective experimental data can be downloaded here.
To train a model for only one cross-validation split (1st fold as test, 2nd fold as validation), use the --demo flag:
python core.py --model_dir demo_K562 --data_path datasets/original/K562.tsv --epoch_num 25 --use_shift --reverse_augment --demo
This command will take about 5 minutes to complete on a single NVIDIA GeForce RTX 3090 using 8 CPUs.
It will produce the demo_K562 directory containing:
- config.json -- the model config required to run predictions
- model_2_1 -- a directory with the model weights (lightning_logs/version_0/checkpoints/last_model-epoch=24.ckpt)
- predictions_new_format.tsv -- predictions for the test fold
The predictions_new_format.tsv file will contain predictions for the forward and reverse sequence orientations in the forw_pred and rev_pred columns, respectively.
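For downstream analysis it can be convenient to average the two orientations. A minimal sketch, assuming the demo output directory above and a tab-separated header containing these columns:

import pandas as pd

# Read the per-fold predictions produced by the demo run
preds = pd.read_csv("demo_K562/predictions_new_format.tsv", sep="\t")

# Average the forward- and reverse-orientation predictions per sequence
preds["mean_pred"] = (preds["forw_pred"] + preds["rev_pred"]) / 2
print(preds.head())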
To get predictions from all cross-validation models, we used asb_predict.py for the ADASTRA (allele-specific transcription factor binding) dataset and coverage_predict.py for the UDACHA (allele-specific chromatin accessibility) dataset.
Command-line format:
python asb_predict.py --config <model config> --model <models dir> --asb_path <path to the ADASTRA dataset> --genome <path to human genome> --out_path <out file> --device 0 --max_shift 0
python coverage_predict.py --config <model config> --model <models dir> --cov_path <path to the UDACHA dataset> --genome <path to human genome> --out_path <out file> --device 0 --max_shift 0
The datasets are available in the repository; see the datasets/asb and datasets/coverage folders.
The scripts compare the predicted and observed effect directions of single-nucleotide variants, i.e., the effect of the alternative allele relative to the reference allele at allele-specific regulatory sites. In other words, they assess whether the reference or the alternative allele is preferred for transcription factor binding or chromatin accessibility.
The post-processing of predictions is performed with the Jupyter notebook variant_annotation.ipynb.
Estimating the performance metrics (Fisher's exact test, R-squared calculation) for the processed predictions is performed using the R script analyze_concordance.R.
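The actual metrics are computed by analyze_concordance.R; purely as an illustration of the concordance idea, a Fisher's exact test on a 2x2 table of predicted versus observed effect directions can be sketched in Python. The file name and the pred_effect / obs_effect column names are hypothetical placeholders for the post-processed table:

import pandas as pd
from scipy.stats import fisher_exact

# Hypothetical post-processed table with per-variant predicted and observed effect signs
df = pd.read_csv("processed_variants.tsv", sep="\t")  # placeholder path

# 2x2 contingency table: predicted effect direction vs observed effect direction
table = pd.crosstab(df["pred_effect"] > 0, df["obs_effect"] > 0)
odds_ratio, p_value = fisher_exact(table.values)
print(table)
print("odds ratio:", odds_ratio, "p-value:", p_value)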
It is possible to run the LegNet model against a user-supplied FASTA file to obtain predictions for each sequence. This can be achieved with fasta_predict.py.
Example command:
python fasta_predict.py --config <model config> --model <model checkpoint> --fasta <path to fasta> --out_path <out path> --device 0
Please note that this is a fairly basic wrapper, and this version of LegNet is optimized to handle 230 bp-long sequences. LegNet will technically work for other sequence lengths, as it uses global average pooling; however, if the sequence length differs substantially from 230 bp, performance will be reduced. Because predictions are made per batch, sequences of different lengths cannot be scored together unless the batch size is set to 1.
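For a quick end-to-end check of fasta_predict.py, you can generate a small toy FASTA of random 230 bp sequences (the file name and sequence count are arbitrary):

import random

random.seed(0)
with open("toy_sequences.fasta", "w") as f:
    for i in range(3):
        # LegNet expects ~230 bp sequences, so generate random 230-mers
        seq = "".join(random.choice("ACGT") for _ in range(230))
        f.write(f">toy_seq_{i}\n{seq}\n")

The resulting toy_sequences.fasta can then be passed to the --fasta argument of the command above.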
MPRA-LegNet and the original LegNet are distributed under the MIT License.