📝 Note: This repo is currently undergoing development. To access the version using for the encode_re2g paper, go to this version. There are currently no clear instructions for stitching together the outputs from ABC, e2g features, and e2g, so use at your own discretion. We are working on creating 1 clean pipeline for the future
ENCODE-rE2G is a logistic regression pipeline built on top of ABC. Given a chromatin accessibility input file, it will generate a list of enhancer-gene predictions. You can read the preprint paper here
Clone the repo and set it up for submodule usage
git clone --recurse-submodules git@github.com:EngreitzLab/ENCODE_rE2G.git
git config --global submodule.recurse true
We use ABC
as a submodule, so this command will initialize it and set up your git config to automatically keep the submodule up to date.
You'll need to use a certain model based on your input. (e.g DNase-seq or ATAC-seq? Do you have H3K27ac data?) We've pretrained all the models and determined the right thresholding to get E-G links at 70% recall of a CRISPR-validated E-G links.
Modify the ABC_BIOSAMPLES
field in config/config.yaml
to point to your ABC config. Read more about ABC config here.
- We have not trained powerlaw models. If you don't have cell specific hic, opt to use megamap hic instead: https://s3.us-central-1.wasabisys.com/aiden-encode-hic-mirror/bifocals_iter2/tissues.hic. Megamap is a culmination of hic across many different tissues.
- [Advanced] If applying a model that includes external features, you must define an
external_features_config
in yourbiosamples_config.
See "Train model" section for details on this file.
Activate a conda environment that has mamba installed.
mamba env create -f workflow/envs/encode_re2g.yml
conda activate encode_re2g
snakemake -j1 --use-conda
Based on your biosample config, we will find the right model to use for you. If we haven't trained that model before, an exception will get raised.
Output will show up in the results/
directory
- Binarized predictions will be located at
results/{biosample_name}/{model_name}/encode_e2g_predictions_threshold.{threshold}.tsv.gz
- Non-thresholded models will be located at
results/{biosample_name}/{model_name}/encode_e2g_predictions.tsv.gz
with the score in a column named "ENCODE-rE2G.Score".
We have pre-trained ENCODE-rE2G on certain model types. You can find them in the models
directory.
Each model must have the following:
- model pickle file (
model.pkl
corresponding tomodel_full.pkl
from the model training workflow) - feature table file (
feature_table.tsv
, the corresponding feature table file from model training) - threshold file (
threshold_0.XXX
where predictions with a score greater than 0.XXX are binarized as true links.
The way we choose the model depends on the biosamples input. The code for model selection can be found here.
To override default model selection and specify a different model (either one you've trained yourself or the extended model), add a column called model_dir
to your biosample config. Multiple model directories can be specified as a comma-separated list. NOTE: The genome-wide feature tables to reproduce the ENCODE-rE2G_Extended model included in the prediction files on Synapse.org for K562 and GM12878. To use these feature tables, download the feature tables and remove the ".Feature" suffix from feature name columns.
Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)
- Much of the the model training code was adapted from Alireza Karbalayghareh's original implementation.
Modify config/config_training.yaml
with your model and dataset configs
model_config
has columns: model, dataset, ABC_directory, feature_table, polynomial (do you want to use polynomial features?), and override_params (are there model training parameters you would like to change from the default logistic regression settings specfied inconfig/config_training.yaml
?)- See this example
model_config
for how to specfiy override parameters. If there are no override_params, leave the column blank but still include the header. - Feature tables must be specified for each model (example:
resources/feature_tables
) with columns: feature (name in final table), input_col (name in ABC output), second_input (multiplied by input_col if provided), aggregate_function (how to combine feature values when a CRISPR element overlaps more than one ABC element), fill_value (how to replace NAs), nice_name (used when plotting) - Note that trained models generated using polynomial features cannot directly be used in the Apply model workflow
- See this example
dataset_config
is an ABC biosamples config to generate ABC predictions for datasets without an existing ABC directory.- Each dataset must correspond to a unique ABC_directory, with "biosample" in
dataset_config
equals "dataset" inmodel_config
. If no ABC_directory is indicated inmodel_config
, it must have an entry indataset_config
. - If you are including features in addition to those generated within the pipeline (e.g. a value in input_col or second_input of a feature table is not included in
reference_features
inconfig/config_training/yaml
), you must also define how to add these values with an external_features_config, which you include indataset_config
in the optional column external_features_config:- An
external_features_config
has columns feature (corresponding to the missing input_col or second_input value), source_col (column name in the source file), aggregate_function (how to combine values when merging different element definitions), join_by, and source_file - join_by must be either "TargetGene" (feature is defined per gene) or "overlap" (feature is defined per element-gene pair)
- If join_by is "TargetGene," source_file must be a .tsv with, at minimum, columns source_col and TargetGene. If join_by is "overlap," source_file must be a .tsv with, at minimum, columns chr, start, end, TargetGene, source_col.
- An
Activate a conda environment that has mamba installed.
mamba env create -f workflow/envs/encode_re2g.yml
conda activate encode_re2g
snakemake -s workflow/Snakefile_training -j1 --use-conda
results/{biosample_name}/{model_name}/model_name/model_full.pkl
: full model trained on all chromosomes
results/{biosample_name}/{model_name}/model/training_predictions.tsv
: rE2G predictions on CRISPR training data, using leave 1 chromosome out models