This project is licensed under the terms of the MIT license.
The preprint describing scE2G is available here.
Input: Single-cell ATAC-seq or paired ATAC and RNA-seq (multiome) data per cell cluster
Output: Genomewide enhancer-gene regulatory link predictions per cell cluster
The pipeline consists of the following components:
1. Compute ABC model predictions for each cell cluster
2. Generate E2G features from ABC predictions
3. If running scE2G (Multiome): compute Kendall correlation and/or ARC-E2G score for each cell cluster
4. Combine 2 & 3 to construct a feature file to be used as input to predictive model
5. (optional) Train predictive model using CRISPR-validated E-G pairs from K562 dataset
6. Apply trained model to make predictions by assigning a score to each E-G pair
Clone the repo and set it up for submodule usage
git clone --recurse-submodules https://github.com/EngreitzLab/scE2G.git
git config --global submodule.recurse true
When running for the first time, the conda environments have to be set up.
We highly recommend using the environment specified in workflow/envs/run_snakemake.yml, which specifies the exact package versions compatible with the pipeline.
For speed, we include mamba in this recommended environment.
conda config --set channel_priority flexible # Make sure your conda config uses flexible channel packaging to prevent unsatisfiable errors
conda env create -f workflow/envs/run_snakemake.yml
conda activate run_snakemake
Before running this workflow, users should perform clustering to define cell clusters through regular single-cell analysis (such as Seurat & Signac).
Required input data includes (refer to the example data in the resources/example_chr22_multiome_cluster folder):
- Pseudobulk fragment files and their corresponding *.tbi index files in the same directory for each cell cluster
  - Must be sorted by coordinates and gzipped. If it is not sorted, you can use the sortBed tool from bedtools: sortBed atac_fragments.unsorted.tsv > atac_fragments.tsv
  - Must have 5 columns (no header) corresponding to chr, start, end, cell_name, read_count (usually just 1)
  - The cell_name column should correspond to the cell names in the RNA count matrix. All cells in the RNA matrix must be represented in the fragment file.
  - To create a .tbi index, use bgzip from HTSlib instead of gzip to compress the fragment file: bgzip atac_fragments.tsv, then generate the corresponding .tbi index file using tabix -p bed atac_fragments.tsv.gz
- For scE2G (Multiome): RNA count matrix (gene x cell) for each cell cluster
  - Use unnormalized (raw) counts
  - Must be in either .csv.gz or .h5ad format
  - Ensure there are not duplicated gene names
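The fragment file requirements above can be checked before launching the pipeline. The following is a minimal sketch using only the Python standard library; it is not part of scE2G. The function name check_fragment_file is hypothetical, and "sorted by coordinates" is interpreted here as contiguous chromosome blocks with non-decreasing start positions, which is an assumption about what the pipeline expects.

```python
import gzip


def check_fragment_file(path, rna_cells=None):
    """Return a list of problems found in a gzipped (or bgzipped) fragment file.

    Hypothetical helper: checks for 5 tab-separated columns, no header row,
    coordinate sorting, and (optionally) that every RNA cell name is present.
    """
    problems = []
    seen_cells = set()
    finished_chroms = set()
    prev_chrom, prev_start = None, -1
    with gzip.open(path, "rt") as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 5:
                problems.append(f"line {line_no}: expected 5 columns, found {len(fields)}")
                continue
            chrom, start, end, cell, count = fields
            if not (start.isdigit() and end.isdigit() and count.isdigit()):
                problems.append(f"line {line_no}: non-integer start, end, or read_count (header row?)")
                continue
            start = int(start)
            if chrom != prev_chrom:
                # Assumption: each chromosome must form one contiguous block.
                if chrom in finished_chroms:
                    problems.append(f"line {line_no}: {chrom} appears in more than one block")
                if prev_chrom is not None:
                    finished_chroms.add(prev_chrom)
                prev_chrom, prev_start = chrom, -1
            if start < prev_start:
                problems.append(f"line {line_no}: start {start} precedes previous start {prev_start}")
            prev_start = start
            seen_cells.add(cell)
    if rna_cells is not None:
        # All cells in the RNA matrix must be represented in the fragment file.
        missing = sorted(set(rna_cells) - seen_cells)
        if missing:
            problems.append(f"{len(missing)} RNA cell(s) absent from fragments, e.g. {missing[0]}")
    return problems
```

Calling check_fragment_file("atac_fragments.tsv.gz") returns an empty list when no problem is detected. Passing the cell names from the RNA count matrix as rna_cells adds the coverage check; how those names are read from a .csv.gz or .h5ad file is left out of this sketch.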
To configure the pipeline:
- Modify config/config.yaml to specify paths for results_dir.
- Modify config/config_cell_clusters.tsv to specify the RNA matrix path, fragment file path, Hi-C file path, Hi-C data type, Hi-C resolution, TSS coordinates, and gene coordinates for each cell cluster. If running scE2G (ATAC), leave the RNA matrix path blank.
- Specify model directory from those in models/ to be applied
Running the pipeline:
snakemake -j1 --use-conda
This command may take a while the first time you run it, as it needs to build the conda environments. But if it takes more than 1 hour, that's usually a bad sign, meaning that you're not using mamba and/or need more memory to build the environment.
Output will show up in the results/ directory by default, with the structure results/cell_cluster/model_name/encode_e2g_predictions.tsv.gz. The score column to use is E2G.Score.qnorm.
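As an illustration of working with the output, the sketch below reads a predictions file and keeps the rows whose E2G.Score.qnorm is at least a chosen cutoff. It assumes the file is a tab-separated table with a header row. The function name load_links and the min_score parameter are hypothetical, and this section does not state which cutoff is appropriate.

```python
import csv
import gzip


def load_links(path, min_score, score_column="E2G.Score.qnorm"):
    """Yield prediction rows (as dicts) whose score is at least min_score.

    Hypothetical helper; assumes a gzipped, tab-separated file with a header.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if score_column not in (reader.fieldnames or []):
            raise KeyError(f"{score_column} not found in {path}")
        for row in reader:
            value = row[score_column]
            # Skip rows without a usable score.
            if value in ("", "NA", "NaN", "nan"):
                continue
            if float(value) >= min_score:
                yield row
```

Because load_links is a generator, rows are streamed rather than held in memory, which may matter for genome-wide prediction files. Wrap the call in list(...) if the filtered rows are needed more than once.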
Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)
Modify config/config_training.yaml with your model and cell_cluster configs
- model_config has columns: model, dataset, ABC_directory, feature_table, polynomial (do you want to use polynomial features?). Note that trained models generated using polynomial features cannot directly be used in the Apply model workflow.
- cell_cluster_config has rows representing each "dataset" in model_config, where each "dataset" must correspond to a "cluster" in cell_cluster_config
  - If an ABC_directory is not specified for a dataset, its entry in cell_cluster_config must also contain the required ABC biosample parameters
  - TO DO: specify how to generate and formats for Kendall parameters
- To apply a trained model, it must contain the following files: 1) model.pkl, 2) feature_table.tsv, 3) score_threshold_.XX, 4) tpm_threshold_YY (YY=0 if ATAC-only model), 5) qnorm_reference.tsv.gz (single column with header E2G.Score that contains raw scores for genomewide predictions)
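The file list above lends itself to a quick check before applying a model. This is a minimal sketch, not part of the pipeline: check_model_dir is a hypothetical name, and it assumes that the two threshold files are recognized by their name prefixes (score_threshold_ and tpm_threshold_) and that exactly one of each is expected.

```python
import gzip
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems with a trained-model directory.

    Hypothetical helper based on the five required files listed above.
    """
    model_dir = Path(model_dir)
    names = [p.name for p in model_dir.iterdir() if p.is_file()] if model_dir.is_dir() else []
    problems = []
    for exact in ("model.pkl", "feature_table.tsv", "qnorm_reference.tsv.gz"):
        if exact not in names:
            problems.append(f"missing {exact}")
    # Assumption: threshold values are encoded in the file name after the prefix.
    for prefix in ("score_threshold_", "tpm_threshold_"):
        matches = [n for n in names if n.startswith(prefix)]
        if len(matches) != 1:
            problems.append(f"expected one file starting with {prefix}, found {len(matches)}")
    # The reference file should be a single column with the header E2G.Score.
    ref = model_dir / "qnorm_reference.tsv.gz"
    if ref.is_file():
        with gzip.open(ref, "rt") as handle:
            header = handle.readline().strip()
        if header != "E2G.Score":
            problems.append(f"qnorm_reference.tsv.gz header is {header!r}, expected 'E2G.Score'")
    return problems
```

An empty return value means the directory passed these checks; it does not confirm that model.pkl itself loads or matches the feature table.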
Running the pipeline:
snakemake -s workflow/Snakefile_training -j1 --use-conda
Output will show up in the results_training/ directory by default.