Source code for "A lightweight framework for chromatin loop detection at the single-cell level"
- To get started, clone the project repository to your local machine, navigate to the project directory, and create a conda environment:
    git clone https://github.com/fzbio/scGSLoop.git
    cd scGSLoop
    conda create -n scloop python=3.8
- Activate the conda environment:
    conda activate scloop
- Download the five folders (`models`, `data`, `preds`, `refined_scools`, `region_filter`) from the scGSLoop assets and copy them to the project directory.
- Install PyTorch >= 1.8.0 according to its official documentation. We recommend using PyTorch 1.8.* for best compatibility.
- Install PyTorch-Geometric:
    conda install pyg -c pyg -c conda-forge
- Install other dependencies using the following command:
    pip install -r requirements.txt
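Before running anything, it can help to confirm that the five asset folders ended up in the project directory. The following is a minimal sketch, not part of scGSLoop itself; the helper name `missing_asset_folders` is hypothetical, and the folder names are taken from the download step above.

```python
from pathlib import Path

# Folder names as listed in the download step above.
REQUIRED_FOLDERS = ("models", "data", "preds", "refined_scools", "region_filter")


def missing_asset_folders(project_dir="."):
    """Return the required asset folders that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FOLDERS if not (root / name).is_dir()]


# Check the current working directory (expected to be the project directory).
missing = missing_asset_folders(".")
if missing:
    print("Missing asset folders:", ", ".join(missing))
else:
    print("All asset folders are in place.")
```

Run it from the project directory; an empty result means all five folders were found.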
scGSLoop accepts .scool files as input. If you are unfamiliar with this format, see Cooler's documentation for a detailed description.
You need to prepare your data at two resolutions: 10 kb and 100 kb. If you only have 10 kb data, you can coarsen it to 100 kb using cooler coarsen.
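Going from 10 kb to 100 kb corresponds to a coarsening factor of 10. As a rough sketch, the snippet below builds the `cooler coarsen` command for one cell, addressing the cell inside the .scool file by its Cooler URI (`<file>::/cells/<cell name>`). The file and cell names are hypothetical, and the sketch writes each coarsened cell to its own .cool file; how you reassemble the cells into a 100 kb .scool file is not covered here.

```python
def coarsen_command(cell_uri, out_uri, fine_bp=10_000, coarse_bp=100_000):
    """Build the `cooler coarsen` argument list that merges fine bins into coarse bins."""
    if coarse_bp % fine_bp != 0:
        raise ValueError("coarse resolution must be a multiple of the fine resolution")
    factor = coarse_bp // fine_bp  # 100 kb / 10 kb = 10
    # -k sets the coarsening factor, -o sets the output file or URI.
    return ["cooler", "coarsen", "-k", str(factor), "-o", out_uri, cell_uri]


# Hypothetical file and cell names, for illustration only.
cmd = coarsen_command("cells_10kb.scool::/cells/cell_001", "cell_001_100kb.cool")
print(" ".join(cmd))

# To execute the command (requires cooler to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```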
To predict loops with scGSLoop, you only need to edit the fields in configs.py. Their meanings are listed below:
Possible fields:
SCOOL_100KB: Path to the .scool file of 100 kb resolution.
SCOOL_10KB: Path to the .scool file of 10 kb resolution.
MODEL_ID: Identifier of a trained model.
    E.g., "mES_k3_GNNFINE" is the model trained on the mES
    dataset; "hpc_k3_GNNFINE" is the model trained on the
    hPFC dataset. These identifiers are used for specifying
    a model in the `models` directory.
CHROMOSOMES: A Python literal specifying the chromosomes on which
    loops are called.
    Example: ['chr' + str(i) for i in range(1, 23)]
MODEL_DIR: The path to the directory where models are stored.
OUT_DIR: The path to the directory where predictions are saved.
THRESHOLD: A float used as the cutoff to convert probability
    scores to binary predictions. Recommended: 0.5
MOTIF_FEATURE_PATH:
    Path to the motif features. We provide motif features for
    different assemblies in `data/graph_features`. Currently
    hg19, hg38, mm9, and mm10 are supported.
KMER_FEATURE_PATH:
    Path to the k-mer features. We provide k-mer features for
    different assemblies in `data/graph_features`. Currently
    hg19, hg38, mm9, and mm10 are supported.
IMPUTE: A boolean value specifying whether to conduct imputation.
    We recommend setting this value to False when the median
    number of contacts in individual cells exceeds 700,000.
OUT_IMPUTED_SCOOL_100KB:
    Path to output the imputed .scool file of 100 kb resolution.
    This field is ignored when IMPUTE is set to False.
OUT_IMPUTED_SCOOL_10KB:
    Path to output the imputed .scool file of 10 kb resolution.
    This field is ignored when IMPUTE is set to False.
IMPUTATION_DATASET_DIR:
    Path to the location where the PyTorch dataset for
    imputation will be saved.
    This field is ignored when IMPUTE is set to False.
GENOME_REGION_FILTER:
    Blacklist regions of the genome assembly. If you don't want
    to filter the predictions, set this field to None.
LOADER_WORKER:
    Number of workers used to load the PyTorch dataset. Set to 0
    to work in single-process mode.
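As an illustration, a filled-in configs.py might look like the sketch below. Only the field names, the two model identifiers, the CHROMOSOMES example, and the recommended threshold come from the descriptions above; every path is a hypothetical placeholder that you should replace with your own files, and the actual configs.py in the repository may contain additional fields.

```python
# configs.py (illustrative values only; all paths are hypothetical placeholders)

SCOOL_100KB = "data/my_cells_100kb.scool"   # your 100 kb input
SCOOL_10KB = "data/my_cells_10kb.scool"     # your 10 kb input

MODEL_ID = "hpc_k3_GNNFINE"                 # model trained on the hPFC dataset
CHROMOSOMES = ["chr" + str(i) for i in range(1, 23)]  # chr1 through chr22

MODEL_DIR = "models"                        # assumes the downloaded `models` folder
OUT_DIR = "preds/my_run"                    # where single-cell loop calls are written
THRESHOLD = 0.5                             # recommended probability cutoff

# Replace the bracketed parts with the files for your assembly.
MOTIF_FEATURE_PATH = "data/graph_features/<motif features for your assembly>"
KMER_FEATURE_PATH = "data/graph_features/<k-mer features for your assembly>"

IMPUTE = True                               # consider False above ~700,000 median contacts
OUT_IMPUTED_SCOOL_100KB = "data/my_cells_100kb.imputed.scool"  # ignored if IMPUTE is False
OUT_IMPUTED_SCOOL_10KB = "data/my_cells_10kb.imputed.scool"    # ignored if IMPUTE is False
IMPUTATION_DATASET_DIR = "data/imputation_dataset"             # ignored if IMPUTE is False

GENOME_REGION_FILTER = None                 # None disables blacklist filtering
LOADER_WORKER = 0                           # 0 means single-process data loading
```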
After configuring the program, run it with:
python predict_eval.py
The loop calls for each individual cell will be written to the directory you set as OUT_DIR.
After the single-cell loops are detected, you can use them to generate the consensus loop list.
Note: In this step, all predictions in pred_dir must come from the same cell type.
usage:
python consensus.py [-h] [-p PERCENTILE | -n NUM_LOOP] raw_scool_path pred_dir out_path assembly_size
Arguments:
raw_scool_path: Path to the raw 10kb .scool file
pred_dir: Path to the single-cell predictions
out_path: Path to output the consensus list
assembly_size: Path to the assembly size file (e.g. hg19.sizes)
Options:
The following two options are mutually exclusive. Choose one of them to set the threshold
for generating loops.
-p, --percentile: Percentile among all loop scores. Loops whose scores rank above this
    percentile are added to the consensus list.
-n, --num-loop: The total number of loops to include in the consensus list.
Percentiles used in our study:
hpc_k3_GNNFINE: 97.35
mES_k3_GNNFINE: 98.5
You can adjust the percentile or the number of loops if the final list contains too many
or too few loops.
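To make the two thresholding modes concrete, here is a small sketch of how a percentile cutoff and a fixed loop count each select loops from a set of scores. It is not the code in consensus.py: the function names, the linear-interpolation percentile, and the toy scores are all assumptions made for illustration.

```python
def percentile_cutoff(scores, percentile):
    """Return the score at the given percentile (0-100), using linear interpolation."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    ordered = sorted(scores)
    rank = (len(ordered) - 1) * percentile / 100.0
    lower = int(rank)                          # index just below the rank
    upper = min(lower + 1, len(ordered) - 1)   # index just above, clamped to the end
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def select_loops(loop_scores, percentile=None, num_loop=None):
    """Pick loops either above a percentile cutoff or as the top num_loop by score."""
    if (percentile is None) == (num_loop is None):
        raise ValueError("specify exactly one of percentile or num_loop")
    ranked = sorted(loop_scores.items(), key=lambda item: item[1], reverse=True)
    if num_loop is not None:
        return [loop for loop, _ in ranked[:num_loop]]
    cutoff = percentile_cutoff(list(loop_scores.values()), percentile)
    return [loop for loop, score in ranked if score > cutoff]


# Toy scores keyed by hypothetical loop names.
toy_scores = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4, "e": 0.5}
print(select_loops(toy_scores, percentile=50))
print(select_loops(toy_scores, num_loop=3))
```

Raising the percentile or lowering the loop count shrinks the list; the reverse enlarges it.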
Modify the variables in hub_discover.py:
chroms: A list containing the names of desired chromosomes
gene_coords_path: Path to a CSV file containing these columns:
    chr,start,end,strand,gene_id,gene_symbol
pred_dir: Path to the directory of single-cell loop preds
assembly_size: Path to the assembly size file (e.g. hg19.sizes)
consensus_path: Path to the consensus loop list
output_path: Path to the output file
usage:
python hub_discover.py
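Since gene_coords_path must point to a CSV file with a specific set of columns, a quick header check before running the script can catch a malformed file early. This is a minimal sketch, not part of hub_discover.py; the helper name and the sample row are hypothetical, and it only checks that the columns are present, not their order or contents.

```python
import csv
import io

# Column names as listed for gene_coords_path above.
EXPECTED_COLUMNS = ["chr", "start", "end", "strand", "gene_id", "gene_symbol"]


def missing_gene_columns(csv_file):
    """Return the expected columns that are absent from the header of an open CSV file."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Made-up two-line file used purely for illustration.
sample = io.StringIO(
    "chr,start,end,strand,gene_id,gene_symbol\n"
    "chr1,1000,2000,+,GENE0001,DEMO1\n"
)
print(missing_gene_columns(sample))
```

For a real file, pass an open handle instead, for example `open(gene_coords_path, newline="")`.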
If you find scGSLoop useful in your research, please cite our paper:
F. Wang, H. Alinejad-Rokny, J. Lin, T. Gao, X. Chen, Z. Zheng, L. Meng, X. Li, K.-C. Wong, A Lightweight Framework For Chromatin Loop Detection at the Single-Cell Level. Adv. Sci. 2023, 10, 2303502. https://doi.org/10.1002/advs.202303502
@article{wangLightweightFrameworkChromatin2023,
title = {A Lightweight Framework for Chromatin Loop Detection at the Single-Cell Level},
author = {Wang, Fuzhou and {Alinejad-Rokny}, Hamid and Lin, Jiecong and Gao, Tingxiao and Chen, Xingjian and Zheng, Zetian and Meng, Lingkuan and Li, Xiangtao and Wong, Ka-Chun},
year = {2023},
journal = {Advanced Science},
volume = {10},
number = {33},
pages = {2303502},
doi = {10.1002/advs.202303502}
}