This repository is a Snakemake pipeline that we use to process public scRNA-seq data in scNavigator. The scope of the main branch is to process the raw data and run basic Seurat analysis.
Read the full documentation at https://scn-pipeline.readthedocs.io/.
To install the pipeline, clone the main branch and install Snakemake:
$ git clone https://github.com/ctlab/scn-pipeline.git
$ cd scn-pipeline
$ conda install -n base -c conda-forge mamba
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake
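To check that the installation succeeded, you can, for example, print the Snakemake version from the activated environment:
$ snakemake --version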
For every step in this pipeline we have specified the minimal conda environment required to run it, so please run the pipeline with:
$ snakemake --use-conda ...
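With --use-conda, Snakemake builds and activates a dedicated conda environment for each rule from its environment file. A minimal sketch of what such an environment file looks like (the file name and the pinned STAR version are illustrative, not taken from the repository):

# envs/star.yaml (hypothetical name): minimal per-rule conda environment
channels:
  - conda-forge
  - bioconda
dependencies:
  - star=2.7.10a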
You first have to configure the project and provide paths to the relevant files and folders. The configuration is stored in the configs/config.yaml file and consists of several fields. The only two required fields are out_dir and ncbi_dir.
out_dir is a directory that will be used to store results, preliminary results, resources, and logs.
ncbi_dir is a directory that is configured via vdb-config (see README.vdb-config in https://github.com/ncbi/sra-tools); the path to this directory is usually stored in ~/.ncbi/user-settings.mkfg.
out_dir: '/path/to/out/dir' # output directory with all the results
ncbi_dir: '/path/to/configured/ncbi/folder' # directory from vdb-config
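If you have not configured sra-tools yet, the interactive configurator is one way to set the repository directory (assuming sra-tools is installed and on your PATH):
$ vdb-config --interactive         # set the repository root interactively
$ cat ~/.ncbi/user-settings.mkfg   # the configured path is recorded in this file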
Once the pipeline is configured, list the datasets in ./config/dataset.yaml.
The contents of the file should be just a list of dataset IDs (see the example below):
["GSE145241", "GSE116240"]
Once the datasets are specified, the pipeline runs in two main steps:
- Acquiring meta information (we use FFQ + custom scripts to detect the single-cell technology and chemistry version)
- Processing (we use STAR + Seurat for processing and further analysis of the dataset)
To acquire all the meta information, run:
$ snakemake --use-conda get_all_meta
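Note that recent Snakemake releases require the number of cores to be specified explicitly; the core count below is just an example:
$ snakemake --use-conda --cores 8 get_all_meta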
To process the datasets for which you already have meta information, simply run:
$ snakemake --use-conda process_all
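Before launching the full run, a dry run (-n) lets you preview the jobs Snakemake would execute without actually running anything:
$ snakemake -n --use-conda process_all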
Results of the pipeline can be found in the out_dir directory that you configured:
/out_dir/resources - all the resources (whitelists and genome indexes) will be stored here
/out_dir/logs - all the logs will be stored here
/out_dir/meta/{dataset} - meta information for the datasets (results of get_all_meta)
/out_dir/data/samples/{dataset}/{sample} - sample-level analysis and STAR results are stored here
/out_dir/data/datasets/{dataset}/ - dataset-level integration analysis results are stored here
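For example, after processing one of the example datasets above, you could inspect its dataset-level results like this (the out_dir path is the placeholder from your configuration):
$ ls /path/to/out/dir/data/datasets/GSE145241/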