- Purpose
- An open-source, flexible pipeline to analyze DNBelab C Series™ single-cell RNA datasets.
- Language
- Workflow Description Language (WDL), Python3 and R scripts.
- Hardware/Software requirements
  - x86-64 compatible processors
  - At least 36 GB of RAM and 10 CPUs
  - 64-bit Linux
- Workflow
  - bin: pre-compiled executables for Linux
  - config: read structure configuration files
  - pipelines: WDL pipelines
  - scripts: miscellaneous scripts
Setup
1. Install Docker by following the instructions on the official website:
https://www.docker.com/
2. Then pull the workflow image:
docker pull huangshunkai/dnbelab_c4:latest
Notes:
1. Please make sure that you run the docker container with at least 36 GB of memory and 10 CPUs.
2. The inputs are the sample list and the output directory, which are described below (main program arguments).
Prepare
$ cat config.json
{
"main.fastq1": "/DNBelab_C4/rawfq/demo_1.fq.gz",
"main.fastq2": "/DNBelab_C4/rawfq/demo_2.fq.gz",
"main.ID": "Demo_single",
"main.forceCell": "0",
"main.umilow": "1000",
"main.species":"GRCh38",
"main.original":"cell lines",
"main.SampleTime":"2020-06-25",
"main.ExperimentalTime":"2020-06-25"
}
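Before launching the container, you can sanity-check "config.json" with a minimal Python sketch like the one below (not part of the pipeline; the required keys are taken from the example above):

import json

# Keys taken from the example config.json above
REQUIRED = {"main.fastq1", "main.fastq2", "main.ID", "main.forceCell",
            "main.umilow", "main.species", "main.original",
            "main.SampleTime", "main.ExperimentalTime"}

with open("config.json") as fh:
    missing = REQUIRED - set(json.load(fh))
print("missing keys:", ", ".join(sorted(missing)) or "none")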
Running
1. Please set the following variables on your machine:
(a) $DB_LOCAL: directory on your local machine that holds the database files. The directory must contain two subdirectories, "gtf" and "star_index": the gene annotation file named "genes.gtf" must be placed under "gtf", and the STAR genome index under "star_index". If you build the database yourself, make sure this directory layout is correct (see the sketch after this list).
(b) $DATA_LOCAL: directory on your local machine that holds the sequencing data and the "config.json" file. "config.json" must follow the format described in the Prepare section above, and the paths in "config.json" must be absolute paths pointing to files under $DATA_LOCAL as it is mounted in the container (i.e., under /DNBelab_C4/rawfq).
(c) $RESULT_LOCAL: directory for the results.
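A quick check of the $DB_LOCAL layout can save a failed run. This is a minimal sketch, not part of the pipeline; the expected subdirectories come from the description in (a):

import os, sys

db_local = sys.argv[1]  # the directory you will mount as /DNBelab_C4/database
# The database must contain gtf/genes.gtf and a star_index subdirectory
for required in ("gtf/genes.gtf", "star_index"):
    path = os.path.join(db_local, required)
    if not os.path.exists(path):
        sys.exit("missing " + path + ": check the database layout")
print("database layout looks OK")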
2. Run the command:
For 10x sequencing data:
docker run -d -P \
--name $scRNANAME \
-v $DB_LOCAL:/DNBelab_C4/database \
-v $DATA_LOCAL:/DNBelab_C4/rawfq \
-v $RESULT_LOCAL:/DNBelab_C4/result \
huangshunkai/dnbelab_c4:latest \
/bin/bash \
/DNBelab_C4/bin/10xRun.sh
For MGI sequencing data:
docker run -d -P \
--name $scRNANAME \
-v $DB_LOCAL:/DNBelab_C4/database \
-v $DATA_LOCAL:/DNBelab_C4/rawfq \
-v $RESULT_LOCAL:/DNBelab_C4/result \
huangshunkai/dnbelab_c4:latest \
/bin/bash \
/DNBelab_C4/bin/mgiRun.sh
3. After a satisfactory result has been generated, remove the container:
docker rm $scRNANAME
To run the workflow without Docker, clone the repository and install the dependencies listed below:
$ git clone https://github.com/MGI-tech-bioinformatics/DNBelab_C_Series_scRNA-analysis-software.git
- java
- Cromwell
- R (3.5+) # with the following R packages installed
- ggplot2
- getopt
- data.table
- cowplot
- DropletUtils(1.6.1+)
- python3 (3.6+) # with the following Python3 packages installed
- numpy
- pandas
- python-igraph
- louvain
- scanpy(1.4.3+)
- jinja2(2.10.3+)
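To quickly verify the Python-side dependencies, the small sketch below imports each listed package and prints its version (note that python-igraph is imported as igraph):

import importlib

# Package list taken from the dependencies above
for pkg in ("numpy", "pandas", "igraph", "louvain", "scanpy", "jinja2"):
    module = importlib.import_module(pkg)
    print(pkg, getattr(module, "__version__", "unknown"))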
Database
We provide the following databases for download, including fasta, gtf, and STAR (v2.7.3a) index files.
- human(GRCh38)
- mouse(GRCm38)
- Mixed Database(GRCh38 & GRCm38)
Note: the mixed dual-species database is only for double-species sample analysis.
First, prepare the fasta and gtf files of the reference database, and then build the STAR index files. Please refer to the following command lines.
### Go to the pipeline directory
$ ls
bin/ config/ doc/ example/ LICENSE pipelines/ README.md scripts/
### Create a folder and prepare the reference files
$ cd example/database && mkdir star_index
$ gzip -d gtf/genes.gtf.gz
### Build the STAR index
$ ../../bin/STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles ./fasta/genome.fa --sjdbGTFfile ./gtf/genes.gtf
Dec 29 20:03:24 ..... started STAR run
*** logs ignored
Dec 29 20:04:33 ..... finished successfully
For a 3 GB reference sequence file, building the index takes about 1 hour. Note that the STAR index is tied to the STAR version that built it; we use STAR v2.7.3a by default.
An input JSON file specifies all the input parameters, including the genome reference index directory, for running the pipelines. Always use absolute paths in the input JSON.
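To catch a stray relative path before launching Cromwell, the sketch below (illustration only) flags every string value in the input JSON that looks like a path but is not absolute:

import json, os, sys

with open(sys.argv[1]) as fh:  # e.g. config.json
    config = json.load(fh)

# Treat any string value containing "/" as a path and require it to be absolute
for key, value in config.items():
    if isinstance(value, str) and "/" in value and not os.path.isabs(value):
        print("relative path in " + key + ": " + value)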
Single-species analysis
- Step 1: Prepare fastq. We provide a 100 MB demo sequencing dataset for testing. We also provide a 36 GB paired-end PBMC sample fastq for download.
- Step 2: Set up the configure file.
# Go to the test directory
$ cd ./example/single_Species
# Check the configure file
$ cat config.json
{
"main.fastq1": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/example/single_Species/fastq/Demo.human.fq.1.gz",
"main.fastq2": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/example/single_Species/fastq/Demo.human.fq.1.gz",
"main.root": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software",
"main.gtf": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/databases/GRCh38/gtf/genes.gtf",
"main.ID": "demo",
"main.outdir": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/example/single_Species/result",
"main.config": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/config/DNBelabC4_scRNA_readStructure.json",
"main.Rscript":"/User/Pub/third_party/Rscript",
"main.refdir": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/databases/GRCh38/star_index",
"main.Python3": "/User/Pub/third_party/python3",
"main.species":"human",
"main.original":"cell line",
"main.SampleTime":"2019-12-25",
"main.ExperimentalTime":"2019-12-25"
}
- Step 3: Run this pipeline.
java -jar cromwell-35.jar run -i config.json ../../pipelines/Droplet_single.wdl
- Step 4: Check results.
# After all analysis processes end, you will get the files below:
$ cd result && ls
outs/ report/ temp/ symbol/ workflowtime.log
$ ls outs
cell_barcodes.txt cluster.h5ad count_mtx.tsv.gz final.bam
$ ls report
alignment_report.csv annotated_report.csv cell_report.csv cluster.csv cutoff.csv iDrop_demo.html marker.csv RNA_counts.pdf sample.csv sequencing_report.csv vln.csv
$ ls symbol
# In the single_Species result, the following files will be generated:
makedir_sigh.txt parseFastq_sigh.txt fastq2bam_sigh.txt sortBam_sigh.txt cellCount_sigh.txt cellCalling_sigh.txt countMatrix_sigh.txt report_sigh.txt
The final HTML report is located at <outdir>/report/iDrop_*.html.
Double-species analysis
- Step 0: Build the reference index. Please refer to the Database section above.
- Step 1: Prepare fastq. We provide a 135 MB demo sequencing dataset for testing. We also provide a 52 GB paired-end mixed-sample (GRCh38 & mm10) fastq for download.
- Step 2: Set up the configure file.
# Go to the test directory
$ cd ./example/double_Species
# Check the configure file
$ cat config.json
{
"main.fastq1": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/example/double_Species/read_1.fq.gz",
"main.fastq2": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/example/double_Species/read_2.fq.gz",
"main.root": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software",
"main.gtf": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/databases/GRCh38_mm10/gtf/genes.gtf",
"main.ID": "demo",
"main.chrom":"/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/config/species_binding.txt",
"main.outdir": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/example/double_Species/result",
"main.config": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/config/DNBelabC4_scRNA_readStructure.json",
"main.Rscript":"/User/Pub/third_party/Rscript",
"main.refdir": "/User/pipeline/DNBelab_C_Series_scRNA-analysis-software/databases/GRCh38_mm10/star_index",
"main.Python3": "/User/Pub/third_party/python3",
"main.species":"GRCh38_mm10",
"main.original":"cell line",
"main.SampleTime":"2019-12-25",
"main.ExperimentalTime":"2019-12-25"
}
- Step 3: Run this pipeline.
java -jar cromwell-35.jar run -i config.json ../../pipelines/Droplet_double.wdl
- Step 4: Check results.
# After all analysis processes end, you will get the files below:
$ cd result && ls
outs/ report/ temp/ symbol/ workflowtime.log
$ ls outs
anno_species.bam
$ ls symbol
# In the double_Species result, the following files will be generated:
makedir_sigh.txt parseFastq_sigh.txt fastq2bam_sigh.txt sortBam_sigh.txt cellCount_sigh.txt cellCalling_sigh.txt report_sigh.txt
$ ls report
alignment_report.csv cell_barcodes.txt cell_report.csv mix_report.csv sample.csv iDrop_Demo.html annotated_report.csv cell_count_summary.png cutoff.csv mixture_cells.png sequencing_report.csv vln.csv
The final HTML report is located at <outdir>/report/iDrop_*.html.
FAQ
- Does this pipeline correct UMI errors?
Yes. UMIs from the same cell and the same gene are corrected using Hamming distance and frequency; within each group of similar UMIs, only the UMI with the highest count is retained.
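The sketch below illustrates the idea (it is not the pipeline's actual implementation): within one cell/gene group, each UMI is folded into the most frequent UMI within Hamming distance 1 of it.

from collections import Counter

def hamming(a, b):
    # Number of mismatched positions between two equal-length UMIs
    return sum(x != y for x, y in zip(a, b))

def correct_umis(umis, max_dist=1):
    # Fold each UMI into a strictly more frequent UMI within max_dist
    counts = Counter(umis)
    ranked = sorted(counts, key=counts.get, reverse=True)
    corrected = {}
    for umi in ranked:
        parent = next((p for p in ranked
                       if counts[p] > counts[umi] and hamming(p, umi) <= max_dist),
                      umi)
        corrected[umi] = parent
    return [corrected[u] for u in umis]

print(correct_umis(["AACG", "AACG", "AACG", "AACT"]))  # AACT folds into AACG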
- Does this pipeline correct cell barcode errors?
Yes. Cell barcodes not contained in the whitelist are corrected using Levenshtein distance. Users can edit this distance parameter in the read structure configure file. If the distance is set to 0, the pipeline skips the correction.
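As an illustration (not the pipeline's actual code; dropping a barcode that matches several whitelist entries, and dropping unmatched barcodes when correction is skipped, are assumptions of this sketch), the whitelist correction could look like this:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct_barcode(barcode, whitelist, max_dist=1):
    if barcode in whitelist:
        return barcode
    if max_dist == 0:  # distance 0 skips the correction entirely
        return None
    hits = [wl for wl in whitelist if levenshtein(barcode, wl) <= max_dist]
    return hits[0] if len(hits) == 1 else None  # ambiguous matches are dropped

print(correct_barcode("AACGT", ["AACGG", "TTTTT"]))  # corrected to AACGG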
- How many QC steps does this pipeline perform, and which parameters are used at each step?
For fastqs, the pipeline filters reads with low quality (< Q20) or with 2 bases below Q10 in the first 15 bases. For BAMs, reads with low mapping quality (< 20) are filtered. For UMIs, see the first question.
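For illustration, the fastq filter could be sketched as below (assumptions of this sketch: the Q20 rule applies to the read's average base quality, qualities are Phred+33 encoded, and "2 bases" means two or more):

def passes_qc(qual_string):
    # qual_string: Phred+33 encoded base qualities of one read
    phred = [ord(c) - 33 for c in qual_string]
    if sum(phred) / len(phred) < 20:              # average quality below Q20
        return False
    if sum(q < 10 for q in phred[:15]) >= 2:      # >= 2 bases below Q10 in first 15 bases
        return False
    return True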
- Can I use this pipeline to analyze 10x Genomics single-cell gene expression data?
Yes. The read structure configure file can be found at config/10X_3end_readstruct.json; the other steps are the same as in the demonstration.
- Can I continue running the workflow if errors occurred during the process?
Yes. The result/symbol directory records a marker file for each completed step, so you can keep the output path unchanged and rerun the pipeline after correcting the error. If you use the docker image, run
docker start $scRNANAME && docker exec -d $scRNANAME /bin/bash /DNBelab_C4/bin/10xRun.sh
or
docker start $scRNANAME && docker exec -d $scRNANAME /bin/bash /DNBelab_C4/bin/mgiRun.sh
- Why is the inflection point inaccurate on the total count curve?
You can specify "main.umilow" in the configure file, e.g. "main.umilow": "1000". "main.umilow" is a numeric scalar specifying the lower bound on the total UMI count: all barcodes at or below this bound are assumed to correspond to empty droplets (default 50).
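For intuition, here is a rough sketch of an inflection-point search on the log-log rank curve (the pipeline's cell calling relies on DropletUtils; this toy version only shows why barcodes at or below "main.umilow" are excluded first):

import numpy as np

def find_inflection(total_counts, umilow=50):
    # total_counts: numpy array of per-barcode total UMI counts
    # Drop barcodes at or below the empty-droplet bound, rank the rest
    counts = np.sort(total_counts[total_counts > umilow])[::-1]
    x = np.log10(np.arange(1, counts.size + 1))   # log rank
    y = np.log10(counts)                          # log total UMI count
    slope = np.diff(y) / np.diff(x)               # first derivative of the curve
    return counts[np.argmin(slope)]               # steepest drop = inflection point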