NGS-pipe provides analyses for large-scale DNA and RNA sequencing experiments. Its pre-implemented functions cover germline variant detection, identification of somatic single nucleotide variants (SNVs) and insertions and deletions (InDels), copy number event detection, and differential expression analysis. It also provides pre-configured workflows, so that the final mutational information, quality reports, and all intermediate results can be generated quickly, even by inexperienced users. The pipeline runs on a single computer or in a cluster environment, where independent steps are executed in parallel. If a step fails and produces incomplete or no results, all dependent steps are halted and an error message is shown. Once the issue is resolved, the pipeline resumes the analysis at the appropriate point on its own, so there is no need to rerun the complete analysis or to delete erroneous files manually.
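The halt-and-resume behaviour described above can be illustrated with a minimal sketch. This is not NGS-pipe's implementation: the pipeline relies on its workflow engine for this, and the function `run_pipeline`, the `.done` marker files, and the step names below are hypothetical and only show the idea of skipping finished steps and halting everything downstream of a failure.

```python
from pathlib import Path
import tempfile


def run_pipeline(steps, workdir):
    """Run steps in order; skip finished ones, halt dependents of a failed step.

    steps: list of (name, dependencies, action) tuples in topological order.
    A step counts as finished when its marker file <name>.done exists
    (an assumption of this sketch, not how NGS-pipe tracks results).
    Returns a dict mapping each step name to
    'skipped', 'done', 'failed' or 'halted'.
    """
    workdir = Path(workdir)
    status = {}
    for name, deps, action in steps:
        marker = workdir / f"{name}.done"
        # Anything downstream of a failed or halted step is not attempted.
        if any(status.get(dep) in ("failed", "halted") for dep in deps):
            status[name] = "halted"
            continue
        # Results from an earlier run are reused instead of recomputed.
        if marker.exists():
            status[name] = "skipped"
            continue
        try:
            action()
        except Exception as err:
            print(f"step {name!r} failed: {err}")
            status[name] = "failed"
            continue
        # The marker is written only after the step completed successfully,
        # so a failed step never leaves a marker behind.
        marker.touch()
        status[name] = "done"
    return status


if __name__ == "__main__":
    def broken_alignment():
        raise RuntimeError("reference index missing")

    with tempfile.TemporaryDirectory() as tmp:
        steps = [
            ("trim", [], lambda: None),
            ("align", ["trim"], broken_alignment),
            ("call_variants", ["align"], lambda: None),
        ]
        print(run_pipeline(steps, tmp))   # first run: align fails
        steps[1] = ("align", ["trim"], lambda: None)
        print(run_pipeline(steps, tmp))   # rerun after fixing the issue
```

On the second run only the failed step and its dependents are executed; the trimming step is skipped because its marker already exists.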

See also the wiki pages of this repository for more information about NGS-pipe.

Workflows for WES, WGS, and RNA-seq data

We have implemented and tested predefined workflows for the automated analysis of WES, WGS, and RNA-seq data (Fig. 1).


The primary data analysis steps include Trimmomatic (Bolger, 2014) to process raw files, BWA (Li, 2009) or STAR (Dobin, 2013) to align reads, and Picard tools, SAMtools (Li, 2009) and GATK (McKenna, 2010) to process the aligned reads.

Detecting genomic variants is highly dependent on properties of the input data, such as variant frequency, coverage, or contamination (Cai, 2016; Hofmann, 2017). For this reason, we included several variant callers in NGS-pipe, namely MuTect (Cibulskis, 2013), JointSNVMix2 (Roth, 2012), VarScan2 (Koboldt, 2012), VarDict (Lai, 2016), SomaticSniper (Larson, 2011), Strelka (Saunders, 2012), and deepSNV (Gerstung, 2012). Further, we included SomaticSeq (Fang, 2015), which combines the results of multiple variant callers and ranked high in the ICGC-TCGA DREAM Somatic Mutation Calling Challenge (Ewing, 2015), as well as the rank aggregation scheme introduced by Hofmann (2017).
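To give an intuition for how ranked calls from several callers can be merged into one list, here is a small sketch using a plain Borda count. This is a generic stand-in chosen for illustration; it is neither the scheme of Hofmann (2017) nor the SomaticSeq model. The function name `borda_aggregate`, the caller keys, and the variant identifiers are all hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge per-caller rankings into one consensus ranking (Borda count).

    rankings: dict mapping caller name -> list of variant ids, best first.
    A variant absent from a caller's list receives 0 points from that caller.
    Ties are broken alphabetically by variant id to keep the result stable.
    """
    universe = {variant for ranked in rankings.values() for variant in ranked}
    n = len(universe)
    scores = defaultdict(int)
    for ranked in rankings.values():
        for position, variant in enumerate(ranked):
            # Top position earns n points, the next n - 1, and so on.
            scores[variant] += n - position
    return sorted(universe, key=lambda variant: (-scores[variant], variant))


if __name__ == "__main__":
    # Hypothetical ranked calls from three callers.
    example = {
        "caller_a": ["chr1:100:A>T", "chr2:200:G>C", "chr3:300:C>A"],
        "caller_b": ["chr2:200:G>C", "chr1:100:A>T"],
        "caller_c": ["chr2:200:G>C", "chr3:300:C>A"],
    }
    for variant in borda_aggregate(example):
        print(variant)
```

A variant that is ranked highly by several callers rises to the top, while a call supported by a single caller sinks, which is the general motivation for combining callers whose strengths depend on coverage, frequency, and contamination.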

Copy number events are detected by FACETS (Shen, 2016) or BIC-seq2 (Xi, 2016); the latter has been designed specifically for whole-genome data.

The results of the experiments can be annotated and manipulated using SnpEff (Cingolani, 2012), SnpSift (Cingolani, 2012) and ANNOVAR (Wang, 2010).

RNA-seq data is analyzed to quantify gene expression levels. We include quality control, alignment, and gene counting using the Subread (Liao, 2014) package. Output files are reformatted to serve as direct input to tools that perform differential gene expression analysis.
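As an illustration of the reformatting step, the sketch below turns a count table in the layout written by featureCounts (the counting program of the Subread package) into a plain gene-by-sample matrix. It assumes that featureCounts is the counting tool used, that counts are integers, and that the table has the usual six annotation columns; the function names are hypothetical and this is not the script shipped with NGS-pipe.

```python
import csv
import io

# Annotation columns that precede the per-sample counts in a featureCounts table.
ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_featurecounts(handle):
    """Return (sample_names, {gene_id: [counts, ...]}) from a featureCounts table.

    Lines starting with '#' (the program/command comment) are ignored.
    Assumes integer counts, i.e. no fractional counting was requested.
    """
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    first_sample = len(ANNOTATION_COLUMNS)
    samples = header[first_sample:]
    counts = {}
    for row in reader:
        if not row:
            continue
        counts[row[0]] = [int(value) for value in row[first_sample:]]
    return samples, counts


def write_count_matrix(samples, counts, handle):
    """Write a tab-separated matrix: one row per gene, one column per sample."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id"] + samples)
    for gene, values in counts.items():
        writer.writerow([gene] + values)


if __name__ == "__main__":
    # Tiny made-up table in featureCounts layout.
    raw = (
        "# Program:featureCounts; Command:(omitted)\n"
        "Geneid\tChr\tStart\tEnd\tStrand\tLength\ttumor.bam\tnormal.bam\n"
        "GENE_A\tchr1\t1\t100\t+\t100\t5\t7\n"
        "GENE_B\tchr2\t1\t50\t-\t50\t0\t3\n"
    )
    samples, counts = read_featurecounts(io.StringIO(raw))
    out = io.StringIO()
    write_count_matrix(samples, counts, out)
    print(out.getvalue(), end="")
```

Dropping the annotation columns leaves the kind of count matrix that differential expression tools commonly expect as input.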


The directory examples/wes/ contains a ready-to-go example for the analysis of three leukemia patients (Cifola, 2015). This example downloads matched tumor-control exome data sets from the Sequence Read Archive, installs the required programs, downloads the necessary reference files, and builds the essential indices. Afterwards, an analysis is run, starting with the mapping of the reads via BWA (Li, 2009) and going all the way to somatic variant calling with VarScan2 (Koboldt, 2012). After installing all tools via conda, you can proceed as follows:

#1. Go to examples folder:
cd examples/dna
#2. Download test data: We provide an additional snakemake pipeline to 
#   download test sequences, databases and adapter files:
# This will download 6 test data sets, the adapters, regions file,
# the human reference and build the BWA database index
#3. Execute the DNA Pipeline:
# This will execute: RAW --> QC(Trimmomatic) --> Mapping(BWA) --> Sort(Picard)
# --> Merge(Picard) --> Remove Secondary Alignments(Samtools) --> MarkDuplicates(Picard)
# --> RemoveDuplicates(Samtools) --> SNV Calling (VarScan2)

An example for RNA-seq data analysis can be found in examples/rna/.


