Variant Calling at Scale is a scalable, parallel, and efficient implementation of next-generation sequencing (NGS) data pre-processing and variant calling workflows. Our design tightly integrates most pre-processing stages, using Spark built-in functions to sort reads by coordinate and to mark duplicates efficiently. A cluster-scaled DeepVariant for both CPU-only and CPU+GPU clusters is also integrated into this workflow.
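As a rough illustration of the design (not the repository's actual code), the following minimal PySpark sketch shows coordinate sorting and duplicate marking done with Spark built-ins. The record fields are hypothetical, and a real duplicate marker would also consider read orientation and library:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sort-markdup-sketch").getOrCreate()

# Hypothetical alignment records: (read_name, chrom, pos, mate_pos, flag)
reads = spark.createDataFrame(
    [("r1", "chr1", 100, 250, 99),
     ("r2", "chr1", 100, 250, 99),   # same signature as r1 -> duplicate
     ("r3", "chr1", 300, 450, 99)],
    ["read_name", "chrom", "pos", "mate_pos", "flag"],
)

# Coordinate sort with Spark's built-in sort, replacing an external samtools sort.
sorted_reads = reads.orderBy("chrom", "pos")

# Mark duplicates: reads sharing the same (chrom, pos, mate_pos) signature are
# duplicates; the first occurrence per signature stays unmarked.
w = Window.partitionBy("chrom", "pos", "mate_pos").orderBy("read_name")
marked = sorted_reads.withColumn("is_duplicate", F.row_number().over(w) > 1)

marked.show()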
This approach can be used on the following infrastructures:
- Slurm-based clusters (e.g., Cartesius, the Dutch national supercomputer)
- Google GCP Cluster (DataProc)
- Amazon (AWS) Cloud
- Alibaba Cloud
- Microsoft Azure
This approach can be used for the following purposes:
- Apache Spark based cluster-scaled pre-processing (BWA-MEM, sorting and MarkDuplicate) -> outputs multiple BAM files based on chromosome regions (the region-partitioning idea is sketched below)
- Apache Spark based cluster-scaled variant calling (BWA-MEM, sorting, MarkDuplicate and DeepVariant/Octopus) -> outputs a single VCF
- Apache Spark based cluster-scaled pre-processing (BWA-MEM, sorting, MarkDuplicate and BAM merging) -> outputs a single sorted/mkdup BAM file
- Apache Spark based cluster-scaled variant calling (DeepVariant/Octopus) -> input: a sorted/mkdup BAM -> outputs a single VCF file
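The per-region BAM output of the first mode boils down to keying aligned reads by chromosome region and writing each group separately on the workers. A minimal sketch of that idea, with hypothetical record contents and an assumed output layout (a real run would also pass the SAM header through to samtools):

import subprocess
from pyspark import SparkContext

sc = SparkContext(appName="per-region-bams-sketch")

# Hypothetical (region, sam_line) pairs produced by the alignment stage.
aligned = sc.parallelize([
    ("chr1", "r1\t99\tchr1\t100\t60\t...\n"),
    ("chr2", "r2\t99\tchr2\t200\t60\t...\n"),
])

def write_region_bam(region_and_lines):
    """Pipe one region's SAM records through samtools to get a region BAM.

    Runs on the worker nodes, so samtools must be installed there (as it is
    in the custom image built below)."""
    region, lines = region_and_lines
    out = "/mnt/fs_shared/query/ERR001268/bams/%s.bam" % region  # assumed layout
    p = subprocess.Popen(["samtools", "view", "-b", "-o", out, "-"],
                         stdin=subprocess.PIPE)
    p.communicate("".join(lines).encode())

# Group reads by region so each region becomes one BAM file.
aligned.groupByKey().foreach(write_region_bam)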
Supported aligners:
- BWA-MEM
- BWA-MEM2
- Bowtie2
- Minimap2
Supported variant callers:
- DeepVariant
- Octopus
- Strelka2
- GATK4 HaplotypeCaller
- Freebayes
Dependencies:
- C++ libraries
- C bindings using GLib
- Plasma Object Store: a shared-memory blob store, part of the (Arrow) C++ codebase
- Python libraries
Tested environment:
- Ubuntu 16.04/18.04
- Apache Spark 3.0.1
- Singularity container
- Install Singularity.
- Download our Singularity script and generate the Singularity image (this image contains all Arrow-related packages necessary for building/compiling BWA-MEM), e.g. with sudo singularity build <image_name>.simg Singularity (assuming the downloaded recipe file is named Singularity).
- Now enter the generated image using the command:
sudo singularity shell <image_name>.simg
The following steps are required to test/run this workflow.
- Before running the variant calling workflow, we have to create a custom operating system image with all prerequisite applications installed, to be used on the GCP DataProc cluster.
- We will then create a GCP DataProc cluster and a GCP Filestore instance.
- After the DataProc cluster is created, we will mount the Filestore on each node.
- Then we download the reference, query and GIAB benchmark truth datasets.
- The final step is to run the different parts of this workflow from the GCP console.
- Create a bucket inside GCP Storage to store the custom image, e.g. gs://{user}/images
- Use "gcloud config set project [PROJECT_ID]" to switch to a different project if needed.
- Inside Cloud Shell, run:
git clone https://github.com/tahashmi/custom-images
cd custom-images
python3 generate_custom_image.py --image-name "bwa-custom" --dataproc-version "2.0.1-ubuntu18" --customization-script bwa_standalone.sh --zone "asia-east1-a" --gcs-bucket "gs://{user}/images" --shutdown-instance-timer-sec 50 --no-smoke-test
This will create a custom image that can be used on GCP DataProc cluster instances.
On all master and worker nodes:
#***********************************************#
sudo apt-get -y update && sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/fs_shared
# Replace 10.35.205.242 with your Filestore instance's IP address
sudo mount 10.35.205.242:/fs_shared /mnt/fs_shared/
sudo chmod go+rw /mnt/fs_shared/
df -h --type=nfs
On any node
#***********************************************#
mkdir -p /mnt/fs_shared/reference
cd /mnt/fs_shared/reference
#Download reference genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai GRCh38.fa.fai
#Download the query data
mkdir -p /mnt/fs_shared/query/ERR001268
cd /mnt/fs_shared/query/ERR001268
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/ERR001268_1.filt.fastq.gz
gunzip ERR001268_1.filt.fastq.gz
mv ERR001268_1.filt.fastq ERR001268_1.fastq
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/ERR001268_2.filt.fastq.gz
gunzip ERR001268_2.filt.fastq.gz
mv ERR001268_2.filt.fastq ERR001268_2.fastq
#Create output dirs
mkdir -p /mnt/fs_shared/query/ERR001268/arrow
sudo chmod go+rw /mnt/fs_shared/query/ERR001268/arrow
cd /mnt/fs_shared/query/ERR001268
mkdir bams
sudo chmod go+rw bams/
cd /mnt/fs_shared/query/ERR001268/bams
mkdir outputs
sudo chmod go+rw outputs/
cd /mnt/fs_shared
#Download the vcfmerge script from https://gist.github.com/tahashmi/84927410a47fd0b76a66228c1b37a744
#(appending /raw fetches the raw script rather than the gist's HTML page)
wget -O /mnt/fs_shared/query/ERR001268/vcfmerge.sh https://gist.github.com/tahashmi/84927410a47fd0b76a66228c1b37a744/raw
sudo chmod +x /mnt/fs_shared/query/ERR001268/vcfmerge.sh
FASTQ data is streamed to BWA on every cluster node, and BWA output is piped into Sambamba for sorting. Duplicate removal is optional; if enabled, the sorted data is piped into this stage as well. For the final output, Samtools (merge) produces a single BAM file, ready for further downstream analysis.
[DataProc cluster] BWA (alignment) -> Sambamba (sorting) -> Sambamba (duplicate removal, optional) -> Samtools (merge BAMs)
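Conceptually, every cluster node runs this chain as piped OS processes. The following minimal Python sketch mirrors the standalone command shown later in this guide; the paths and thread counts are the ones used elsewhere here, and it is an illustration rather than the repository's actual driver code:

import subprocess

REF = "/mnt/fs_shared/reference/GRCh38.fa"
R1 = "/mnt/fs_shared/query/ERR001268/ERR001268_1.fastq"
R2 = "/mnt/fs_shared/query/ERR001268/ERR001268_2.fastq"
OUT = "/mnt/fs_shared/query/ERR001268/bams/ERR001268.bam"

# bwa mem writes SAM to stdout ...
bwa = subprocess.Popen(["bwa", "mem", "-t", "8", REF, R1, R2],
                       stdout=subprocess.PIPE)
# ... sambamba view converts SAM to BAM on the fly ...
view = subprocess.Popen(["sambamba", "view", "-t", "8", "-S", "-f", "bam", "/dev/stdin"],
                        stdin=bwa.stdout, stdout=subprocess.PIPE)
# ... and sambamba sort produces the coordinate-sorted BAM.
sort = subprocess.Popen(["sambamba", "sort", "-t", "8", "-o", OUT, "/dev/stdin"],
                        stdin=view.stdout)
bwa.stdout.close()
view.stdout.close()
sort.wait()

Chaining the stages through pipes keeps all intermediate SAM/BAM data in memory, so no per-stage temporary files ever touch the shared filesystem.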
- A custom image is needed for the DataProc cluster; follow these steps to create one.
- Create a network-attached storage system which can be used with Google Compute Engine instances: Storage -> Filestore -> give any name, e.g. "fs" {instance id/name} and "fs-shared" {file share name} -> Create
Perform the following steps to make the GCP DataProc cluster ready to run this workflow.
- Upload the script file to your bucket.
- Run the following command in the GCP Cloud Shell:
gcloud dataproc jobs submit pyspark --region=us-central1 --cluster=cluster-555 --properties=spark.pyspark.python=/usr/bin/python3.6,spark.pyspark.driver.python=/usr/bin/python3.6,spark.executor.memory=2G,spark.driver.memory=2G,spark.num.executors=2,spark.executor.cores=8 gs://bucket_taha_pk/scripts/bwa-standalone.py -- --markdup yes --ref /mnt/fs_shared/reference/GRCh38.fa --path /mnt/fs_shared/query/ERR001268/ --nodes 2 --cores 8 --aligner BWA
time bwa mem -t 8 /mnt/fs_shared/reference/GRCh38.fa /mnt/fs_shared/query/ERR001268/ERR001268_1.fastq /mnt/fs_shared/query/ERR001268/ERR001268_2.fastq | /usr/local/bin/sambamba-0.8.0-linux-amd64-static view -t 8 -S -f bam /dev/stdin | /usr/local/bin/sambamba-0.8.0-linux-amd64-static sort -t 8 -o /mnt/fs_shared/query/ERR001268/bams/ERR001268.bam /dev/stdin
time /usr/local/bin/sambamba-0.8.0-linux-amd64-static markdup -t 8 -r /mnt/fs_shared/query/ERR001268/bams/ERR001268.bam /mnt/fs_shared/query/ERR001268/bams/ERR001268_md.bam
Any variant caller that supports region-specific variant calling can be integrated into this workflow. Here we use Octopus, a recent, accurate and fast variant caller, as a use case to demonstrate the feasibility of integrating other variant callers into this approach, as sketched below.
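For orientation, this is roughly what a single per-region Octopus invocation looks like inside the pulled container. This is a hedged sketch: the image file name depends on your Singularity version, the region list and output layout are illustrative, and -R, -I, -T and -o are Octopus's reference, reads, regions and output options:

import subprocess

REF = "/mnt/fs_shared/reference/GRCh38.fa"
BAM = "/mnt/fs_shared/query/ERR001268/bams/ERR001268_md.bam"
IMG = "/mnt/fs_shared/octopus_latest.sif"  # name depends on Singularity version

def call_region(region):
    """Run Octopus on a single chromosome region inside the pulled container.

    Each Spark task would handle one region like this, so all regions are
    called in parallel across the cluster."""
    out = "/mnt/fs_shared/query/ERR001268/bams/outputs/%s.vcf" % region
    subprocess.run(["singularity", "exec", IMG, "octopus",
                    "-R", REF, "-I", BAM, "-T", region, "-o", out],
                   check=True)

for region in ["chr1", "chr2", "chr3"]:  # illustrative region list
    call_region(region)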
- Download the Octopus Singularity container onto the GCP Filestore:
cd /mnt/fs_shared
sudo singularity pull docker://dancooke/octopus
- Repeat parts 1 and 2.
- Run the following command inside the GCP console, changing --vcaller to Octopus:
gcloud dataproc jobs submit pyspark --region=us-central1 --cluster=cluster-555 --properties=spark.pyspark.python=/usr/bin/python3.6,spark.pyspark.driver.python=/usr/bin/python3.6,spark.executor.memory=2G,spark.driver.memory=2G,spark.num.executors=3,spark.executor.cores=8 gs://bucket_taha_pk/scripts/bwa.py -- --part 3 --ref /mnt/fs_shared/reference/GRCh38.fa --path /mnt/fs_shared/query/HG003/ --nodes 3 --cores 8 --aligner BWA --vcaller Octopus
- Repeat part 4 for merging VCFs and generating accuracy results using hap.py.
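The hap.py invocation itself is not shown in this guide; below is a hedged example of benchmarking a merged VCF against the GIAB truth set, where the truth VCF, confident-regions BED and merged query VCF paths are placeholders for your downloaded files:

import subprocess

# hap.py compares a query VCF to a truth VCF within confident regions:
#   hap.py TRUTH QUERY -f CONFIDENT_BED -r REFERENCE -o OUTPUT_PREFIX
subprocess.run([
    "hap.py",
    "/mnt/fs_shared/benchmark/HG003_GRCh38_truth.vcf.gz",  # GIAB truth (placeholder name)
    "/mnt/fs_shared/query/HG003/merged.vcf.gz",            # merged workflow output (placeholder)
    "-f", "/mnt/fs_shared/benchmark/HG003_confident.bed",  # confident regions (placeholder)
    "-r", "/mnt/fs_shared/reference/GRCh38.fa",
    "-o", "/mnt/fs_shared/query/HG003/happy_results",
], check=True)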
The output should be like this:
DeepVariant:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10634 10579 55 21045 24 9984 19 0.994828 0.997830 0.474412 0.996327 NaN NaN 1.749861 2.296457
INDEL PASS 10634 10579 55 21045 24 9984 19 0.994828 0.997830 0.474412 0.996327 NaN NaN 1.749861 2.296457
SNP ALL 70209 69947 262 85681 85 15619 14 0.996268 0.998787 0.182292 0.997526 2.297347 2.071024 1.884533 1.937783
SNP PASS 70209 69947 262 85681 85 15619 14 0.996268 0.998787 0.182292 0.997526 2.297347 2.071024 1.884533 1.937783
Octopus:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10634 10586 48 23110 89 11874 22 0.995486 0.992079 0.513804 0.993780 NaN NaN 1.749861 2.081653
INDEL PASS 10634 10579 55 20827 18 9670 9 0.994828 0.998387 0.464301 0.996604 NaN NaN 1.749861 1.879637
SNP ALL 70209 69909 300 99329 569 29170 34 0.995727 0.991890 0.293671 0.993805 2.297347 1.966237 1.884533 2.461922
SNP PASS 70209 69856 353 82612 87 12987 11 0.994972 0.998750 0.157205 0.996858 2.297347 2.147613 1.884533 1.920645