This document outlines all the steps and considerations for calling and merging a trio using DeepVariant and GLnexus. These best practices were developed and evaluated as described in the bioRxiv preprint Accurate, scalable cohort variant calls using DeepVariant and GLnexus.
The process involves 3 major stages: running DeepVariant to create individual genome call sets, running GLnexus to merge call sets, and analyzing the merged call set.
NOTE: This case study demonstrates an example of how to run DeepVariant end-to-end on one machine. The steps below were done on a machine with this example command to start a machine.
The steps in this document can be extended to merge larger cohorts as well.
See this workflow:
A few things to note before we start:
- If you are looking for ways to run DeepVariant in larger batches, please refer to the third party solutions section.
- It is recommended to use BAM files with original quality scores. In the case
that BAM files went through recalibration, optional DV flags can be used in
order to use original scores:
--parse_sam_aux_fields
,--use_original_quality_scores
. - DeepVariant optionally allows gVCF output. This option is required for further GLnexus analysis in this document.
The Whole Exome Sequencing (WES) dataset we're using is from:
ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/
- HG002_NA24385_son
- HG003_NA24149_father
- HG004_NA24143_mother
Just for convenience, we use aria2 to download our data. You can change it to whatever other tools (wget, curl) that you prefer.
To install aria2, you can run: sudo apt-get -y install aria2
DIR="${PWD}/trio"
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam -o HG002.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bai -o HG002.bai
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bam -o HG003.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bai -o HG003.bai
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008.posiSrt.markDup.bam -o HG004.bam
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008.posiSrt.markDup.bai -o HG004.bai
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/hs37d5.fa.gz
gunzip ${DIR}/hs37d5.fa.gz
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/hs37d5.fa.fai
aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome-case-study-testdata/agilent_sureselect_human_all_exon_v5_b37_targets.bed
HG002:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz -o HG002_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf.gz.tbi -o HG002_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -o HG002_truth.bed
HG003:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv3.3.2/GRCh37/HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz -o HG003_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv3.3.2/GRCh37/HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz.tbi -o HG003_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv3.3.2/GRCh37/HG003_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -o HG003_truth.bed
HG004:
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv3.3.2/GRCh37/HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz -o HG004_truth.vcf.gz
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv3.3.2/GRCh37/HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf.vcf.gz.tbi -o HG004_truth.vcf.gz.tbi
aria2c -c -x10 -s10 -d "${DIR}" ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG004_NA24143_mother/NISTv3.3.2/GRCh37/HG004_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -o HG004_truth.bed
Here are example commands used to install these tools. You should also look at http://www.htslib.org for official instructions.
sudo apt-get install -y build-essential libncurses5-dev zlib1g-dev libbz2-dev liblzma-dev tabix
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar -xvf samtools-1.9.tar.bz2
pushd samtools-1.9 && ./configure && make && sudo make install && popd
wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar -xvf bcftools-1.9.tar.bz2
pushd bcftools-1.9 && ./configure && make && sudo make install && popd
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -xvf htslib-1.9.tar.bz2
pushd htslib-1.9 && ./configure && make && sudo make install && popd
First, install docker if you don't have it yet: sudo apt-get -y install docker.io
With the example command below, it runs DeepVariant on the trio one by one. This is for demonstration only. If you're running this on a large cohort, running serially is not the most effective approach.
N_SHARDS=$(nproc) # Or change to the number of cores you want to use
CAPTURE_BED=agilent_sureselect_human_all_exon_v5_b37_targets.bed
VERSION=0.9.0
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
BAM=${SAMPLE}.bam
OUTPUT_VCF=${SAMPLE}.vcf.gz
OUTPUT_GVCF=${SAMPLE}.g.vcf.gz
time sudo docker run \
-v "${DIR}":"/data" \
google/deepvariant:${VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
--ref="/data/hs37d5.fa" \
--reads="/data/${BAM}" \
--regions="/data/${CAPTURE_BED}" \
--output_vcf="/data/${OUTPUT_VCF}" \
--output_gvcf="/data/${OUTPUT_GVCF}" \
--num_shards=${N_SHARDS}
done
And then run GLnexus with this config:
sudo docker pull quay.io/mlin/glnexus:v1.2.2
time sudo docker run \
-v "${DIR}":"/data" \
quay.io/mlin/glnexus:v1.2.2 \
/usr/local/bin/glnexus_cli \
--config DeepVariantWES \
--bed "/data/${CAPTURE_BED}" \
/data/HG004.g.vcf.gz /data/HG003.g.vcf.gz /data/HG002.g.vcf.gz \
| bcftools view - | bgzip -c > ${DIR}/deepvariant.cohort.vcf.gz
When we ran on this WES trio, it took only about 13 seconds. For more details on
performance, see
GLnexus performance guide.
And, if you are merging a WGS cohort,
please use the --config DeepVariantWGS
. The corresponding params can be found in
WGS params and
WES params.
Create an SDF template from our reference file:
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools format \
-o /data/hs37d5.sdf /data/hs37d5.fa
Create a PED file $DIR/trio.ped
that looks like this (with the sample name
of the trio):
#PED format pedigree
#
#fam-id/ind-id/pat-id/mat-id: 0=unknown
#sex: 1=male; 2=female; 0=unknown
#phenotype: -9=missing, 0=missing; 1=unaffected; 2=affected
#
#fam-id ind-id pat-id mat-id sex phen
1 Sample_Diag-excap51-HG002-EEogPU Sample_Diag-excap51-HG003-EEogPU Sample_Diag-excap51-HG004-EEogPU 1 0
1 Sample_Diag-excap51-HG003-EEogPU 0 0 1 0
1 Sample_Diag-excap51-HG004-EEogPU 0 0 2 0
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools mendelian \
-i /data/deepvariant.cohort.vcf.gz \
-o /data/deepvariant.annotated.vcf.gz \
--pedigree=/data/trio.ped \
-t /data/hs37d5.sdf \
| tee ${DIR}/deepvariant.input_rtg_output.txt
The output is:
Checking: /data/deepvariant.cohort.vcf.gz
Family: [Sample_Diag-excap51-HG003-EEogPU + Sample_Diag-excap51-HG004-EEogPU] -> [Sample_Diag-excap51-HG002-EEogPU]
1 non-pass records were skipped
Concordance Sample_Diag-excap51-HG002-EEogPU: F:59682/60267 (99.03%) M:60059/60245 (99.69%) F+M:59329/60107 (98.71%)
Sample Sample_Diag-excap51-HG002-EEogPU has less than 99.0 concordance with both parents. Check for incorrect pedigree or sample mislabelling.
862/60626 (1.42%) records did not conform to expected call ploidy
60430/60626 (99.68%) records were variant in at least 1 family member and checked for Mendelian constraints
266/60430 (0.44%) records had indeterminate consistency status due to incomplete calls
803/60430 (1.33%) records contained a violation of Mendelian constraints
From this report, we know that there is a 1.33% Mendelian violation rate, and
0.44% of the records had incomplete calls (with .
) so RTG couldn't determine
whether there is violation or not.
In addition to the cohort quality statistics, for completeness we generate single-sample quality metrics.
We run bcftools stats
on the 3 VCF outputs. Since our DeepVariant run already
constrained to just the capture regions, no need to specify it again here. We
had to pass in the -f PASS
flag so that only the PASS calls are considered.
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
bcftools stats -f PASS \
${DIR}/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.stats
done
Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) |
---|---|---|---|---|---|---|
HG002 | 30171 | 11811 | 2.55 | 30158 | 11788 | 2.56 |
HG003 | 30088 | 11900 | 2.53 | 30075 | 11874 | 2.53 |
HG004 | 30310 | 11993 | 2.53 | 30296 | 11971 | 2.53 |
If you want to restrict to the truth BED files, use this command:
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
bcftools stats -f PASS \
-T ${DIR}/${SAMPLE}_truth.bed \
${DIR}/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.with_truth_bed.stats
done
Which resulted in this table:
Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) |
---|---|---|---|---|---|---|
HG002 | 24472 | 9253 | 2.64 | 24467 | 9243 | 2.65 |
HG003 | 24169 | 9185 | 2.63 | 24166 | 9177 | 2.63 |
HG004 | 24315 | 9334 | 2.60 | 24308 | 9327 | 2.61 |
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run \
-v "${DIR}":"/data" \
realtimegenomics/rtg-tools vcfstats \
/data/${SAMPLE}.vcf.gz \
> ${DIR}/${SAMPLE}.vcfstats
done
which shows the following:
HG002:
Location : /data/HG002.vcf.gz
Failed Filters : 11500
Passed Filters : 45670
SNPs : 41946
MNPs : 0
Insertions : 1921
Deletions : 1784
Indels : 17
Same as reference : 2
SNP Transitions/Transversions: 2.55 (42027/16451)
Total Het/Hom ratio : 1.53 (27585/18083)
SNP Het/Hom ratio : 1.54 (25438/16508)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.16 (1033/888)
Deletion Het/Hom ratio : 1.60 (1097/687)
Indel Het/Hom ratio : - (17/0)
Insertion/Deletion ratio : 1.08 (1921/1784)
Indel/SNP+MNP ratio : 0.09 (3722/41946)
HG003:
Location : /data/HG003.vcf.gz
Failed Filters : 12048
Passed Filters : 45622
SNPs : 41945
MNPs : 0
Insertions : 1906
Deletions : 1751
Indels : 18
Same as reference : 2
SNP Transitions/Transversions: 2.51 (41888/16665)
Total Het/Hom ratio : 1.52 (27501/18119)
SNP Het/Hom ratio : 1.53 (25364/16581)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.20 (1041/865)
Deletion Het/Hom ratio : 1.60 (1078/673)
Indel Het/Hom ratio : - (18/0)
Insertion/Deletion ratio : 1.09 (1906/1751)
Indel/SNP+MNP ratio : 0.09 (3675/41945)
HG004:
Location : /data/HG004.vcf.gz
Failed Filters : 11918
Passed Filters : 45967
SNPs : 42261
MNPs : 0
Insertions : 1910
Deletions : 1772
Indels : 21
Same as reference : 3
SNP Transitions/Transversions: 2.54 (41895/16482)
Total Het/Hom ratio : 1.61 (28344/17620)
SNP Het/Hom ratio : 1.63 (26169/16092)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : 1.20 (1041/869)
Deletion Het/Hom ratio : 1.69 (1113/659)
Indel Het/Hom ratio : - (21/0)
Insertion/Deletion ratio : 1.08 (1910/1772)
Indel/SNP+MNP ratio : 0.09 (3703/42261)
sudo docker pull pkrusche/hap.py
declare -a trio=(HG002 HG003 HG004)
for SAMPLE in "${trio[@]}"
do
sudo docker run -i \
-v "${DIR}":"/data" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
"/data/${SAMPLE}_truth.vcf.gz" \
"/data/${SAMPLE}.vcf.gz" \
-f "/data/${SAMPLE}_truth.bed" \
-T "/data/${CAPTURE_BED}" \
-r "/data/hs37d5.fa" \
-o "/data/${SAMPLE}.happy.output" \
--engine=vcfeval > ${DIR}/${SAMPLE}.stdout
done
Accuracy F1 scores:
Sample | Indel | SNP |
---|---|---|
HG002 | 0.973295 | 0.999318 |
HG003 | 0.969842 | 0.998980 |
HG004 | 0.974818 | 0.999183 |