-
Notifications
You must be signed in to change notification settings - Fork 260
Description
Opening this issue to ask how the bcftools maintainers feel about populating novel allele information from the <NON_REF> allele when merging multiple gVCFs?
Downstream tools (the Sentieon GVCFtyper and GATK's GenotypeGVCFs) expect the FORMAT/PL field of the merged gVCF to be fully-populated, and these missing fields result in dropped variants. Unlike the more general case, where we don't know the values for the new terms, these fields can be copied from the <NON_REF> allele for GATK-style gVCFs.
Here is a more detailed explanation with an example:
I have two gVCFs with the following records.
sample1.g.vcf.gz:
chr1 1769963 . A <NON_REF> . . END=1769968 GT:DP:GQ:MIN_DP:PL 0/0:11:3:11:0,3,45
chr1 1769969 . CAAAACA C,<NON_REF> 367.74 . DP=11;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=39600.00 GT:AD:DP:GQ:PL:SB 1/1:0,9,0:9:27:405,27,0,405,27,405:0,0,4,5
chr1 1769976 . A <NON_REF> . . END=1769976 GT:DP:GQ:MIN_DP:PL 0/0:10:0:10:0,0,0
sample2.g.vcf.gz:
chr1 1769968 . T <NON_REF> . . END=1769968 GT:DP:GQ:MIN_DP:PL 0/0:9:18:9:0,18,270
chr1 1769969 . CAAAACAAAAACA C,<NON_REF> 144.00 . DP=7;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=25200.00 GT:AD:DP:GQ:PL:SB 1/1:0,4,0:4:12:181,12,0,181,12,181:0,0,3,1
chr1 1769982 . A <NON_REF> . . END=1769982 GT:DP:GQ:MIN_DP:PL 0/0:7:0:7:0,0,0
I can merge the two gvcfs using bcftools merge --gvcf. During the merge, bcftools conservatively chooses to output missing information for new alleles/genotypes at multi-allelic sites in the AD and PL fields:
chr1 1769968 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL 0/0:11:3:11:0,3,45 0/0:9:18:9:0,18,270
chr1 1769969 . CAAAACAAAAACA CAAAACA,<NON_REF>,C 367.74 . ExcessHet=3.0103;RAW_MQ=39600;DP=18;MLEAC=2,0,2;MLEAF=1,0,1 GT:AD:DP:GQ:PL:SB 1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5 3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1
chr1 1769976 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL 0/0:10:0:10:0,0,0 ./.:.:.:.:.
This missing information causes these variants to be skipped during later joint genotyping with the GATK's GenotypeGVCFs:
$ gatk GenotypeGVCFs --allow-old-rms-mapping-quality-annotation-data -L chr1:1768969-1770969 \
-R $ref -V merged.g.vcf.gz -O merged_gatk.vcf.gz
Using GATK jar /<path>/gatk-package-4.2.6.1-local.jar
...
13:29:32.115 WARN MinimalGenotypingEngine - No genotype contained sufficient data to recalculate site and allele qualities. Site will be skipped at location chr1:1769969
...
$ zcat merged_gatk.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002
chr1 1769737 . GA G 408.45 . AC=2;AF=0.500;AN=4;DP=17;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=101.00;QD=25.53;RAW_MQ=61200.00;SOR=0.941 GT:AD:DP:GQ:PL 1/1:0,16:16:48:424,48,0 0/0:11,0:11:27
Or with Sentieon's GVCFtyper:
$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
-v merged.g.vcf.gz merged_sentieon_subset.vcf.gz
$ zcat merged_sentieon_subset.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002
chr1 1769737 . GA G 388.16 . AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941 GT:AD:DP:GQ:PL 1/1:0,16:16:48:424,48,0 ./.:11,0:11:.:.
These tools expect the information for all genotypes in the FORMAT/PL field to be populated from the <NON_REF> allele, and the joint calling output from the merged gVCF matches the joint call output from the individual gVCFs if this information is filled-in.
$ bcftools view -h merged.g.vcf.gz > merged_mod.g.vcf
$ bcftools view -H -r chr1:1768969-1770969 merged.g.vcf.gz | \
sed 's|GT:AD:DP:GQ:PL:SB 1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5 3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1|GT:AD:DP:GQ:PL:SB 1/1:0,9,0,0:9:27:405,27,0,405,27,405,405,27,405,405:0,0,4,5 3/3:0,0,0,4:4:12:181,181,181,181,181,181,12,12,12,0:0,0,3,1|' \
>> merged_mod.g.vcf
$ bcftools view -o merged_mod.g.vcf.gz -O z merged_mod.g.vcf
$ bcftools index -t merged_mod.g.vcf.gz
$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
-v merged_mod.g.vcf.gz merged_sentieon_subset_mod.vcf.gz
$ zcat merged_sentieon_subset_mod.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002
chr1 1769737 . GA G 388.16 . AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941 GT:AD:DP:GQ:PL 1/1:0,16:16:48:424,48,0 ./.:11,0:11:.:.
chr1 1769969 . CAAAACAAAAACA CAAAACA,C 550.30 . AC=2,2;AF=0.5,0.5;AN=4;DP=18;ExcessHet=3.0103;FS=0.000;MLEAC=2,2;MLEAF=0.5,0.5;MQ=46.90;QD=42.33;SOR=0.836 GT:AD:DP:GQ:PL 1/1:0,9,0:9:27:405,27,0,405,27,405 2/2:0,0,4:4:12:181,181,181,12,12,0