Skip to content

Feature request: Populate new allele-specific annotations from the <NON_REF> allele in bcftools merge #1888

@DonFreed

Description

@DonFreed

Opening this issue to ask how the bcftools maintainers feel about populating novel allele information from the <NON_REF> allele when merging multiple gVCFs?

Downstream tools (the Sentieon GVCFtyper and GATK's GenotypeGVCFs) expect the FORMAT/PL field of the merged gVCF to be fully-populated, and these missing fields result in dropped variants. Unlike the more general case, where we don't know the values for the new terms, these fields can be copied from the <NON_REF> allele for GATK-style gVCFs.

Here is a more detailed explanation with an example:

I have two gVCFs with the following records.

sample1.g.vcf.gz:

chr1    1769963 .       A       <NON_REF>       .       .       END=1769968     GT:DP:GQ:MIN_DP:PL      0/0:11:3:11:0,3,45
chr1    1769969 .       CAAAACA C,<NON_REF>     367.74  .       DP=11;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=39600.00      GT:AD:DP:GQ:PL:SB     1/1:0,9,0:9:27:405,27,0,405,27,405:0,0,4,5
chr1    1769976 .       A       <NON_REF>       .       .       END=1769976     GT:DP:GQ:MIN_DP:PL      0/0:10:0:10:0,0,0

sample2.g.vcf.gz:

chr1    1769968 .       T       <NON_REF>       .       .       END=1769968     GT:DP:GQ:MIN_DP:PL      0/0:9:18:9:0,18,270
chr1    1769969 .       CAAAACAAAAACA   C,<NON_REF>     144.00  .       DP=7;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=25200.00       GT:AD:DP:GQ:PL:SB      1/1:0,4,0:4:12:181,12,0,181,12,181:0,0,3,1
chr1    1769982 .       A       <NON_REF>       .       .       END=1769982     GT:DP:GQ:MIN_DP:PL      0/0:7:0:7:0,0,0

I can merge the two gvcfs using bcftools merge --gvcf. During the merge, bcftools conservatively chooses to output missing information for new alleles/genotypes at multi-allelic sites in the AD and PL fields:

chr1    1769968 .       T       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      0/0:11:3:11:0,3,45      0/0:9:18:9:0,18,270
chr1    1769969 .       CAAAACAAAAACA   CAAAACA,<NON_REF>,C     367.74  .       ExcessHet=3.0103;RAW_MQ=39600;DP=18;MLEAC=2,0,2;MLEAF=1,0,1     GT:AD:DP:GQ:PL:SB      1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5    3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1
chr1    1769976 .       A       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      0/0:10:0:10:0,0,0       ./.:.:.:.:.

This missing information causes these variants to be skipped during later joint genotyping with the GATK's GenotypeGVCFs:

$ gatk GenotypeGVCFs --allow-old-rms-mapping-quality-annotation-data -L chr1:1768969-1770969 \
  -R $ref -V merged.g.vcf.gz -O merged_gatk.vcf.gz
Using GATK jar /<path>/gatk-package-4.2.6.1-local.jar
...
13:29:32.115 WARN  MinimalGenotypingEngine - No genotype contained sufficient data to recalculate site and allele qualities. Site will be skipped at location chr1:1769969
...
$ zcat merged_gatk.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG001   HG002
chr1    1769737 .       GA      G       408.45  .      AC=2;AF=0.500;AN=4;DP=17;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=101.00;QD=25.53;RAW_MQ=61200.00;SOR=0.941     GT:AD:DP:GQ:PL  1/1:0,16:16:48:424,48,0 0/0:11,0:11:27

Or with Sentieon's GVCFtyper:

$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
  -v merged.g.vcf.gz merged_sentieon_subset.vcf.gz
$ zcat merged_sentieon_subset.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG001   HG002
chr1    1769737 .       GA      G       388.16  .       AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941     GT:AD:DP:GQ:PL  1/1:0,16:16:48:424,48,0 ./.:11,0:11:.:.

These tools expect the information for all genotypes in the FORMAT/PL field to be populated from the <NON_REF> allele, and the joint calling output from the merged gVCF matches the joint call output from the individual gVCFs if this information is filled-in.

$ bcftools view -h merged.g.vcf.gz > merged_mod.g.vcf
$ bcftools view -H -r chr1:1768969-1770969 merged.g.vcf.gz | \
    sed 's|GT:AD:DP:GQ:PL:SB    1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5    3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1|GT:AD:DP:GQ:PL:SB  1/1:0,9,0,0:9:27:405,27,0,405,27,405,405,27,405,405:0,0,4,5 3/3:0,0,0,4:4:12:181,181,181,181,181,181,12,12,12,0:0,0,3,1|' \
    >> merged_mod.g.vcf
$ bcftools view -o merged_mod.g.vcf.gz -O z merged_mod.g.vcf
$ bcftools index -t merged_mod.g.vcf.gz

$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
    -v merged_mod.g.vcf.gz merged_sentieon_subset_mod.vcf.gz

$ zcat merged_sentieon_subset_mod.vcf.gz
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG001	HG002
chr1	1769737	.	GA	G	388.16	.	AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941 GT:AD:DP:GQ:PL	1/1:0,16:16:48:424,48,0	./.:11,0:11:.:.
chr1	1769969	.	CAAAACAAAAACA	CAAAACA,C	550.30	.	AC=2,2;AF=0.5,0.5;AN=4;DP=18;ExcessHet=3.0103;FS=0.000;MLEAC=2,2;MLEAF=0.5,0.5;MQ=46.90;QD=42.33;SOR=0.836	GT:AD:DP:GQ:PL	1/1:0,9,0:9:27:405,27,0,405,27,405	2/2:0,0,4:4:12:181,181,181,12,12,0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions