Using
bcftools norm -m - --multi-overlaps . test.vcf
on the following input:
##fileformat=VCFv4.2
##reference=ref.fasta
##contig=<ID=1,length=51304566>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B
1 511 . C G,T . . . GT 1|0 0|1
1 511 . C T,G . . . GT 2|0 0|2
produces the following variants:
1 511 . C G . . . GT 1|0 0|1
1 511 . C T . . . GT .|. .|.
1 511 . C T . . . GT .|0 0|.
1 511 . C G . . . GT 1|. .|1
I would expect both input records to produce the same output records, since they differ only in the order of their ALT alleles. By contrast, bcftools norm --atomize --atom-overlaps produces the same output regardless of the ALT allele order in the input:
1 511 . C G . . . GT 1|0 0|1
1 511 . C T . . . GT .|0 0|.
1 511 . C G . . . GT 1|0 0|1
1 511 . C T . . . GT .|0 0|.
I tracked the difference down to this line of code:
https://github.com/samtools/bcftools/blob/develop/vcfnorm.c#L875
which keeps a reference allele as reference (0) only in the first split record written for a multiallelic site; in every later split record the reference allele is set to missing (.).
Is there a specific reason for treating refs in the first output variant differently?
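
To make the behaviour concrete, here is a small Python model of what I believe is happening. It is only a sketch based on the outputs shown above, not a transcription of vcfnorm.c; the function names and the `ref_only_in_first` switch are hypothetical and exist just for illustration. With the switch on, it reproduces the four records I get from `-m - --multi-overlaps .`; with it off, the result no longer depends on the ALT order.

```python
def split_genotypes(n_alt, genotypes, ref_only_in_first=True):
    """Toy model of splitting one multiallelic record into biallelic ones.

    genotypes: one tuple of allele indices per sample (0 = REF, None = missing).
    Returns, for each ALT allele in order, the list of per-sample genotypes.
    ref_only_in_first is a hypothetical switch modelling the observed behaviour:
    REF (0) is kept only in the first split record and set to missing afterwards.
    """
    records = []
    for k in range(1, n_alt + 1):          # k = index of the ALT kept in this record
        rec = []
        for gt in genotypes:
            new = []
            for a in gt:
                if a == k:
                    new.append(1)          # the ALT of this record becomes allele 1
                elif a == 0 and (k == 1 or not ref_only_in_first):
                    new.append(0)          # REF is preserved
                else:
                    new.append(None)       # other ALTs (and dropped REFs) become missing
            rec.append(tuple(new))
        records.append(rec)
    return records


def fmt(rec):
    """Format per-sample genotypes as phased VCF GT strings."""
    return " ".join("|".join("." if a is None else str(a) for a in gt) for gt in rec)


# First input record:  C -> G,T with GT 1|0 0|1
for rec in split_genotypes(2, [(1, 0), (0, 1)]):
    print(fmt(rec))   # 1|0 0|1  then  .|. .|.

# Second input record: C -> T,G with GT 2|0 0|2
for rec in split_genotypes(2, [(2, 0), (0, 2)]):
    print(fmt(rec))   # .|0 0|.  then  1|. .|1
```

Calling `split_genotypes(..., ref_only_in_first=False)` gives `1|0 0|1` for the G record and `.|0 0|.` for the T record for both inputs, which is the order-independent result I expected.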