Hi,
I have a VCF file with the following content:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Samp1 Samp2
chr1 939398 . GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA G,GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCACCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA 5075.01 . . GT:AD:AF:DP 0/0:73,0,0:.:35 0/2:36,2,50:0.023,0.568:88
For the first sample, Samp1, the AF field in the FORMAT column is missing (.).
After running bcftools norm -m -any -f [reference], I got:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo1.norm.vcf.gz demo1.vcf.gz; Date=Tue Nov 15 15:58:25 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Samp1 Samp2
chr1 939398 . GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA G 5075.01 . . GT:AD:AF:DP 0/0:73,0:.:35 0/0:36,2:0.023:88
chr1 939398 . G GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA 5075.01 . . GT:AD:AF:DP 0/0:73,0::35 0/1:36,50:0.568:88
The Samp2 output is as I expected.
For Samp1, however, I expected both lines to have a missing value (.) for AF: the value was missing before the split, so it makes sense for it to be missing in both lines after the split.
Instead, the first line has "." but the second line has an empty AF value (0/0:73,0::35).
The --force option made no difference here.
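The expectation above can be written down as a small Python sketch. This is not bcftools code and does not claim to mirror its implementation; the function names `split_number_a` and `split_number_r` are made up for illustration, and the only assumption is that a single "." stands for the whole field being missing.

```python
def split_number_a(value, n_alt):
    """Split a Number=A FORMAT value into one value per ALT allele.

    Hypothetical helper: a lone "." means the whole field is missing,
    so every split record is expected to carry "." as well.
    """
    if value == ".":
        return ["."] * n_alt
    parts = value.split(",")
    if len(parts) != n_alt:
        raise ValueError(f"expected {n_alt} values, got {len(parts)}")
    return parts


def split_number_r(value, n_alt):
    """Split a Number=R FORMAT value into 'ref,alt' pairs, one per ALT allele."""
    if value == ".":
        return ["."] * n_alt
    parts = value.split(",")
    if len(parts) != n_alt + 1:
        raise ValueError(f"expected {n_alt + 1} values, got {len(parts)}")
    ref, alts = parts[0], parts[1:]
    return [f"{ref},{alt}" for alt in alts]
```

With the values from the record above, `split_number_a(".", 2)` gives "." for both split lines (the expected Samp1 result), while `split_number_a("0.023,0.568", 2)` and `split_number_r("36,2,50", 2)` reproduce the Samp2 values that bcftools already writes correctly.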
However, when I ran the same command on a file containing only Samp1, I got the results I expected:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo2.norm.vcf.gz demo2.vcf.gz; Date=Tue Nov 15 15:58:32 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Samp1
chr1 939398 . GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA G 5075.01 . . GT:AD:AF:DP 0/0:73,0:.:35
chr1 939398 . G GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA 5075.01 . . GT:AD:AF:DP 0/0:73,0:.:35
I've experimented with various inputs and concluded that the issue occurs only when the field to be split is missing for some samples but not for others. I had no problems when all samples had values or when all samples were missing.
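For anyone who wants to check their own normalized files for the same symptom, here is a minimal Python sketch that scans VCF text for empty per-sample FORMAT values such as the `::` in `0/0:73,0::35`. The function name `find_empty_format_values` is hypothetical, and the sketch assumes a standard tab-delimited VCF with a FORMAT column.

```python
def find_empty_format_values(vcf_lines):
    """Return (chrom, pos, sample, key) for every empty FORMAT value.

    An empty value is a zero-length subfield between two ':' separators,
    which is different from an explicit missing value ".".
    """
    samples = []
    hits = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        keys = fields[8].split(":")
        for sample, column in zip(samples, fields[9:]):
            for key, value in zip(keys, column.split(":")):
                if value == "":
                    hits.append((fields[0], fields[1], sample, key))
    return hits
```

For a compressed file, the lines can be read with `gzip.open(path, "rt")` and passed straight to the function. On the two-sample output shown earlier it should report the AF value of Samp1 on the second split record.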
Thank you,
In-Hee Lee