-
Notifications
You must be signed in to change notification settings - Fork 260
Description
Hello. I have a msVCF with multiallelic sites, which I need to normalise to the biallelic representation.
Here is an example of the msVCF:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LAD,Number=.,Type=Integer,Description="Localized field: Allelic Depths">
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="Mapping of alt allele index from original gVCF to msVCF, comma-separated, 1-based (each value is the allele index in the msVCF)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002 HG003 HG004 HG005 HG006 HG007
chr21 46668751 . CTT C,CT 50.42 PASS GT:LAD:LAA 2/1:0,12,8:2,1 0/1:19,7:1 0/2:16,12:2 1/2:0,17,14:1,2 1/1:0,25:1 1/2:0,17,11:1,2 1/1:0,43:1
LAD-AD conversion:
bcftools +tag2tag example_vcf.gz -- --LAD-to-AD | bcftools view -Oz -o example_vcf_transformed.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LAD,Number=.,Type=Integer,Description="Localized field: Allelic Depths">
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="Mapping of alt allele index from original gVCF to msVCF, comma-separated, 1-based (each value is the allele index in the msVCF)">
##contig=<ID=chr21>
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002 HG003 HG004 HG005 HG006 HG007
chr21 46668751 . CTT C,CT 50.42 PASS . GT:LAD:LAA:AD 2/1:0,12,8:2,1:0,8,12 0/1:19,7:1:19,7,. 0/2:16,12:2:16,.,12 1/2:0,17,14:1,2:0,17,14 1/1:0,25:1:0,25,. 1/2:0,17,11:1,2:0,17,11 1/1:0,43:1:0,43,.
Normalisation:
bcftools norm -f $reference -m -both --multi-overlaps 0 example_vcf_transformed.vcf.gz -Oz -o example_vcf_normalised.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LAD,Number=.,Type=Integer,Description="Localized field: Allelic Depths">
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="Mapping of alt allele index from original gVCF to msVCF, comma-separated, 1-based (each value is the allele index in the msVCF)">
##contig=<ID=chr21>
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002 HG003 HG004 HG005 HG006 HG007
chr21 46668751 . CTT C 50.42 PASS . GT:LAD:LAA:AD 0/1:0,12,8:2,1:0,8 0/1:19,7:1:19,7 0/0:16,12:2:16,. 1/0:0,17,14:1,2:0,17 1/1:0,25:1:0,25 1/0:0,17,11:1,2:0,17 1/1:0,43:1:0,43
chr21 46668751 . CT C 50.42 PASS . GT:LAD:LAA:AD 1/0:0,12,8:2,1:0,12 0/0:19,7:1:19,. 0/1:16,12:2:16,12 0/1:0,17,14:1,2:0,14 0/0:0,25:1:0,. 0/1:0,17,11:1,2:0,11 0/0:0,43:1:0,.
I would like to calculate the ratio of allele depths (AB ratio) for each entry, ideally using the normalised msVCF as I have to represent the results in a normalised msVCF-compatible format. To do this, I need the AD value for each allele. I have noticed that tag2tag LAD>AD conversion at decomposed e.g. 1/2 genotypes (not originally having a reference allele) automatically pulls the AD of the reference allele for the biallelic representation instead of the other alternate allele, as illustrated below:
Original entry:
POS GT LAD LAA
CTT/C,CT 2/1 0,12,8 2,1
After LAD>AD conversion:
POS GT LAD LAA AD
chr21:46668751-CTT/C,CT 2/1 0,12,8 2,1 0,8,12
After normalisation:
POS GT LAD LAA AD
CTT/C 0/1 0,12,8 2,1 0,8
CTT/CT 1/0 0,12,8 2,1 0,12
I have several questions:
Is it an expected behaviour that LAD>AD conversion will result with an AD field with rearranged LAD values? (e.g. 0,12,8 -> 0,8,12)
Is it an expected behaviour that the AD of the reference allele is pulled after normalisation, instead of the other alt's AD?
Apart from the simple AB ratio, I would also like to use binom(AD) > 0.01 filter on the normalised msVCF, but with the above happening, I will get wrong results for normalised old multiallelic sites. I would like to perform this per-allele, so I need to do this on the normalised VCF.
- What would be your suggestion to force bcftools to use the correct AD/LAD values for the binomial test (correct = representing the AD of the other allele, rather than reference allele when it is a 2/1 or 2/3 genotype)?
I realised that all "potentially problematic" genotypes still had the LAA field with 2 values, whereas real heterozygotes with a reference call (e.g. true, non-decomposed 0/1) would just have a "1" or a "2" - a single value. I almost had a workaround by using the COUNT function in the filter field to select heterozygotes with more than 1 LAA value and then force bcftools to look at the correct values while performing the binomial test with binom(LAD[:1],LAD[:2])>0.01, something like this:
bcftools query \
-r chr21:46668751 \
-f '[%CHROM:%POS-%REF/%ALT\t%GT\t%LAA\t%LAD\t%SAMPLE\n]' \
-i 'GT="het" & SMPL_COUNT(LAA)>1' \
example_vcf_transformed.vcf.gz
but unfortunately it cannot be applied sample-wise as SMPL_COUNT is not supported. I tried a workaround with a custom perl script, but it also did not work sample wise. So here is the last question:
4. Is there any way I can count LAA elements per sample? Could this be implemented?
I am aware this is quite a hacky solution, but I will be very grateful for any help overcoming this :)