Skip to content

Accessing correct AD values after LAD>AD transformation and site normalisation #2420

@magdadrozdz

Description

@magdadrozdz

Hello. I have a msVCF with multiallelic sites, which I need to normalise to the biallelic representation.

Here is an example of the msVCF:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LAD,Number=.,Type=Integer,Description="Localized field: Allelic Depths">
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="Mapping of alt allele index from original gVCF to msVCF, comma-separated, 1-based (each value is the allele index in the msVCF)">
#CHROM POS    ID     REF    ALT    QUAL   FILTER INFO   FORMAT HG001  HG002  HG003  HG004  HG005  HG006    HG007
chr21  46668751       .      CTT    C,CT   50.42  PASS GT:LAD:LAA 2/1:0,12,8:2,1 0/1:19,7:1 0/2:16,12:2 1/2:0,17,14:1,2 1/1:0,25:1 1/2:0,17,11:1,2 1/1:0,43:1

LAD-AD conversion:

bcftools +tag2tag example_vcf.gz -- --LAD-to-AD | bcftools view -Oz -o example_vcf_transformed.vcf.gz 
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LAD,Number=.,Type=Integer,Description="Localized field: Allelic Depths">
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="Mapping of alt allele index from original gVCF to msVCF, comma-separated, 1-based (each value is the allele index in the msVCF)">
##contig=<ID=chr21>
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS    ID     REF    ALT    QUAL   FILTER INFO   FORMAT HG001  HG002  HG003  HG004  HG005  HG006  HG007
chr21  46668751       .      CTT    C,CT   50.42  PASS   .      GT:LAD:LAA:AD  2/1:0,12,8:2,1:0,8,12  0/1:19,7:1:19,7,.      0/2:16,12:2:16,.,12    1/2:0,17,14:1,2:0,17,14 1/1:0,25:1:0,25,.      1/2:0,17,11:1,2:0,17,11 1/1:0,43:1:0,43,.

Normalisation:

bcftools norm -f $reference -m -both --multi-overlaps 0 example_vcf_transformed.vcf.gz -Oz -o example_vcf_normalised.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=LAD,Number=.,Type=Integer,Description="Localized field: Allelic Depths">
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="Mapping of alt allele index from original gVCF to msVCF, comma-separated, 1-based (each value is the allele index in the msVCF)">
##contig=<ID=chr21>
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS    ID     REF    ALT    QUAL   FILTER INFO   FORMAT HG001  HG002  HG003  HG004  HG005  HG006  HG007
chr21  46668751       .      CTT    C      50.42  PASS   .      GT:LAD:LAA:AD  0/1:0,12,8:2,1:0,8     0/1:19,7:1:19,7 0/0:16,12:2:16,.       1/0:0,17,14:1,2:0,17   1/1:0,25:1:0,25 1/0:0,17,11:1,2:0,17   1/1:0,43:1:0,43
chr21  46668751       .      CT     C      50.42  PASS   .      GT:LAD:LAA:AD  1/0:0,12,8:2,1:0,12    0/0:19,7:1:19,. 0/1:16,12:2:16,12      0/1:0,17,14:1,2:0,14   0/0:0,25:1:0,. 0/1:0,17,11:1,2:0,11   0/0:0,43:1:0,.

I would like to calculate the ratio of allele depths (AB ratio) for each entry, ideally using the normalised msVCF as I have to represent the results in a normalised msVCF-compatible format. To do this, I need the AD value for each allele. I have noticed that tag2tag LAD>AD conversion at decomposed e.g. 1/2 genotypes (not originally having a reference allele) automatically pulls the AD of the reference allele for the biallelic representation instead of the other alternate allele, as illustrated below:

Original entry:

POS   GT	LAD	LAA
CTT/C,CT	2/1	0,12,8	2,1

After LAD>AD conversion:

POS	GT	LAD	LAA    AD	
chr21:46668751-CTT/C,CT 2/1     0,12,8  2,1     0,8,12

After normalisation:

POS	GT	LAD	LAA	AD
CTT/C	0/1	0,12,8	2,1	0,8
CTT/CT	1/0	0,12,8	2,1	0,12

I have several questions:
Is it an expected behaviour that LAD>AD conversion will result with an AD field with rearranged LAD values? (e.g. 0,12,8 -> 0,8,12)
Is it an expected behaviour that the AD of the reference allele is pulled after normalisation, instead of the other alt's AD?

Apart from the simple AB ratio, I would also like to use binom(AD) > 0.01 filter on the normalised msVCF, but with the above happening, I will get wrong results for normalised old multiallelic sites. I would like to perform this per-allele, so I need to do this on the normalised VCF.

  1. What would be your suggestion to force bcftools to use the correct AD/LAD values for the binomial test (correct = representing the AD of the other allele, rather than reference allele when it is a 2/1 or 2/3 genotype)?

I realised that all "potentially problematic" genotypes still had the LAA field with 2 values, whereas real heterozygotes with a reference call (e.g. true, non-decomposed 0/1) would just have a "1" or a "2" - a single value. I almost had a workaround by using the COUNT function in the filter field to select heterozygotes with more than 1 LAA value and then force bcftools to look at the correct values while performing the binomial test with binom(LAD[:1],LAD[:2])>0.01, something like this:

bcftools query \
 -r chr21:46668751 \
 -f '[%CHROM:%POS-%REF/%ALT\t%GT\t%LAA\t%LAD\t%SAMPLE\n]' \
 -i 'GT="het" & SMPL_COUNT(LAA)>1' \
 example_vcf_transformed.vcf.gz

but unfortunately it cannot be applied sample-wise as SMPL_COUNT is not supported. I tried a workaround with a custom perl script, but it also did not work sample wise. So here is the last question:
4. Is there any way I can count LAA elements per sample? Could this be implemented?

I am aware this is quite a hacky solution, but I will be very grateful for any help overcoming this :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions