Continuing from this post: #1651 about the "Error: cannot use arithmetic operators to compare strings and numbers" error.
After upgrading to v1.16 I am still unable to filter by certain columns. After further testing, I have figured out that the error occurs when filtering on fields that have been added as custom annotations [see here]. gnomAD allele frequencies are often added via custom annotations, as the built-in option in VEP does not use the newer WGS data.
I have used a slightly altered version of the sample data from the bottom of that page as an example here (edited only to add a contig definition): vep_eg.vcf.gz. The ClinVar field is added via a custom annotation, while the cDNA_position field is a standard VEP field.
bcftools +split-vep -d -c ClinVar:Int,cDNA_position:Int -f "%CHROM\t%POS\t%ClinVar\t%cDNA_position\n" vep_eg.vcf.gz
1 230710048 18068 1018
1 230710048 18068 .
When filtering by a field added as a standard VEP option, such as cDNA_position, it works as expected:
bcftools +split-vep -d -c ClinVar:Int,cDNA_position:Int -f "%CHROM\t%POS\t%ClinVar\t%cDNA_position\n" vep_eg.vcf.gz -i "cDNA_position>1000"
1 230710048 18068 1018
but fields from custom annotations such as ClinVar do not:
bcftools +split-vep -d -c ClinVar:Int,cDNA_position:Int -f "%CHROM\t%POS\t%ClinVar\t%cDNA_position\n" vep_eg.vcf.gz -i "ClinVar>1000"
Error: cannot use arithmetic operators to compare strings and numbers
I believe this is because custom annotation fields are also defined individually in the VCF header, in addition to being listed in the CSQ definition, and are given Type=String by default.
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|SOURCE|ClinVar|ClinVar_CLNSIG|ClinVar_CLNREVSTAT|ClinVar_CLNDN">
##INFO=<ID=ClinVar,Number=.,Type=String,Description="/path/to/custom_files/clinvar.vcf.gz (exact)">
If I manually change the type in the VCF header to Type=Integer for the ClinVar field, filtering works as expected. I am unsure whether it is possible to set this when adding custom annotations, as it is not mentioned on the custom annotation page. In any case, the vast majority of pre-existing files have all custom annotation fields listed as String, so it would be beneficial for the type set in split-vep to override the VCF header.
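For anyone hitting the same error, the manual header change described above could be scripted roughly as follows. This is only a sketch of the workaround, not a fix for the underlying behaviour: `retype_info` is a hypothetical helper name, and it assumes the INFO line has the `ID=...,Number=...,Type=...` field order shown in the header above.

```shell
# Hypothetical helper: rewrite the Type of a single INFO header line.
# Reads a VCF header on stdin and writes the edited header to stdout.
#   $1 = INFO ID to change (e.g. ClinVar)
#   $2 = new Type (e.g. Integer)
# Assumes the field order ID=...,Number=...,Type=... as in the header above.
retype_info() {
    sed -E "s/^(##INFO=<ID=$1,Number=[^,]*,Type=)[^,]*/\1$2/"
}

# Intended use with bcftools (left commented out so the helper can be
# tried without any input file):
#   bcftools view -h vep_eg.vcf.gz | retype_info ClinVar Integer > header.txt
#   bcftools reheader -h header.txt -o vep_eg.retyped.vcf.gz vep_eg.vcf.gz
```

Only the line whose ID matches exactly is touched, so the CSQ definition and fields such as ClinVar_CLNSIG keep their original type.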
Relatedly, while testing this I have also realised that it doesn't seem to be possible to filter for missing values in numeric fields within the VEP CSQ field. This is especially important for annotations like gnomAD allele frequencies, as a missing value means the variant has not been observed in the dataset and so is essentially equivalent to AF=0. The expressions -i "cDNA_position='.'" and -i "cDNA_position=''" both work with cDNA_position:String but not with cDNA_position:Int.
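In the meantime, one possible stopgap is to skip `-i` for the numeric field and post-filter the tab-delimited split-vep output instead, treating "." as "not observed". The sketch below assumes the output format used in the commands above; `keep_missing_or_below` is a hypothetical helper name, not part of bcftools.

```shell
# Hypothetical helper: keep rows where the chosen column is missing (".")
# or numerically below a threshold. Reads tab-delimited rows on stdin.
#   $1 = 1-based column number
#   $2 = threshold (exclusive upper bound)
keep_missing_or_below() {
    awk -F'\t' -v col="$1" -v max="$2" '$col == "." || ($col + 0) < max'
}

# Intended use (left commented out so the helper can be tried without input data):
#   bcftools +split-vep -d -c cDNA_position:Int \
#       -f "%CHROM\t%POS\t%cDNA_position\n" vep_eg.vcf.gz | keep_missing_or_below 3 1000
```

This works on the text output only, so it is no substitute for being able to express the same condition in a bcftools filtering expression.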