filtering expression - handling of missing values by != and !~ #2355

@aheinzel

Dear all,

I have a question regarding the handling of missing values when using != and !~ in filtering expressions in bcftools v1.19+htslib-1.19. To support my question, I include a minimal VCF file with one INFO tag named TAG.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	*	*	.	.	TAG=a,b,c	GT	0/0
chr1	2	.	*	*	.	.	TAG=a	GT	0/0
chr1	3	.	*	*	.	.	TAG=.	GT	0/.
chr1	4	.	*	*	.	.	TAG=.,.	GT	./0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.

I can use the standard string comparison operator in view -i TAG[*]!="." to include only sites with at least one non-missing value for TAG:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -i TAG[*]!="." input.vcf; Date=Sat Jan 18 17:33:45 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	*	*	.	.	TAG=a,b,c	GT	0/0
chr1	2	.	*	*	.	.	TAG=a	GT	0/0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.

Doing the same with the regex operator view -i TAG[*]!~"\." does not filter out any variants:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -i TAG[*]!~"\." input.vcf; Date=Sat Jan 18 17:37:27 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	*	*	.	.	TAG=a,b,c	GT	0/0
chr1	2	.	*	*	.	.	TAG=a	GT	0/0
chr1	3	.	*	*	.	.	TAG=.	GT	0/.
chr1	4	.	*	*	.	.	TAG=.,.	GT	./0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.

Sorry, I just realized that this may not be conclusive, so I am adding another example with view -i TAG[*]!~"[A-z]":

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=5>
##INFO=<ID=TAG,Number=.,Type=String,Description="Some tag">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.19+htslib-1.19
##bcftools_viewCommand=view -i TAG[*]!~"[A-z]" input.vcf; Date=Sun Jan 19 15:15:24 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	3	.	*	*	.	.	TAG=.	GT	0/.
chr1	4	.	*	*	.	.	TAG=.,.	GT	./0
chr1	5	.	*	*	.	.	TAG=a,.,c	GT	./.

Does the negated regex operator automatically evaluate to true for missing values?
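
To make the comparison concrete, here is a small Python sketch of what I would expect under an "at least one value satisfies the condition" reading of TAG[*]. It is only a model for reasoning about the outputs above, not bcftools code. The function names are made up for illustration, and the third predicate encodes the hypothesis from my question (a missing value always satisfies the negated regex), which is an assumption and not confirmed behaviour.

```python
import re

# INFO/TAG values of the five records in the example VCF, keyed by POS.
RECORDS = {
    1: ["a", "b", "c"],
    2: ["a"],
    3: ["."],
    4: [".", "."],
    5: ["a", ".", "c"],
}


def any_not_equal(values, literal):
    """True if at least one value differs from the literal string."""
    return any(v != literal for v in values)


def any_not_matching(values, pattern):
    """True if at least one value is not matched by the regex."""
    rx = re.compile(pattern)
    return any(rx.search(v) is None for v in values)


def any_not_matching_missing_true(values, pattern):
    """Hypothetical variant: a missing value "." always satisfies the negated regex."""
    rx = re.compile(pattern)
    return any(v == "." or rx.search(v) is None for v in values)


def selected(predicate, arg):
    """Return the POS of every record for which the predicate holds."""
    return [pos for pos, values in RECORDS.items() if predicate(values, arg)]


print(selected(any_not_equal, "."))                       # [1, 2, 5]
print(selected(any_not_matching, r"\."))                  # [1, 2, 5]
print(selected(any_not_matching_missing_true, r"\."))     # [1, 2, 3, 4, 5]
print(selected(any_not_matching, "[A-z]"))                # [3, 4, 5]
print(selected(any_not_matching_missing_true, "[A-z]"))   # [3, 4, 5]
```

In this model, the observed output for !~"\." (all five records) is reproduced only by the variant in which missing values always satisfy the negated regex. The !~"[A-z]" example gives records 3, 4 and 5 under both variants, so it cannot distinguish between them.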
