norm remove duplicates doesn't handle SVLEN, removes non-duplicate symbolic variants #2182

@davmlaw

The following VCF contains 3 deletions of length 1kb, 2kb and 3kb:

##fileformat=VCFv4.1
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##contig=<ID=NC_000012.11,length=141213431>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
NC_000012.11	88520131	23651	C	<DEL>	.	.	SVLEN=-1000;SVTYPE=DEL
NC_000012.11	88520131	24042	C	<DEL>	.	.	SVLEN=-2000;SVTYPE=DEL
NC_000012.11	88520131	24043	C	<DEL>	.	.	SVLEN=-3000;SVTYPE=DEL

Running the following command, even with --rm-dup=exact, removes records that share the same CHROM/POS/REF/ALT, even though their SVLEN values differ and they are therefore separate variants:

bcftools norm --remove-duplicates --rm-dup=exact symbolic_uniq.vcf

If this is difficult to fix, it would be good to at least raise a warning, as the current behavior is silent data loss. Thanks

bcftools --version
bcftools 1.20
Using htslib 1.20
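
As a possible stopgap until this is handled in bcftools, the de-duplication could be done outside of it with SVLEN included in the comparison key. The following is only a minimal sketch in Python, not how bcftools works internally: the function name svlen_aware_dedup is hypothetical, it assumes tab-separated VCF data lines, and it only looks at SVLEN (other distinguishing INFO fields such as END are ignored). The fourth record in the usage example is an invented true duplicate added for illustration.

```python
def svlen_aware_dedup(vcf_lines):
    """Keep the first record for each (CHROM, POS, REF, ALT, SVLEN) key.

    Hypothetical helper: header lines (starting with '#') pass through
    unchanged; SVLEN is None in the key when the INFO field lacks it.
    """
    seen = set()
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        info = fields[7] if len(fields) > 7 else "."
        svlen = None
        for entry in info.split(";"):
            if entry.startswith("SVLEN="):
                svlen = entry[len("SVLEN="):]
        key = (chrom, pos, ref, alt, svlen)
        if key in seen:
            continue  # same site, alleles and SVLEN: a real duplicate
        seen.add(key)
        kept.append(line)
    return kept


records = [
    "NC_000012.11\t88520131\t23651\tC\t<DEL>\t.\t.\tSVLEN=-1000;SVTYPE=DEL",
    "NC_000012.11\t88520131\t24042\tC\t<DEL>\t.\t.\tSVLEN=-2000;SVTYPE=DEL",
    "NC_000012.11\t88520131\t24043\tC\t<DEL>\t.\t.\tSVLEN=-3000;SVTYPE=DEL",
    # Invented duplicate of the first record (same SVLEN), for illustration only.
    "NC_000012.11\t88520131\tdup1\tC\t<DEL>\t.\t.\tSVLEN=-1000;SVTYPE=DEL",
]
deduped = svlen_aware_dedup(records)
for rec in deduped:
    print(rec)
```

With this key the three deletions from the VCF above are all kept, and only the invented fourth record is dropped.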
