should bcftools norm --multiallelics - respect lexicographical order? #1484

@freeseek

This is not a bug, but a behavior that could perhaps be improved.

I have noticed that when multiallelic variants are split in different VCF files, the resulting variants do not necessarily end up in the same order across files. Here is an example:

(echo "##fileformat=VCFv4.1"
echo "##contig=<ID=chr19>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr19\t50359054\t.\tC\tT,A\t.\t.\t.") > in.vcf

Now if I split with bcftools norm --multiallelics -:

$ bcftools norm --multiallelics - --no-version in.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr19>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr19	50359054	.	C	T	.	.	.
chr19	50359054	.	C	A	.	.	.
Lines   total/split/realigned/skipped:	1/1/0/0

But if I further sort the file:

$ bcftools norm --multiallelics - --no-version in.vcf | bcftools sort
Writing to /tmp/bcftools-sort.HmQM3B
Lines   total/split/realigned/skipped:	1/1/0/0
Merging 1 temporary files
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr19>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr19	50359054	.	C	A	.	.	.
chr19	50359054	.	C	T	.	.	.
Cleaning
Done

The order of the variants has changed. I suppose it would be quite a bit of work to rewrite bcftools norm so that it sorts the variants after splitting them, which would make the separate sort step unnecessary, but I am reporting this nevertheless just in case.
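
To illustrate the ordering I have in mind, here is a minimal sketch in plain shell. It assumes that records sharing CHROM and POS should be ordered by byte-wise comparison of REF and then ALT, which reproduces the order in the example above but is not confirmed to match the comparator bcftools sort uses in every case. The helper name sort_split_records is hypothetical and not part of bcftools; it expects header-less, tab-separated VCF records on stdin.

```shell
# Hypothetical helper (not part of bcftools): reorder header-less VCF records so
# that records sharing CHROM and POS are ordered by REF, then ALT (byte order).
# Contigs keep the order in which they first appear in the input.
sort_split_records() {
    tab=$(printf '\t')
    # Prefix each record with the rank of its contig (first-seen order).
    awk -F '\t' -v OFS='\t' '!($1 in rank) { rank[$1] = ++n } { print rank[$1], $0 }' |
        # Fields after prefixing: 1=rank 2=CHROM 3=POS 4=ID 5=REF 6=ALT
        LC_ALL=C sort -t "$tab" -k1,1n -k3,3n -k5,5 -k6,6 |
        # Drop the temporary rank column again.
        cut -f2-
}

# Demo with the two split records from the example above (T first, then A).
printf 'chr19\t50359054\t.\tC\tT\t.\t.\t.\nchr19\t50359054\t.\tC\tA\t.\t.\t.\n' |
    sort_split_records
```

The header lines would need to be written separately (for example with grep '^#') before the reordered records.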

I noticed this because some HTSlib-based tools (such as IMPUTE5 v1.1.4) do not work if the variants are ordered differently across VCF files.
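
A quick way to check whether two uncompressed VCF files list their variants in the same order is sketched below. The helper names variant_keys and same_variant_order are hypothetical, and the check compares only the CHROM, POS, REF and ALT columns; whether that is the exact criterion a given downstream tool applies is an assumption.

```shell
# Hypothetical helpers (not part of bcftools or HTSlib).
# Print CHROM, POS, REF, ALT for every record of an uncompressed VCF file.
variant_keys() {
    grep -v '^#' "$1" | cut -f1,2,4,5
}

# Succeed only if both files list the same variants in the same order.
same_variant_order() {
    [ "$(variant_keys "$1")" = "$(variant_keys "$2")" ]
}
```

For example, `same_variant_order a.vcf b.vcf && echo "same order"` prints the message only when the two key lists are identical.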
