Hi, I ran into this while investigating pysam issue pysam-developers/pysam#1407.
With a symbolic deletion record that has both END and SVLEN, htslib-backed parsing now reports an rlen/stop one base larger than INFO/END implies.
Minimal VCF:
##fileformat=VCFv4.2
##contig=<ID=chr1,length=5000000>
##INFO=<ID=END,Number=1,Type=Integer,Description="End position">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="SV length">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SV type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 2651001 . N <DEL> . PASS END=2658000;SVLEN=7000;SVTYPE=DEL GT 0/1
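VCF columns must be tab-separated, and the whitespace above may not survive copy/paste. As a convenience, here is a small sketch that regenerates the same file with explicit tabs. The helper name `write_minimal_vcf` is only illustrative; the header and record contents are exactly the lines above.

```python
from pathlib import Path

# Header lines copied verbatim from the minimal VCF above.
HEADER = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=5000000>",
    '##INFO=<ID=END,Number=1,Type=Integer,Description="End position">',
    '##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="SV length">',
    '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SV type">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
]
COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
           "INFO", "FORMAT", "SAMPLE"]
RECORD = ["chr1", "2651001", ".", "N", "<DEL>", ".", "PASS",
          "END=2658000;SVLEN=7000;SVTYPE=DEL", "GT", "0/1"]


def write_minimal_vcf(path="symbolic_deletion.vcf"):
    """Write the minimal VCF with tab-separated columns (illustrative helper)."""
    lines = HEADER + ["\t".join(COLUMNS), "\t".join(RECORD)]
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out
```

Calling `write_minimal_vcf()` with no arguments produces the `symbolic_deletion.vcf` file used in the snippet that follows.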
Observed via pysam, which exposes htslib's bcf1_t.rlen:
import pysam
rec = next(pysam.VariantFile("symbolic_deletion.vcf"))
print(rec.start, rec.stop, rec.rlen)
On pysam==0.24.0 / bundled htslib 1.23.1, this prints:
On pysam==0.23.3, the same record prints:
From reading vcf.c:get_rlen(), the new value appears to come from SVLEN=7000 being converted to end_svlen = v->pos + len + 1, then taking the maximum of END and end_svlen. Since v->pos is 0-based and INFO/END is 1-based inclusive, this makes the effective 0-based exclusive stop one base beyond END for this record.
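To make the off-by-one concrete, here is the arithmetic for this record as plain Python. This is only a sketch of my reading of `get_rlen()` as described above, not the actual htslib code, and the function names are made up for illustration.

```python
# Values from the record above.
POS_1BASED = 2651001   # VCF POS column (1-based)
INFO_END = 2658000     # INFO/END (1-based inclusive)
SVLEN = 7000           # INFO/SVLEN


def stop_from_end(info_end):
    # A 1-based inclusive END is numerically equal to the 0-based exclusive stop.
    return info_end


def stop_from_svlen(pos0, svlen):
    # Assumption: mirrors the expression as I read it, end_svlen = v->pos + len + 1.
    return pos0 + svlen + 1


pos0 = POS_1BASED - 1                          # v->pos is 0-based
end_only_stop = stop_from_end(INFO_END)
svlen_stop = stop_from_svlen(pos0, SVLEN)
combined_stop = max(end_only_stop, svlen_stop)  # maximum of END and end_svlen

print("END only:        stop =", end_only_stop, "rlen =", end_only_stop - pos0)
print("max(END, SVLEN): stop =", combined_stop, "rlen =", combined_stop - pos0)
```

Under these assumptions, END alone gives a stop of 2658000 (rlen 7000), while taking the maximum with the SVLEN-derived end gives 2658001 (rlen 7001), which is the one-base difference described above.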
Question: when END is present for a symbolic <DEL>, should it remain authoritative for the 0-based exclusive interval exposed as pos + rlen, or is this SVLEN interpretation expected under the newer VCF 4.4/4.5 rlen logic? If this is intended behavior, it would help to clarify so pysam can adjust expectations/docs. If not, I am happy to help with a small regression test/fix.