Skip to content

Commit

Permalink
Add new -c ~INFO/END feature to match also by INFO/END tag
Browse files Browse the repository at this point in the history
  • Loading branch information
pd3 committed Aug 23, 2021
1 parent 7956618 commit b4195dd
Show file tree
Hide file tree
Showing 8 changed files with 109 additions and 15 deletions.
5 changes: 4 additions & 1 deletion NEWS
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,10 @@
- while non-symbolic variation are uniquely identified by POS,REF,ALT,
symbolic alleles starting at the same position were undistinguishable.
This prevented correct matching of records with the same positions and
variant type but different length given by INFO/END (samtools/htslib@60977f2)
variant type but different length given by INFO/END (samtools/htslib@60977f2).
When annotating froma VCF/BCF, the matching is done automatically. When
annotating from a tab-delimited text file, this feature can be invoked
by using `-c INFO/END`.

- add a new '.' modifier to control wheter missing values should be carried
over from a tab-delimited file or not. For example:
Expand Down
28 changes: 22 additions & 6 deletions doc/bcftools.1
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,12 @@
.\" Title: bcftools
.\" Author: [see the "AUTHOR(S)" section]
.\" Generator: Asciidoctor 2.0.15.dev
.\" Date: 2021-08-20
.\" Date: 2021-08-23
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "BCFTOOLS" "1" "2021-08-20" "\ \&" "\ \&"
.TH "BCFTOOLS" "1" "2021-08-23" "\ \&" "\ \&"
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.ss \n[.ss] 0
Expand Down Expand Up @@ -51,7 +51,7 @@ standard input (stdin) and outputs to the standard output (stdout). Several
commands can thus be combined with Unix pipes.
.SS "VERSION"
.sp
This manual page was last updated \fB2021\-08\-20 13:50 BST\fP and refers to bcftools git version \fB1.13\-25\-g41093eb+\fP.
This manual page was last updated \fB2021\-08\-23 13:49 BST\fP and refers to bcftools git version \fB1.13\-27\-g7956618+\fP.
.SS "BCF1"
.sp
The BCF1 format output by versions of samtools <= 0.1.19 is \fBnot\fP
Expand Down Expand Up @@ -575,13 +575,29 @@ number of annotation columns. BED files are expected to have
the ".bed" or ".bed.gz" suffix (case\-insensitive), otherwise a tab\-delimited file is assumed.
Note that in case of tab\-delimited file, the coordinates POS, FROM and TO are
one\-based and inclusive. When REF and ALT are present, only matching VCF
records will be annotated.
records will be annotated. If the END coordinate is present in the annotation file
and given on command line as "\f(CR\-c ~INFO/END\fP", then VCF records will be matched also by the INFO/END coordinate.
If ID is present in the annotation file and given as "\f(CR\-c ~ID\fP", then VCF records will be matched
also by the ID column.
\~
.br
\~
.br
When multiple ALT alleles are present in the annotation file (given as
comma\-separated list of alleles), at least one must match one of the
alleles in the corresponding VCF record. Similarly, at least one
alternate allele from a multi\-allelic VCF record must be present in the
annotation file.
Missing values can be added by providing "." in place of actual value.
\~
.br
\~
.br
Missing values can be added by providing "." in place of actual value
and using the missing value modifier with \fB\-c\fP, such as ".TAG".
\~
.br
\~
.br
Note that flag types, such as "INFO/FLAG", can be annotated by including
a field with the value "1" to set the flag, "0" to remove it, or "." to
keep existing flags.
Expand Down Expand Up @@ -640,7 +656,7 @@ exist in the target file; existing tags will not be overwritten.
To append to existing values (rather than replacing or leaving untouched), use "=TAG"
(instead of "TAG" or "+TAG").
To replace only existing values without modifying missing annotations, use "\-TAG".
To match the record also by ID, in addition to REF and ALT, use "~ID".
To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END".
\~
.br
\~
Expand Down
20 changes: 15 additions & 5 deletions doc/bcftools.html
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ <h2 id="_description">DESCRIPTION</h2>
<div class="sect2">
<h3 id="_version">VERSION</h3>
<div class="paragraph">
<p>This manual page was last updated <strong>2021-08-20 13:50 BST</strong> and refers to bcftools git version <strong>1.13-25-g41093eb+</strong>.</p>
<p>This manual page was last updated <strong>2021-08-23 13:49 BST</strong> and refers to bcftools git version <strong>1.13-27-g7956618+</strong>.</p>
</div>
</div>
<div class="sect2">
Expand Down Expand Up @@ -411,13 +411,23 @@ <h3 id="annotate">bcftools annotate <em>[OPTIONS]</em> <em>FILE</em></h3>
the ".bed" or ".bed.gz" suffix (case-insensitive), otherwise a tab-delimited file is assumed.
Note that in case of tab-delimited file, the coordinates POS, FROM and TO are
one-based and inclusive. When REF and ALT are present, only matching VCF
records will be annotated.
records will be annotated. If the END coordinate is present in the annotation file
and given on command line as "<code>-c ~INFO/END</code>", then VCF records will be matched also by the INFO/END coordinate.
If ID is present in the annotation file and given as "<code>-c ~ID</code>", then VCF records will be matched
also by the ID column.
&#160;<br>
&#160;<br>
When multiple ALT alleles are present in the annotation file (given as
comma-separated list of alleles), at least one must match one of the
alleles in the corresponding VCF record. Similarly, at least one
alternate allele from a multi-allelic VCF record must be present in the
annotation file.
Missing values can be added by providing "." in place of actual value.
&#160;<br>
&#160;<br>
Missing values can be added by providing "." in place of actual value
and using the missing value modifier with <strong>-c</strong>, such as ".TAG".
&#160;<br>
&#160;<br>
Note that flag types, such as "INFO/FLAG", can be annotated by including
a field with the value "1" to set the flag, "0" to remove it, or "." to
keep existing flags.
Expand Down Expand Up @@ -469,7 +479,7 @@ <h3 id="annotate">bcftools annotate <em>[OPTIONS]</em> <em>FILE</em></h3>
To append to existing values (rather than replacing or leaving untouched), use "=TAG"
(instead of "TAG" or "+TAG").
To replace only existing values without modifying missing annotations, use "-TAG".
To match the record also by ID, in addition to REF and ALT, use "~ID".
To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END".
&#160;<br>
&#160;<br>
If the annotation file is not a VCF/BCF, all new annotations must be
Expand Down Expand Up @@ -4696,7 +4706,7 @@ <h2 id="_copying">COPYING</h2>
</div>
<div id="footer">
<div id="footer-text">
Last updated 2021-08-20 13:44:49 +0100
Last updated 2021-08-23 12:14:58 +0100
</div>
</div>
</body>
Expand Down
16 changes: 13 additions & 3 deletions doc/bcftools.txt
Original file line number Diff line number Diff line change
Expand Up @@ -291,13 +291,23 @@ Add or remove annotations.
the ".bed" or ".bed.gz" suffix (case-insensitive), otherwise a tab-delimited file is assumed.
Note that in case of tab-delimited file, the coordinates POS, FROM and TO are
one-based and inclusive. When REF and ALT are present, only matching VCF
records will be annotated.
records will be annotated. If the END coordinate is present in the annotation file
and given on command line as "`-c ~INFO/END`", then VCF records will be matched also by the INFO/END coordinate.
If ID is present in the annotation file and given as "`-c ~ID`", then VCF records will be matched
also by the ID column.
{nbsp} +
{nbsp} +
When multiple ALT alleles are present in the annotation file (given as
comma-separated list of alleles), at least one must match one of the
alleles in the corresponding VCF record. Similarly, at least one
alternate allele from a multi-allelic VCF record must be present in the
annotation file.
Missing values can be added by providing "." in place of actual value.
{nbsp} +
{nbsp} +
Missing values can be added by providing "." in place of actual value
and using the missing value modifier with *-c*, such as ".TAG".
{nbsp} +
{nbsp} +
Note that flag types, such as "INFO/FLAG", can be annotated by including
a field with the value "1" to set the flag, "0" to remove it, or "." to
keep existing flags.
Expand Down Expand Up @@ -341,7 +351,7 @@ Add or remove annotations.
To append to existing values (rather than replacing or leaving untouched), use "=TAG"
(instead of "TAG" or "+TAG").
To replace only existing values without modifying missing annotations, use "-TAG".
To match the record also by ID, in addition to REF and ALT, use "~ID".
To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END".
{nbsp} +
{nbsp} +
If the annotation file is not a VCF/BCF, all new annotations must be
Expand Down
9 changes: 9 additions & 0 deletions test/annotate25.1.out
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=248956422>
##ALT=<ID=DUP,Description="Duplication">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 dupXXX;dup110;dupYYY G <DUP> . . END=110
1 100 dup120 G <DUP> . . END=120
1 100 dup130 G <DUP> . . END=130
3 changes: 3 additions & 0 deletions test/annotate25.tab
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
1 100 dup130 G <DUP> 130
1 100 dupXXX;dup110;dupYYY G <DUP> 110
1 100 dup120 G <DUP> 120
8 changes: 8 additions & 0 deletions test/annotate25.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
##fileformat=VCFv4.2
##contig=<ID=1,length=248956422>
##ALT=<ID=DUP,Description="Duplication">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 . G <DUP> . . END=110
1 100 . G <DUP> . . END=120
1 100 . G <DUP> . . END=130
35 changes: 35 additions & 0 deletions vcfannotate.c
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,7 @@ typedef struct _args_t
annot_col_t *cols; // column indexes and setters
int ncols;
int match_id; // set iff `-c ~ID` given
int match_end; // set iff `-c ~INFO/END` is given

char *set_ids_fmt;
convert_t *set_ids;
Expand Down Expand Up @@ -2204,6 +2205,19 @@ static void init_columns(args_t *args)
col->hdr_key_dst = strdup(str.s);
if ( replace & MATCH_VALUE ) args->match_id = icol;
}
else if ( !strcasecmp("~INFO/END",str.s) && !args->tgts_is_vcf )
{
replace = MATCH_VALUE;
args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
annot_col_t *col = &args->cols[args->ncols-1];
memset(col,0,sizeof(*col));
col->icol = icol;
col->replace = replace;
col->setter = NULL;
col->hdr_key_src = strdup(str.s);
col->hdr_key_dst = strdup(str.s);
args->match_end = icol;
}
else if ( !strncasecmp("ID:=",str.s,4) ) // transfer a tag from INFO to ID column
{
if ( !args->tgts_is_vcf ) error("The annotation source must be a VCF for \"%s\"\n",str.s);
Expand Down Expand Up @@ -2423,6 +2437,11 @@ static void init_columns(args_t *args)
key_dst = str.s + 5;
explicit_dst_info = 1;
}
else if ( !strcasecmp("~INFO/END",str.s) )
{
key_dst = str.s + 6;
explicit_dst_info = 1;
}
else
key_dst = str.s;
char *key_src = strstr(key_dst,":=");
Expand Down Expand Up @@ -2958,6 +2977,7 @@ static void annotate(args_t *args, bcf1_t *line)
for (j=0; j<args->ncols; j++)
{
if ( args->cols[j].done==1 || args->cols[j].merge_method == MM_FIRST ) continue;
if ( !args->cols[j].setter ) continue;
if ( args->cols[j].setter(args,line,&args->cols[j],NULL) < 0 )
error("fixme: Could not set %s at %s:%"PRId64"\n", args->cols[j].hdr_key_src,bcf_seqname(args->hdr,line),(int64_t) line->pos+1);
}
Expand All @@ -2976,6 +2996,10 @@ static void annotate(args_t *args, bcf1_t *line)
if ( args->nalines >= 0xffff || line->n_allele >= 0xffff )
error("Error: too many alleles or annotation lines in the buffer at %s:%"PRId64" (todo:skip?)\n",bcf_seqname(args->hdr,line),(int64_t) line->pos+1);

kstring_t match_end = {0,0,0};
if ( args->match_end>=0 && bcf_get_info_int32(args->hdr,line,"END",&args->tmpi,&args->mtmpi)==1 )
kputw(args->tmpi[0],&match_end);

// Find matching lines
for (i=0; i<args->nalines; i++)
{
Expand All @@ -2995,6 +3019,7 @@ static void annotate(args_t *args, bcf1_t *line)
ialt++;
}
if ( args->match_id>=0 && !strstr_match(line->d.id,args->alines[i].cols[args->match_id]) ) continue;
if ( match_end.l && strcmp(match_end.s,args->alines[i].cols[args->match_end]) ) continue;
args->srt_alines[args->nsrt_alines++] = (ialt<<16) | i;
has_overlap = 1;
break;
Expand All @@ -3006,6 +3031,9 @@ static void annotate(args_t *args, bcf1_t *line)
has_overlap = 1;
}
}

free(match_end.s);

// Sort lines if needed
if ( args->has_append_mode )
{
Expand Down Expand Up @@ -3034,6 +3062,7 @@ static void annotate(args_t *args, bcf1_t *line)
{
if ( args->cols[j].merge_method != MM_APPEND_MISSING ) continue;
if ( args->cols[j].done==1 ) continue;
if ( !args->cols[j].setter ) continue;
int ret = args->cols[j].setter(args,line,&args->cols[j],args->aline_missing);
if ( ret < 0 )
error("fixme: Could not set missing %s at %s:%"PRId64"\n", args->cols[j].hdr_key_src,bcf_seqname(args->hdr,line),(int64_t) line->pos+1);
Expand All @@ -3046,6 +3075,7 @@ static void annotate(args_t *args, bcf1_t *line)
for (j=0; j<args->ncols; j++)
{
if ( args->cols[j].done==1 ) continue;
if ( !args->cols[j].setter ) continue;
int ret = args->cols[j].setter(args,line,&args->cols[j],&args->alines[ilin]);
if ( ret < 0 )
error("fixme: Could not set %s at %s:%"PRId64"\n", args->cols[j].hdr_key_src,bcf_seqname(args->hdr,line),(int64_t) line->pos+1);
Expand All @@ -3066,6 +3096,7 @@ static void annotate(args_t *args, bcf1_t *line)
{
if ( args->cols[j].merge_method != MM_APPEND_MISSING ) continue;
if ( args->cols[j].done==1 ) continue;
if ( !args->cols[j].setter ) continue;
int ret = args->cols[j].setter(args,line,&args->cols[j],args->aline_missing);
if ( ret < 0 )
error("fixme: Could not set missing %s at %s:%"PRId64"\n", args->cols[j].hdr_key_src,bcf_seqname(args->hdr,line),(int64_t) line->pos+1);
Expand All @@ -3078,6 +3109,7 @@ static void annotate(args_t *args, bcf1_t *line)
for (j=0; j<args->ncols; j++)
{
if ( args->cols[j].done==1 || args->cols[j].merge_method == MM_FIRST ) continue;
if ( !args->cols[j].setter ) continue;
int ret = args->cols[j].setter(args,line,&args->cols[j],NULL);
if ( ret < 0 )
error("fixme: Could not set %s at %s:%"PRId64"\n", args->cols[j].hdr_key_src,bcf_seqname(args->hdr,line),(int64_t) line->pos+1);
Expand All @@ -3090,8 +3122,11 @@ static void annotate(args_t *args, bcf1_t *line)
{
bcf1_t *aline = bcf_sr_get_line(args->files,1);
for (j=0; j<args->ncols; j++)
{
if ( !args->cols[j].setter ) continue;
if ( args->cols[j].setter(args,line,&args->cols[j],aline) )
error("fixme: Could not set %s at %s:%"PRId64"\n", args->cols[j].hdr_key_src,bcf_seqname(args->hdr,line),(int64_t) line->pos+1);
}

has_overlap = 1;
}
Expand Down

0 comments on commit b4195dd

Please sign in to comment.