The problem
I would like to propose a small tweak to bcftools query with the -H flag, which prints header names as the first line of output. Currently, the header line begins with "# " (a hash sign followed by a space):
$ bcftools query -H -f '%CHROM' input.vcf | head -2
# CHROM
chr1
This can confuse many downstream tools that parse the data into columns. To many standard tools, such as awk, cut, datamash, R, and others, including spreadsheet apps, the first line will appear to have one more column than the rest.
Consider the following example:
$ bcftools query -H -f '%CHROM %POS %REF %ALT\n' input.vcf | awk 'NR < 4 {print "ncol:", NF, "(col $1: "$1")"}'
ncol: 5 (col $1: #)
ncol: 4 (col $1: chr1)
ncol: 4 (col $1: chr1)
In this toy example, the first line has one more column than the data lines, and the name of the first column is "#" rather than "CHROM", as requested in the query format. This makes the header less useful by default and requires additional processing. For example, I often end up piping the output through sed:
bcftools query -H -f '%CHROM %POS %REF %ALT\n' input.vcf | sed '1s/# //' | awk ...
The proposed change
Depending on your preference regarding the hash sign in the header, I propose removing either only the space following the hash sign, or both the hash sign and the space. The first column would then be named either #CHROM or just CHROM, respectively.
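To make the two options concrete, here is a minimal sketch that simulates both header styles on mock data and counts the columns the same way as the awk example above. The function names (mock_query, option_hash, option_plain, count_cols) and the three data lines are hypothetical and only stand in for real bcftools output; no bcftools behavior is changed or assumed beyond the "# " prefix described in this issue.

```shell
# Mock of the current header style described above (hypothetical data,
# not produced by bcftools itself).
mock_query() {
  printf '# CHROM POS REF ALT\n'
  printf 'chr1 100 A G\n'
  printf 'chr1 200 C T\n'
}

# Option 1: remove only the space after the hash sign, giving "#CHROM".
option_hash() { mock_query | sed '1s/^# /#/'; }

# Option 2: remove both the hash sign and the space, giving "CHROM".
option_plain() { mock_query | sed '1s/^# //'; }

# Report the number of whitespace-separated fields and the first field.
count_cols() { awk '{print "ncol:", NF, "(col $1: "$1")"}'; }

echo "Option 1 (#CHROM):"
option_hash | count_cols
echo "Option 2 (CHROM):"
option_plain | count_cols
```

With either option, the header line reports 4 columns, matching the data lines; the only difference is whether the first column name is #CHROM or CHROM.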