
query -H format improvement #1856

@janxkoci

The problem

I would like to propose a small tweak to bcftools query with the -H flag, which prints the column names as the first line of output. Currently, the header line begins with "# " (a hash sign followed by a space):

$ bcftools query -H -f '%CHROM\n' input.vcf | head -2
# CHROM
chr1

This can confuse downstream tools that split lines into columns on whitespace. To many standard tools, such as awk, cut, datamash, R, and spreadsheet apps, the header line appears to have one more column than the data lines.

Consider the following example:

$ bcftools query -H -f '%CHROM %POS %REF %ALT\n' input.vcf | awk 'NR < 4 {print "ncol:", NF, "(col $1: "$1")"}'
ncol: 5 (col $1: #)
ncol: 4 (col $1: chr1)
ncol: 4 (col $1: chr1)

In this toy example, the first line has one more column than the others, and the name of the first column is "#" rather than "CHROM", as requested in the query format. This makes the header less useful by default, because it needs additional processing. For example, I often end up piping through sed:

bcftools query -H -f '%CHROM %POS %REF %ALT\n' input.vcf | sed '1s/# //' | awk ...
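
The same cleanup can also be folded into the awk step itself. The following is a minimal sketch, not part of bcftools: query_output is a hypothetical stand-in that imitates the header shown above with invented data rows, so the snippet runs without an input VCF.

```shell
# Hypothetical stand-in for `bcftools query -H ...`; the rows are invented for illustration.
query_output() {
    printf '# CHROM POS REF ALT\n'
    printf 'chr1 100 A G\n'
    printf 'chr1 200 C T\n'
}

# On line 1 only, drop the leading "# "; changing $0 makes awk re-split the fields.
query_output | awk 'NR == 1 { sub(/^# /, "") } { print "ncol:", NF, "(col $1: " $1 ")" }'
```

With the header fixed this way, every line should report 4 columns and the first column of the header should be CHROM.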

The proposed change

Depending on your preference regarding the hash sign in the header, I propose either removing the space following the hash sign, or removing both the hash sign and the space. The first column name would then be either #CHROM or just CHROM, respectively.
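
To make the two options concrete, here is a small sketch that emulates each proposed header with sed. The header string is copied from the toy example above; this only imitates the proposed output and is not how bcftools would implement it.

```shell
# Header as currently printed in the toy example above.
header='# CHROM POS REF ALT'

# Option 1: remove only the space after the hash sign.
option1=$(printf '%s\n' "$header" | sed '1s/^# /#/')

# Option 2: remove both the hash sign and the space.
option2=$(printf '%s\n' "$header" | sed '1s/^# //')

printf '%s\n%s\n' "$option1" "$option2"
```

Either form splits into four whitespace-separated fields, matching the data lines.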
