Annotate VCF files at millions of variants per minute. Saturates pigz/gunzip on a 4-core CPU.
go get github.com/akotlar/bystro-vcf && go install $_;
pigz -p 1 -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo | pigz -c - > output
Performs several important functions:
- Splits multiallelics and MNP alleles, keeping track of each allele's index with respect to the original alleles for downstream INFO property segregation
- Performs QC on variants: checks whether allele contains ACTG, that padding bases match reference, and more
- Allows filtering of variants by any number of FILTER properties (by default allows PASS/. variants)
- Normalizes indel representations by removing padding, left shifting alleles to their parsimonious representations
- Calculates whether site is transition, transversion, or neither
- Processes all available samples
  - Calculates homozygosity, heterozygosity, missingness
  - Labels samples as homozygous, heterozygous, or missing
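To illustrate the transition/transversion step from the list above, here is a minimal sketch in Go. It is not taken from the bystro-vcf source; the function name `classifySite` and the mapping of the integer codes (0 = neither, 1 = transition, 2 = transversion) are assumptions made for illustration only.

```go
package main

import "fmt"

// Hypothetical codes for illustration; the actual meaning of 0/1/2
// in bystro-vcf's trTv column is an assumption here.
const (
	siteNeither      = 0
	siteTransition   = 1
	siteTransversion = 2
)

// isPurine reports whether b is A or G.
func isPurine(b byte) bool { return b == 'A' || b == 'G' }

// isPyrimidine reports whether b is C or T.
func isPyrimidine(b byte) bool { return b == 'C' || b == 'T' }

// classifySite classifies one ref/alt pair. Anything that is not a
// single-base substitution between A, C, G, T is reported as "neither".
func classifySite(ref, alt string) int {
	if len(ref) != 1 || len(alt) != 1 || ref == alt {
		return siteNeither
	}
	r, a := ref[0], alt[0]
	switch {
	case isPurine(r) && isPurine(a), isPyrimidine(r) && isPyrimidine(a):
		// purine <-> purine or pyrimidine <-> pyrimidine
		return siteTransition
	case (isPurine(r) && isPyrimidine(a)) || (isPyrimidine(r) && isPurine(a)):
		// purine <-> pyrimidine
		return siteTransversion
	default:
		// at least one base is not A, C, G, or T
		return siteNeither
	}
}

func main() {
	fmt.Println(classifySite("A", "G"))  // transition
	fmt.Println(classifySite("A", "C"))  // transversion
	fmt.Println(classifySite("AT", "A")) // neither (deletion)
}
```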
bystro-vcf is used to pre-process VCF files for Bystro (GitHub)
If you use bystro-vcf please cite https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1387-3
Millions of variants/rows per minute. Performance depends on the number of samples.
Ex:
Amazon i3.2xlarge (4 core), 1K Genomes Phase 3 (2,504 samples): chromosome 1 (6.2M variants) in ~2 minutes 45s
- Runs at roughly the pigz -p 1 streaming decompression limit (97% CPU, 2% sys post-Meltdown/Spectre).
==> ( time pigz -d -c -p 1 ../../../mnt/annotator/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | bystro-vcf &> /dev/null ; )
real 2m45.134s
user 16m20.512s
sys 0m25.940s
go get github.com/akotlar/bystro-vcf && go install $_;
Via pipe:
pigz -d -c in.vcf.gz | bystro-vcf --keepId --keepInfo --allowFilter "PASS,." | pigz -c - > out.gz
Via the --in argument (inPath):
bystro-vcf --in in.vcf --keepId --keepInfo --allowFilter "PASS,." > out
chrom <String> pos <Int> type <String[SNP|DEL|INS|MULTIALLELIC]> ref <String> alt <String> trTv <Int[0|1|2]> heterozygotes <String> heterozygosity <Float64> homozygotes <String> homozygosity <Float64> missingGenos <String> missingness <Float64> sampleMaf <Float64> id <String?> alleleIndex <Int?> info <String?>
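The heterozygotes, homozygotes, and missingGenos columns, and their matching rate columns, summarize the per-sample genotypes at a site. The sketch below shows one way such values could be derived; it is not bystro-vcf's code. Assumptions: each genotype is a bare GT string such as `0/1` or `1|1`, the allele of interest is `1`, heterozygosity and homozygosity are computed over called (non-missing) samples, and missingness over all samples. The type `siteStats` and function `summarize` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// siteStats is a hypothetical container, not a bystro-vcf type.
type siteStats struct {
	Heterozygotes  []string
	Homozygotes    []string
	MissingGenos   []string
	Heterozygosity float64
	Homozygosity   float64
	Missingness    float64
}

// summarize labels each sample as heterozygous, homozygous, or missing
// for alt allele "1", then computes the three rates.
// It assumes samples and genotypes have the same length.
func summarize(samples, genotypes []string) siteStats {
	var s siteStats
	for i, gt := range genotypes {
		// Split on both unphased ("/") and phased ("|") separators.
		alleles := strings.FieldsFunc(gt, func(r rune) bool { return r == '/' || r == '|' })
		alt, missing := 0, len(alleles) == 0
		for _, a := range alleles {
			switch a {
			case ".":
				missing = true
			case "1":
				alt++
			}
		}
		switch {
		case missing:
			s.MissingGenos = append(s.MissingGenos, samples[i])
		case alt == len(alleles):
			s.Homozygotes = append(s.Homozygotes, samples[i])
		case alt > 0:
			s.Heterozygotes = append(s.Heterozygotes, samples[i])
		}
	}
	total := len(genotypes)
	called := total - len(s.MissingGenos)
	if called > 0 { // assumed denominator: called samples only
		s.Heterozygosity = float64(len(s.Heterozygotes)) / float64(called)
		s.Homozygosity = float64(len(s.Homozygotes)) / float64(called)
	}
	if total > 0 { // assumed denominator: all samples
		s.Missingness = float64(len(s.MissingGenos)) / float64(total)
	}
	return s
}

func main() {
	s := summarize(
		[]string{"s1", "s2", "s3", "s4"},
		[]string{"0/1", "1|1", "./.", "0/0"},
	)
	fmt.Printf("%+v\n", s)
}
```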
--keepId <Bool>
Retain the "ID" field in the output.
--keepInfo <Bool>
Retain the "INFO" field in the output.
- Since we decompose multiallelics, an "alleleIdx" field is added to the output. It contains the 0-based index of that allele in the multiallelic
- This is necessary for downstream programs to decompose the INFO field per-allele
Results in 2 output fields, following missingGenos (or id, should --keepId be set):
- alleleIdx will contain the index of the allele in a split multiallelic. 0 by default.
- info will contain the entire INFO string
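As an illustration of why the allele index is useful downstream, the following sketch picks the per-allele value out of an INFO string. It is not part of bystro-vcf. It assumes the INFO key holds one comma-separated value per ALT allele (a `Number=A` field in VCF terms) and that the index is 0-based as described above; `perAlleleInfo` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// perAlleleInfo returns the value of key that belongs to the ALT allele at
// alleleIdx. It assumes key carries one comma-separated value per ALT allele.
// The second return value is false if the key is absent, has no value,
// or has no entry at alleleIdx.
func perAlleleInfo(info, key string, alleleIdx int) (string, bool) {
	for _, kv := range strings.Split(info, ";") {
		name, val, ok := strings.Cut(kv, "=")
		if !ok || name != key {
			continue
		}
		vals := strings.Split(val, ",")
		if alleleIdx < 0 || alleleIdx >= len(vals) {
			return "", false
		}
		return vals[alleleIdx], true
	}
	return "", false
}

func main() {
	// INFO from an illustrative multiallelic site with two ALT alleles.
	info := "AC=10,20;AF=0.1,0.2;DP=100"

	// The second ALT allele was split into its own row with index 1.
	if af, ok := perAlleleInfo(info, "AF", 1); ok {
		fmt.Println("AF for allele 1:", af)
	}
}
```

Fields that are not per-allele (such as DP above) would need different handling, for example returning the whole value regardless of the index.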
--allowFilter <String>
Which FILTER values to keep. Comma separated. Defaults to "PASS,.".
If passed "" (empty string) or "*" (wildcard), all FILTER values are allowed.
- Similar to bcftools -f, --apply-filters LIST (https://samtools.github.io/bcftools/bcftools.html)
--excludeFilter <String>
Which FILTER values to exclude. Comma separated. Defaults to "".
- Opposite of bcftools -f, --apply-filters LIST (https://samtools.github.io/bcftools/bcftools.html)
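The allow and exclude semantics described above can be sketched as follows. This is an illustration, not bystro-vcf's implementation. Assumptions: a record's FILTER column may hold several `;`-separated values, a record is kept only if every value is allowed and none is excluded, and exclusion takes precedence over the allow list. `filterSet`, `newFilterSet`, and `keep` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// filterSet is a hypothetical holder for the parsed filter arguments.
type filterSet struct {
	allowAll bool
	allow    map[string]bool
	exclude  map[string]bool
}

// newFilterSet parses comma-separated allow and exclude lists.
// An allow list of "" or "*" allows every FILTER value.
func newFilterSet(allowArg, excludeArg string) filterSet {
	fs := filterSet{allow: map[string]bool{}, exclude: map[string]bool{}}
	if allowArg == "" || allowArg == "*" {
		fs.allowAll = true
	} else {
		for _, v := range strings.Split(allowArg, ",") {
			fs.allow[strings.TrimSpace(v)] = true
		}
	}
	if excludeArg != "" {
		for _, v := range strings.Split(excludeArg, ",") {
			fs.exclude[strings.TrimSpace(v)] = true
		}
	}
	return fs
}

// keep reports whether a record with the given FILTER column passes.
// Assumption: every ";"-separated value must be allowed and not excluded.
func (fs filterSet) keep(filterCol string) bool {
	for _, v := range strings.Split(filterCol, ";") {
		if fs.exclude[v] {
			return false
		}
		if !fs.allowAll && !fs.allow[v] {
			return false
		}
	}
	return true
}

func main() {
	defaults := newFilterSet("PASS,.", "")
	fmt.Println(defaults.keep("PASS"))    // kept under the default allow list
	fmt.Println(defaults.keep("LowQual")) // dropped under the default allow list

	wildcard := newFilterSet("*", "LowQual")
	fmt.Println(wildcard.keep("q10"))     // kept: wildcard allows everything
	fmt.Println(wildcard.keep("LowQual")) // dropped: explicitly excluded
}
```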
--in /path/to/uncompressedFile.vcf
An input file path, to an uncompressed VCF file. Defaults to stdin
--out <String>
Send the output here instead of STDOUT
--err /path/to/log.txt
Where to store log messages. Defaults to stderr
--emptyField "!"
Which value to assign to missing data. Defaults to !
--fieldDelimiter ";"
Which delimiter to use when joining multiple values. Defaults to ;
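A consumer of the output needs to honor both of these settings when reading multi-valued columns such as heterozygotes. The sketch below shows one way to do that, assuming the default values `!` and `;` documented above; `splitField` is a hypothetical helper, not part of bystro-vcf.

```go
package main

import (
	"fmt"
	"strings"
)

// splitField turns one output column into its individual values.
// A column equal to emptyField (or an empty string) yields no values.
func splitField(field, emptyField, delim string) []string {
	if field == "" || field == emptyField {
		return nil
	}
	return strings.Split(field, delim)
}

func main() {
	// Illustrative column values; the sample names are made up.
	hets := splitField("NA00001;NA00002", "!", ";")
	none := splitField("!", "!", ";")

	fmt.Println(len(hets), hets) // two sample names
	fmt.Println(len(none))       // no values for a missing column
}
```

If bystro-vcf is run with a different --emptyField or --fieldDelimiter, the same values would need to be passed to the reader.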