// bcftools.txt -- asciidoc template for the bcftools man page and html.
//
// Please do not modify bcftools.1 or bcftools.html directly,
// edit this file and convert using the following commands:
//
// make docs
//
// or
//
// a2x --doctype manpage --format manpage bcftools.txt
// a2x --doctype manpage --format xhtml bcftools.txt
//
// Contributions are welcome, simply edit this file and send
// a pull request or email the modified file directly.
//
bcftools(1)
===========
:doctype: manpage
NAME
----
bcftools - utilities for variant calling and manipulating VCFs and BCFs.
SYNOPSIS
--------
*bcftools* [--version|--version-only] [--help] ['COMMAND'] ['OPTIONS']
DESCRIPTION
-----------
BCFtools is a set of utilities that manipulate variant calls in the Variant
Call Format (VCF) and its binary counterpart BCF. All commands work
transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.
Most commands accept VCF, bgzipped VCF and BCF with filetype detected
automatically even when streaming from a pipe. Indexed VCF and BCF
will work in all situations. Un-indexed VCF and BCF and streams will
work in most, but not all situations. In general, whenever multiple VCFs are
read simultaneously, they must be indexed and therefore also compressed.
(Note that files with non-standard index names can be accessed as e.g.
"`bcftools view -r X:2928329 file.vcf.gz##idx##non-standard-index-name`".)
BCFtools is designed to work on a stream. It regards an input file "-" as the
standard input (stdin) and outputs to the standard output (stdout). Several
commands can thus be combined with Unix pipes.
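As an illustration of streaming, the following sketch builds a tiny VCF and pipes one command into another. The scratch directory, file name and filter threshold are made up for the example, and the pipeline is run only where *bcftools* is installed.

```shell
# Hypothetical scratch directory and file name, for illustration only.
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=1,length=1000>\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\t.\tA\tG\t50\tPASS\t.\n'
    printf '1\t200\t.\tC\tT\t5\tPASS\t.\n'
} > "$workdir/in.vcf"

# Run the pipeline only where bcftools is installed.
if command -v bcftools >/dev/null 2>&1; then
    # -Ou passes uncompressed BCF between the two commands; "-" reads stdin.
    bcftools view -Ou -i 'QUAL>=20' "$workdir/in.vcf" |
        bcftools query -f '%CHROM\t%POS\t%QUAL\n' -
fi
```

The first command writes to standard output and the second reads "-" as standard input, so no intermediate file is created.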
=== VERSION
This manual page was last updated *{date}* and refers to bcftools git version *{version}*.
=== BCF1
The obsolete BCF1 format output by versions of samtools \<= 0.1.19 is *not*
compatible with this version of bcftools. To read BCF1 files one can use
the view command from old versions of bcftools packaged with samtools
versions \<= 0.1.19 to convert to VCF, which can then be read by
this version of bcftools.
----
samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view
----
=== VARIANT CALLING
See 'bcftools call' for variant calling from the output of the
'samtools mpileup' command. In versions of samtools \<= 0.1.19 calling was
done with 'bcftools view'. Users are now required to choose between the old
samtools calling model ('-c/--consensus-caller') and the new multiallelic
calling model ('-m/--multiallelic-caller'). The multiallelic calling model
is recommended for most tasks.
=== FILTERING EXPRESSIONS
See *<<expressions,EXPRESSIONS>>*
LIST OF COMMANDS
----------------
For a full list of available commands, run *bcftools* without arguments. For a full
list of available options, run *bcftools* 'COMMAND' without arguments.
- *<<annotate,annotate>>* .. edit VCF files, add or remove annotations
- *<<call,call>>* .. SNP/indel calling (former "view")
- *<<cnv,cnv>>* .. Copy Number Variation caller
- *<<concat,concat>>* .. concatenate VCF/BCF files from the same set of samples
- *<<consensus,consensus>>* .. create consensus sequence by applying VCF variants
- *<<convert,convert>>* .. convert VCF/BCF to other formats and back
- *<<csq,csq>>* .. haplotype aware consequence caller
- *<<filter,filter>>* .. filter VCF/BCF files using fixed thresholds
- *<<gtcheck,gtcheck>>* .. check sample concordance, detect sample swaps and contamination
- *<<head,head>>* .. view VCF/BCF file headers
- *<<index,index>>* .. index VCF/BCF
- *<<isec,isec>>* .. intersections of VCF/BCF files
- *<<merge,merge>>* .. merge VCF/BCF files from non-overlapping sample sets
- *<<mpileup,mpileup>>* .. multi-way pileup producing genotype likelihoods
- *<<norm,norm>>* .. normalize indels
- *<<plugin,plugin>>* .. run user-defined plugin
- *<<polysomy,polysomy>>* .. detect contaminations and whole-chromosome aberrations
- *<<query,query>>* .. transform VCF/BCF into user-defined formats
- *<<reheader,reheader>>* .. modify VCF/BCF header, change sample names
- *<<roh,roh>>* .. identify runs of homo/auto-zygosity
- *<<sort,sort>>* .. sort VCF/BCF files
- *<<stats,stats>>* .. produce VCF/BCF stats (former vcfcheck)
- *<<view,view>>* .. subset, filter and convert VCF and BCF files
LIST OF SCRIPTS
---------------
Some helper scripts are bundled with the bcftools code.
- *<<gff2gff,gff2gff>>* .. converts a GFF file to the format required by *<<csq,csq>>*
- *<<plot-vcfstats,plot-vcfstats>>* .. plots the output of *<<stats,stats>>*
COMMANDS AND OPTIONS
--------------------
[[common_options]]
=== Common Options
The following options are common to many bcftools commands. See usage for
specific commands to see if they apply.
'FILE'::
Files can be either VCF or BCF, uncompressed or BGZF-compressed. The file "-"
is interpreted as standard input. Some tools may require tabix- or
CSI-indexed files.
*-c, --collapse* 'snps'|'indels'|'both'|'all'|'some'|'none'|'id'::
Controls how to treat records with duplicate positions and defines compatible
records across multiple input files. Here by "compatible" we mean records which
should be considered as identical by the tools. For example, when performing
line intersections, the desire may be to consider as identical all sites with
matching positions (*bcftools isec -c* 'all'), or only sites with matching variant
type (*bcftools isec -c* 'snps'{nbsp} *-c* 'indels'), or only sites with all alleles
identical (*bcftools isec -c* 'none').
'none';;
only records with identical REF and ALT alleles are compatible
'some';;
only records where some subset of ALT alleles match are compatible
'all';;
all records are compatible, regardless of whether the ALT alleles
match or not. In the case of records with the same position, only
the first will be considered and appear on output.
'snps';;
any SNP records are compatible, regardless of whether the ALT
alleles match or not. For duplicate positions, only the first SNP
record will be considered and appear on output.
'indels';;
all indel records are compatible, regardless of whether the REF
and ALT alleles match or not. For duplicate positions, only the
first indel record will be considered and appear on output.
'both';;
abbreviation of "*-c* 'indels'{nbsp} *-c* 'snps'"
'id';;
only records with identical ID column are compatible.
Supported by *<<merge,bcftools merge>>* only.
*-f, --apply-filters* 'LIST'::
Skip sites where FILTER column does not contain any of the strings listed
in 'LIST'. For example, to include only sites which have no filters set,
use *-f* '.,PASS'.
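The selection rule can be emulated with awk on a small file. This is a sketch under stated assumptions, not the bcftools implementation: the input file is hypothetical, and a FILTER column holding several values is assumed to be semicolon-separated as in the VCF format. With bcftools itself the equivalent would be *bcftools view -f* '.,PASS'.

```shell
# Hypothetical scratch directory and input, for illustration only.
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\t.\tA\tG\t50\tPASS\t.\n'
    printf '1\t200\t.\tC\tT\t50\t.\t.\n'
    printf '1\t300\t.\tG\tA\t5\tLowQual\t.\n'
} > "$workdir/in.vcf"

# Keep header lines, and records whose FILTER (column 7) holds any listed string.
awk -F '\t' -v list='.,PASS' '
    BEGIN { n = split(list, want, ","); for (i = 1; i <= n; i++) keep[want[i]] = 1 }
    /^#/  { print; next }
    { m = split($7, f, ";"); for (i = 1; i <= m; i++) if (f[i] in keep) { print; next } }
' "$workdir/in.vcf" > "$workdir/kept.vcf"

# Number of records that survive the filter
grep -vc '^#' "$workdir/kept.vcf"
```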
*--no-version*::
Do not append version and command line information to the output VCF header.
*-o, --output* 'FILE'::
When output consists of a single stream, write it to 'FILE' rather than
to standard output, where it is written by default.
The file type is determined automatically from the file name suffix and in
case a conflicting *-O* option is given, the file name suffix takes precedence.
*-O, --output-type* 'b'|'u'|'z'|'v'[0-9]::
Output compressed BCF ('b'), uncompressed BCF ('u'), compressed VCF ('z'), uncompressed VCF ('v').
Use the -Ou option when piping between bcftools subcommands to speed up
performance by removing unnecessary compression/decompression and
VCF<-->BCF conversion.
{nbsp}
The compression level of the compressed formats ('b' and 'z') can be set by
appending a number between 0-9.
*-r, --regions* 'chr'|'chr:pos'|'chr:beg-end'|'chr:beg-'[,...]::
Comma-separated list of regions, see also *-R, --regions-file*. Overlapping
records are matched even when the starting coordinate is outside of the
region, unlike the *-t/-T* options where only the POS coordinate is checked.
Note that *-r* cannot be used in combination with *-R*.
*-R, --regions-file* 'FILE'::
Regions can be specified either on the command line or in a VCF, BED, or
tab-delimited file (the default). The columns of the tab-delimited file
can contain either positions (two-column format: CHROM, POS) or intervals
(three-column format: CHROM, BEG, END), but not both. Positions are 1-based
and inclusive. The columns of the tab-delimited BED file are also
CHROM, POS and END (trailing columns are ignored), but coordinates
are 0-based, half-open. To indicate that a file be treated as BED rather
than the 1-based tab-delimited file, the file must have the ".bed" or
".bed.gz" suffix (case-insensitive). Uncompressed files are stored in
memory, while bgzip-compressed and tabix-indexed region files are streamed.
Note that sequence names must match exactly, "chr20" is not the same as
"20". Also note that chromosome ordering in 'FILE' will be respected,
the VCF will be processed in the order in which chromosomes first appear
in 'FILE'. However, within chromosomes, the VCF will always be
processed in ascending genomic coordinate order no matter what order they
appear in 'FILE'. Note that overlapping regions in 'FILE' can result in
duplicated out of order positions in the output.
This option requires indexed VCF/BCF files. Note that *-R* cannot be used
in combination with *-r*.
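The two coordinate conventions differ only in the start column. A minimal sketch, assuming a hypothetical three-column BED file, of converting BED intervals to the 1-based inclusive form used by the plain tab-delimited format:

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)
printf 'chr1\t100\t200\nchr2\t0\t50\n' > "$workdir/regions.bed"

# BED is 0-based, half-open: [start, end). The tab-delimited format is
# 1-based, inclusive: [start+1, end]. Only the start column changes.
awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }' \
    "$workdir/regions.bed" > "$workdir/regions.tsv"

cat "$workdir/regions.tsv"
```

Because the output file name does not end in ".bed" or ".bed.gz", it would be read as a 1-based tab-delimited file.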
*--regions-overlap* 'pos'|'record'|'variant'|'0'|'1'|'2'::
This option controls how overlapping records are determined:
set to *pos* or *0* if the VCF record has to have POS inside a region
(this corresponds to the default behavior of *-t/-T*);
set to *record* or *1* if also overlapping records with POS outside a region
should be included (this is the default behavior of *-r/-R*, and includes indels
with POS at the end of a region, which are technically outside the region); or set
to *variant* or *2* to include only true overlapping variation (compare
the full VCF representation "`TA>T-`" vs the true sequence variation "`A>-`").
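The three modes can be compared on a deletion such as the one above. The following is a sketch only, not the bcftools implementation: the `overlaps` function is hypothetical and handles just SNPs and deletions written with a single leading anchor base.

```shell
# overlaps MODE POS REF_LEN BEG END
# Sketch only: assumes a SNP (REF_LEN=1) or a deletion written with one
# leading anchor base (e.g. TA>T), so the true variation starts at POS+1.
overlaps() {
    mode=$1; pos=$2; ref_len=$3; beg=$4; end=$5
    rec_end=$((pos + ref_len - 1))        # last reference base of the record
    if [ "$ref_len" -gt 1 ]; then var_beg=$((pos + 1)); else var_beg=$pos; fi
    case $mode in
        pos|0)     [ "$pos" -ge "$beg" ] && [ "$pos" -le "$end" ] ;;
        record|1)  [ "$pos" -le "$end" ] && [ "$rec_end" -ge "$beg" ] ;;
        variant|2) [ "$var_beg" -le "$end" ] && [ "$rec_end" -ge "$beg" ] ;;
        *)         return 2 ;;
    esac
}

# Deletion TA>T at POS=100, region 95-100: POS is the last base of the region.
for m in pos record variant; do
    if overlaps "$m" 100 2 95 100; then echo "$m: included"; else echo "$m: skipped"; fi
done
```

Here the record is kept in the 'pos' and 'record' modes, but skipped in the 'variant' mode because the deleted base lies just past the end of the region.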
*-s, --samples* \[^]'LIST'::
Comma-separated list of samples to include or exclude if prefixed
with "^". (Note that when multiple samples are to be excluded,
the "^" prefix is still present only once, e.g. "^SAMPLE1,SAMPLE2".)
The sample order is updated to reflect that given on the command line.
Note that in general tags such as INFO/AC, INFO/AN, etc are not updated
to correspond to the subset samples. *<<view,bcftools view>>* is the
exception where some tags will be updated (unless the *-I, --no-update*
option is used; see *<<view,bcftools view>>* documentation). To use updated
tags for the subset in another command one can pipe from *view* into
that command. For example:
----
bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'
----
*-S, --samples-file* \[^]'FILE'[[samples_file]]::
File of sample names to include or exclude if prefixed with "^".
One sample per line. See also the note above for the *-s, --samples*
option.
The sample order is updated to reflect that given in the input file.
The command *<<call,bcftools call>>* accepts an optional second
column indicating ploidy (0, 1 or 2) or sex (as defined by
*<<ploidy,--ploidy>>*, for example "F" or "M"), for example:
----
sample1 1
sample2 2
sample3 2
----
or
----
sample1 M
sample2 F
sample3 F
----
If the second column is not present, the sex "F" is assumed.
With *<<call,bcftools call>> -C* 'trio', a PED file is expected.
The program ignores the first column and the last indicates sex (1=male, 2=female), for example:
----
ignored_column daughterA fatherA motherA 2
ignored_column sonB fatherB motherB 1
----
*-t, --targets* \[^]'chr'|'chr:pos'|'chr:from-to'|'chr:from-'[,...]::
Similar to *-r, --regions*, but the next position is accessed by streaming the
whole VCF/BCF rather than using the tbi/csi index. Both *-r* and *-t* options
can be applied simultaneously: *-r* uses the index to jump to a region
and *-t* discards positions which are not in the targets. Unlike *-r*, targets
can be prefixed with "^" to request logical complement. For example, "^X,Y,MT"
indicates that sequences X, Y and MT should be skipped.
Yet another difference between the *-t/-T* and *-r/-R* is that *-r/-R* checks for
proper overlaps and considers both POS and the end position of an indel, while *-t/-T*
considers the POS coordinate only (by default; see also *--regions-overlap* and *--targets-overlap*).
Note that *-t* cannot be used in combination with *-T*.
*-T, --targets-file* \[^]'FILE'::
Same as *-t, --targets*, but reads regions from a file. Note that *-T*
cannot be used in combination with *-t*.
+
With the *call -C* 'alleles' command, the third column of the targets file must
be a comma-separated list of alleles, starting with the reference allele.
Note that the file must be compressed and indexed.
Such a file can be easily created from a VCF using:
----
bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz
----
*--targets-overlap* 'pos'|'record'|'variant'|'0'|'1'|'2'::
Same as *--regions-overlap* but for *-t/-T*.
*--threads* 'INT'::
Use multithreading with 'INT' worker threads. The option is currently used only for the compression of the
output stream, only when '--output-type' is 'b' or 'z'. Default: 0.
*--write-index*::
Automatically index the output files. Can be used only for compressed BCF and VCF output.
[[annotate]]
=== bcftools annotate '[OPTIONS]' 'FILE'
Add or remove annotations.
*-a, --annotations* 'file'::
Bgzip-compressed and tabix-indexed file with annotations. The file
can be VCF, BED, or a tab-delimited file with mandatory columns CHROM, POS
(or, alternatively, FROM and TO), optional columns REF and ALT, and arbitrary
number of annotation columns. BED files are expected to have
the ".bed" or ".bed.gz" suffix (case-insensitive), otherwise a tab-delimited file is assumed.
Note that in case of tab-delimited file, the coordinates POS, FROM and TO are
one-based and inclusive. When REF and ALT are present, only matching VCF
records will be annotated. If the END coordinate is present in the annotation file
and given on command line as "`-c ~INFO/END`", then VCF records will be matched also by the INFO/END coordinate.
If ID is present in the annotation file and given as "`-c ~ID`", then VCF records will be matched
also by the ID column.
{nbsp} +
{nbsp} +
When multiple ALT alleles are present in the annotation file (given as
comma-separated list of alleles), at least one must match one of the
alleles in the corresponding VCF record. Similarly, at least one
alternate allele from a multi-allelic VCF record must be present in the
annotation file.
{nbsp} +
{nbsp} +
Missing values can be added by providing "." in place of actual value
and using the missing value modifier with *-c*, such as ".TAG".
{nbsp} +
{nbsp} +
Note that flag types, such as "INFO/FLAG", can be annotated by including
a field with the value "1" to set the flag, "0" to remove it, or "." to
keep existing flags.
See also *-c, --columns* and *-h, --header-lines*.
----
# Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
1 752566 SomeString 5
1 798959 SomeOtherString 6
----
*-c, --columns* 'list'::
Comma-separated list of columns or tags to carry over from the annotation file
(see also *-a, --annotations*). If the annotation file is not a VCF/BCF,
'list' describes the columns of the annotation file and must include CHROM,
POS (or, alternatively, FROM and TO), and optionally REF and ALT. Unused
columns which should be ignored can be indicated by "-".
{nbsp} +
{nbsp} +
If the annotation file is a VCF/BCF, only the edited columns/tags must be present and their
order does not matter. The columns ID, QUAL, FILTER, INFO and FORMAT
can be edited, where INFO tags can be written both as "INFO/TAG" or simply "TAG",
and FORMAT tags can be written as "FORMAT/TAG" or "FMT/TAG".
The imported VCF annotations can be renamed as "DST_TAG:=SRC_TAG" or "FMT/DST_TAG:=FMT/SRC_TAG".
{nbsp} +
{nbsp} +
To carry over all INFO annotations, use "INFO". To add all INFO annotations except
"TAG", use "^INFO/TAG". By default, existing values are replaced.
{nbsp} +
{nbsp} +
By default, existing tags are overwritten unless the source value is a missing value (i.e. ".").
If missing values should also be carried over (and overwrite existing tags), use ".TAG" instead of "TAG".
To add annotations without overwriting existing values (that is, to add tags that are absent or
to add values to existing tags with missing values), use "+TAG" instead of "TAG". These can be combined,
for example ".+TAG" can be used to add TAG even if the source value is missing but only if TAG does not
exist in the target file; existing tags will not be overwritten.
To append to existing values (rather than replacing or leaving untouched), use "=TAG"
(instead of "TAG" or "+TAG").
To replace only existing values without modifying missing annotations, use "-TAG".
To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END".
If position needs to be replaced, mark the column with the new position as "~POS".
{nbsp} +
{nbsp} +
If the annotation file is not a VCF/BCF, all new annotations must be
defined via *-h, --header-lines*.
{nbsp} +
{nbsp} +
See also the *-l, --merge-logic* option.
*-C, --columns-file* 'file'::
Read the list of columns from a file (normally given via the *-c, --columns* option).
Use "-" to skip a column of the annotation file.
One column name per row, an additional space- or tab-separated field can
be present to indicate the merge logic (normally given via the *-l, --merge-logic* option).
This is useful when many annotations are added at once.
*-e, --exclude* 'EXPRESSION'::
exclude sites for which 'EXPRESSION' is true. For valid expressions see
*<<expressions,EXPRESSIONS>>*.
*--force*::
continue even when parsing errors, such as undefined tags, are encountered. Note
this can be an unsafe operation and can result in corrupted BCF files. If this
option is used, make sure to sanity check the result thoroughly.
*-h, --header-lines* 'file'::
Lines to append to the VCF header, see also *-c, --columns* and *-a, --annotations*. For example:
----
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
----
*-I, --set-id* \[+]'FORMAT'::
assign ID on the fly. The format is the same as in the *<<query,query>>*
command (see below). By default all existing IDs are replaced. If the
format string is preceded by "+", only missing IDs will be set. For example,
one can use
----
bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf
----
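The effect of the "+" prefix can be emulated outside bcftools. The awk sketch below is only an illustration and assumes a hypothetical headerless file holding the first eight VCF columns; it fills in missing IDs as CHROM_POS_REF_FIRSTALT and leaves existing IDs untouched.

```shell
# Hypothetical scratch directory and input, for illustration only.
workdir=$(mktemp -d)
printf '1\t100\t.\tA\tG,T\t50\tPASS\t.\n1\t200\trs1\tC\tT\t50\tPASS\t.\n' \
    > "$workdir/body.tsv"

# Set the ID (column 3) only where it is missing ("."), using the first
# of the comma-separated ALT alleles (column 5).
awk 'BEGIN { FS = OFS = "\t" }
     $3 == "." { split($5, alt, ","); $3 = $1 "_" $2 "_" $4 "_" alt[1] }
     { print }' "$workdir/body.tsv" > "$workdir/with_ids.tsv"

cut -f1-5 "$workdir/with_ids.tsv"
```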
*-i, --include* 'EXPRESSION'::
include only sites for which 'EXPRESSION' is true. For valid expressions see
*<<expressions,EXPRESSIONS>>*.
*-k, --keep-sites*::
keep sites which do not pass *-i* and *-e* expressions instead of discarding them
*-l, --merge-logic* 'tag:first'|'append'|'append-missing'|'unique'|'sum'|'avg'|'min'|'max'[,...]::
When multiple regions overlap a single record, this option defines how to treat multiple
annotation values when setting 'tag' in the destination file: use the first encountered value ignoring
the rest ('first'); append allowing duplicates ('append'); append even if the appended value is missing,
i.e. is a dot ('append-missing'); append discarding duplicate values ('unique');
sum the values ('sum', numeric fields only); average the values ('avg'); use the minimum value ('min') or
the maximum ('max').
+
Note that this option is intended for use with BED or TAB-delimited annotation files only. Moreover,
it is effective only when either 'REF' and 'ALT' or 'BEG' and 'END' *--columns* are present.
+
Multiple rules can be given either as a comma-separated list or giving the option multiple times.
This is an experimental feature.
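The reductions named above can be sketched in a few lines of awk. This is an illustration only: the `merge_values` function is hypothetical, the comma-joined output is an assumption, and missing values (".") and the 'append-missing' rule are not handled.

```shell
# merge_values LOGIC VALUE...
# Sketch of how several overlapping annotation values could be reduced to one.
# Missing values (".") are not handled here.
merge_values() {
    logic=$1; shift
    printf '%s\n' "$@" | awk -v logic="$logic" '
        { v[NR] = $0 }
        END {
            if (logic == "first") { print v[1]; exit }
            out = ""; sum = 0; min = v[1] + 0; max = v[1] + 0
            for (i = 1; i <= NR; i++) {
                sum += v[i]
                if (v[i] + 0 < min) min = v[i] + 0
                if (v[i] + 0 > max) max = v[i] + 0
                # "unique" drops values that were already seen
                if (logic == "unique" && (v[i] in seen)) continue
                seen[v[i]] = 1
                out = (out == "") ? v[i] : (out "," v[i])
            }
            if (logic == "append" || logic == "unique") print out
            else if (logic == "sum") print sum
            else if (logic == "avg") print sum / NR
            else if (logic == "min") print min
            else if (logic == "max") print max
        }'
}

# Three regions overlapping one record carry the values 5, 6 and 5.
for rule in first append unique sum min max; do
    printf '%s\t%s\n' "$rule" "$(merge_values "$rule" 5 6 5)"
done
```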
*-m, --mark-sites* [+-]'TAG'::
annotate sites which are present ("+") or absent ("-") in the *-a* file with a new INFO/TAG flag
*--min-overlap* 'ANN':'VCF'::
minimum overlap required as a fraction of the variant in the annotation *-a* file ('ANN'), in the
target VCF file (':VCF'), or both for reciprocal overlap ('ANN:VCF').
By default overlaps of arbitrary length are sufficient.
The option can be used only with the tab-delimited annotation *-a* file and with 'BEG' and 'END'
columns present.
*--no-version*::
see *<<common_options,Common Options>>*
*-o, --output* 'FILE'::
see *<<common_options,Common Options>>*
*-O, --output-type* 'b'|'u'|'z'|'v'[0-9]::
see *<<common_options,Common Options>>*
*--pair-logic* 'snps'|'indels'|'both'|'all'|'some'|'exact'::
Controls how to match records from the annotation file to the target VCF.
Effective only when *-a* is a VCF or BCF. The option replaces the former
unintuitive *--collapse*.
See *<<common_options,Common Options>>* for more.
*-r, --regions* 'chr'|'chr:pos'|'chr:from-to'|'chr:from-'[,...]::
see *<<common_options,Common Options>>*
*-R, --regions-file* 'file'::
see *<<common_options,Common Options>>*
*--regions-overlap* '0'|'1'|'2'::
see *<<common_options,Common Options>>*
*--rename-annots* 'file'::
rename annotations according to the map in 'file', with
"old_name new_name\n" pairs separated by whitespaces, each on a separate
line. The old name must be prefixed with the annotation type:
INFO, FORMAT, or FILTER.
*--rename-chrs* 'file'::
rename chromosomes according to the map in 'file', with
"old_name new_name\n" pairs separated by whitespaces, each on a separate
line.
*-s, --samples* \[^]'LIST'::
subset of samples to annotate, see also *<<common_options,Common Options>>*
*-S, --samples-file* 'FILE'::
subset of samples to annotate. If the samples are named differently in the
target VCF and the *-a, --annotations* VCF, the name mapping can be
given as "src_name dst_name\n", separated by whitespaces, each pair on a
separate line.
*--single-overlaps*::
use this option to keep memory requirements low with very large annotation
files. Note, however, that this comes at a cost: only single overlapping intervals
are considered in this mode. This was the default mode until the commit
af6f0c9 (Feb 24 2019).
*--threads* 'INT'::
see *<<common_options,Common Options>>*
*-x, --remove* 'list'::
List of annotations to remove. Use "FILTER" to remove all filters or
"FILTER/SomeFilter" to remove a specific filter. Similarly, "INFO" can
be used to remove all INFO tags and "FORMAT" to remove all FORMAT tags
except GT. To remove all INFO tags except "FOO" and "BAR", use
"^INFO/FOO,INFO/BAR" (and similarly for FORMAT and FILTER).
"INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".
*--write-index*::
Automatically index the output file
*Examples:*
----
# Remove three fields
bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz
# Remove all INFO fields and all FORMAT fields except for GT and PL
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
# Add ID, QUAL and INFO/TAG, not replacing TAG if already present
bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
# Carry over all INFO and FORMAT annotations except FORMAT/GT
bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
# Annotate from a tab-delimited file with six columns (the fifth is ignored),
# first indexing with tabix. The coordinates are 1-based.
tabix -s1 -b2 -e2 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf
# Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
tabix -s1 -b2 -e3 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
# Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
# Transfer the INFO/END tag, matching by POS,REF,ALT and ID. This example assumes
# that INFO/END is already present in the VCF header.
bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf
# For more examples see http://samtools.github.io/bcftools/howtos/annotate.html
----
[[call]]
=== bcftools call '[OPTIONS]' 'FILE'
This command replaces the former *bcftools view* caller. Some of the original
functionality has been temporarily lost in the process of transition under
http://github.com/samtools/htslib[htslib], but will be added back on popular
demand. The original calling model can be invoked with the *-c* option.
==== File format options:
*--no-version*::
see *<<common_options,Common Options>>*
*-o, --output* 'FILE'::
see *<<common_options,Common Options>>*
*-O, --output-type* 'b'|'u'|'z'|'v'[0-9]::
see *<<common_options,Common Options>>*
*--ploidy* 'ASSEMBLY'['?'][[ploidy]]::
predefined ploidy, use 'list' (or any other unused word) to print a list
of all predefined assemblies. Append a question mark to print the actual
definition. See also *--ploidy-file*.
*--ploidy-file* 'FILE'::
ploidy definition given as a space/tab-delimited list of
CHROM, FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and
correspond to the ones used by *<<samples_file,--samples-file>>*.
The default ploidy can be given using the starred records (see
below), unlisted regions have ploidy 2. The default ploidy definition is
----
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
* * * M 2
* * * F 2
----
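To make the lookup concrete, the sketch below stores the default definition in a hypothetical file and resolves the ploidy for a given chromosome, position and sex. The `ploidy_at` function and its precedence (a listed region first, then the starred default for that sex, then 2) are assumptions made for illustration, not the bcftools implementation.

```shell
# Hypothetical scratch directory and file name, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/ploidy.txt" <<'EOF'
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
* * * M 2
* * * F 2
EOF

# ploidy_at CHROM POS SEX: a matching listed region wins, then the starred
# default for that sex, then 2 for anything unlisted.
ploidy_at() {
    awk -v chr="$1" -v pos="$2" -v sex="$3" '
        $1 == "*" { if ($4 == sex) dflt = $5; next }
        $1 == chr && $4 == sex && pos >= $2 && pos <= $3 { print $5; found = 1; exit }
        END { if (!found) print (dflt != "" ? dflt : 2) }
    ' "$workdir/ploidy.txt"
}

ploidy_at X 3000000 M   # inside a listed male X region
ploidy_at X 100000 M    # between the two listed X regions, falls back to the default
ploidy_at Y 1000 F      # listed female Y region
```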
*-r, --regions* 'chr'|'chr:pos'|'chr:from-to'|'chr:from-'[,...]::
see *<<common_options,Common Options>>*
*-R, --regions-file* 'file'::
see *<<common_options,Common Options>>*
*--regions-overlap* '0'|'1'|'2'::
see *<<common_options,Common Options>>*
*-s, --samples* 'LIST'::
see *<<common_options,Common Options>>*
*-S, --samples-file* 'FILE'::
see *<<common_options,Common Options>>*
*-t, --targets* 'LIST'::
see *<<common_options,Common Options>>*
*-T, --targets-file* 'FILE'::
see *<<common_options,Common Options>>*
*--targets-overlap* '0'|'1'|'2'::
see *<<common_options,Common Options>>*
*--threads* 'INT'::
see *<<common_options,Common Options>>*
*--write-index*::
Automatically index the output file
==== Input/output options:
*-A, --keep-alts*::
output all alternate alleles present in the alignments even if they do not
appear in any of the genotypes
*-f, --format-fields* 'list'::
comma-separated list of FORMAT fields to output for each sample. Currently
GQ and GP fields are supported. For convenience, the fields can be given
as lower case letters. A "^" prefix requests removal of a tag; this applies
to auxiliary tags useful only for calling.
*-F, --prior-freqs* 'AN','AC'::
take advantage of prior knowledge of population allele frequencies. The
workflow looks like this:
----
# Extract AN,AC values from an existing VCF, such as 1000Genomes
bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz
# If the tags AN,AC are not already present, use the +fill-tags plugin
bcftools +fill-tags 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
tabix -s1 -b2 -e2 AFs.tab.gz
# Create a VCF header description, here we name the tags REF_AN,REF_AC
cat AFs.hdr
##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">
# Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
----
*-G, --group-samples* [TAG:]'FILE'|'-'::
by default, all samples are assumed to come from a single population. This option allows grouping samples
into populations and apply the HWE assumption within but not across the populations. 'FILE' is a tab-delimited
text file with sample names in the first column and group names in the second column. If '-' is
given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that
in low coverage data this inflates the rate of false positives.) The *-G* option requires the presence of
per-sample FORMAT/QS or FORMAT/AD tag generated with *bcftools mpileup -a QS* (or *-a AD*).
*-g, --gvcf* 'INT'::
output also gVCF blocks of homozygous REF calls. The parameter 'INT' is the
minimum per-sample depth required to include a site in the non-variant
block.
*-i, --insert-missed* 'INT'::
output also sites missed by mpileup but present in *-T, --targets-file*.
*-M, --keep-masked-ref*::
output sites where REF allele is N
*-V, --skip-variants* 'snps'|'indels'::
skip indel/SNP sites
*-v, --variants-only*::
output variant sites only
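
The 'FILE' read by *-G, --group-samples* can be written with standard shell tools. The following is only a sketch: the sample names, the group names and the file name `groups.txt` are hypothetical placeholders and must be replaced by the names present in your data.

```shell
# Hypothetical sample and population names, one sample per line:
# column 1 = sample name, column 2 = group name, separated by a tab
printf 'NA001\tpopA\nNA002\tpopA\nNA003\tpopB\n' > groups.txt

# Sanity check: report any line that does not have exactly two
# tab-separated columns and fail if such a line is found
awk -F'\t' 'NF != 2 { print "bad line " NR ": " $0; bad = 1 } END { exit bad }' groups.txt \
    && echo "groups.txt: OK"
```

The file would then be passed to the caller as *bcftools call -m -G groups.txt [...]*, on input produced by *bcftools mpileup -a AD* (or *-a QS*).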
==== Consensus/variant calling options:
*-c, --consensus-caller*::
the original *samtools*/*bcftools* calling method (conflicts with *-m*)
*-C, --constrain* 'alleles'|'trio'::
'alleles';;
call genotypes given alleles. See also *-T, --targets-file*.
'trio';;
call genotypes given the father-mother-child constraint. See also
*-s, --samples* and *-n, --novel-rate*.
*-m, --multiallelic-caller*::
alternative model for multiallelic and rare-variant calling designed to
overcome known limitations in *-c* calling model (conflicts with *-c*)
*-n, --novel-rate* 'float'[,...]::
likelihood of novel mutation for constrained *-C* 'trio' calling. The trio
genotype calling maximizes likelihood of a particular combination of
genotypes for father, mother and the child
P(F=i,M=j,C=k) = P(unconstrained) * Pn + P(constrained) * (1-Pn).
By providing three values, the mutation rate Pn is set explicitly for SNPs,
deletions and insertions, respectively. If two values are given, the first
is interpreted as the mutation rate of SNPs and the second is used to
calculate the mutation rate of indels according to their length as
Pn='float'*exp(-a-b*len), where a=22.8689, b=0.2994 for insertions and
a=21.9313, b=0.2856 for deletions [pubmed:23975140]. If only one value is
given, the same mutation rate Pn is used for SNPs and indels.
*-p, --pval-threshold* 'float'::
with *-c*, accept variant if P(ref|D) < 'float'.
*-P, --prior* 'float'::
expected substitution rate, or 0 to disable the prior. Only with *-m*.
*-t, --targets* 'file'|'chr'|'chr:pos'|'chr:from-to'|'chr:from-'[,...]::
see *<<common_options,Common Options>>*
*-X, --chromosome-X*::
haploid output for male samples (requires PED file with *-s*)
*-Y, --chromosome-Y*::
haploid output for males and skips females (requires PED file with *-s*)
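
To get a feeling for the length-dependent indel rate that *-n, --novel-rate* uses when two values are given, the formula Pn='float'*exp(-a-b*len) can be evaluated with `awk`. This is a sketch for illustration only: the function name `indel_pn` is made up here, and the rate of 1 is an arbitrary scale factor chosen to show relative magnitudes, not a recommended setting.

```shell
# indel_pn RATE KIND LEN
#   RATE: the second value given to -n (the 'float' in the formula)
#   KIND: "ins" for insertions, anything else is treated as a deletion
#   LEN:  indel length
# (hypothetical helper, not part of bcftools)
indel_pn() {
    awk -v rate="$1" -v kind="$2" -v len="$3" 'BEGIN {
        # Constants a and b as quoted in the option description
        if (kind == "ins") { a = 22.8689; b = 0.2994 }
        else               { a = 21.9313; b = 0.2856 }
        printf "%.6e\n", rate * exp(-a - b * len)
    }'
}

# Compare a 1-base insertion, a 1-base deletion and a 10-base deletion
indel_pn 1 ins 1
indel_pn 1 del 1
indel_pn 1 del 10
```

With the constants above, a deletion receives a higher rate than an insertion of the same length, and the rate decreases as the indel gets longer.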
[[cnv]]
=== bcftools cnv '[OPTIONS]' 'FILE'
Copy number variation caller, requires a VCF annotated with Illumina's
B-allele frequency (BAF) and Log R Ratio intensity (LRR) values. The HMM
considers the following copy number states: CN 2 (normal), 1 (single-copy
loss), 0 (complete loss), 3 (single-copy gain).
==== General Options:
*-c, --control-sample* 'string'::
optional control sample name. If given, pairwise calling is performed
and the *-P* option can be used
*-f, --AF-file* 'file'::
read allele frequencies from a tab-delimited file with the columns CHR,POS,REF,ALT,AF
*-o, --output-dir* 'path'::
output directory
*-p, --plot-threshold* 'float'::
call *matplotlib* to produce plots for chromosomes with quality at least 'float',
useful for visual inspection of the calls. With *-p 0*, plots for all chromosomes will be
generated. If not given, a *matplotlib* script will be created but not called.
*-r, --regions* 'chr'|'chr:pos'|'chr:from-to'|'chr:from-'[,...]::
see *<<common_options,Common Options>>*
*-R, --regions-file* 'file'::
see *<<common_options,Common Options>>*
*--regions-overlap* '0'|'1'|'2'::
see *<<common_options,Common Options>>*
*-s, --query-sample* 'string'::
query sample name
*-t, --targets* 'LIST'::
see *<<common_options,Common Options>>*
*-T, --targets-file* 'FILE'::
see *<<common_options,Common Options>>*
*--targets-overlap* '0'|'1'|'2'::
see *<<common_options,Common Options>>*
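
The table read by *-f, --AF-file* can be derived from allele counts. A minimal sketch, assuming a tab-delimited input with the columns CHROM,POS,REF,ALT,AN,AC (the layout produced in the *-F, --prior-freqs* workflow of *bcftools call*) and biallelic sites only; the file names `ANAC.tab` and `AF.tab` and the three records are invented for the example:

```shell
# Hypothetical input: CHROM, POS, REF, ALT, AN, AC (biallelic sites only)
printf '1\t1000\tA\tG\t200\t50\n1\t2000\tC\tT\t200\t0\n2\t1500\tG\tA\t0\t0\n' > ANAC.tab

# AF = AC/AN; sites with AN=0 are skipped because the frequency is undefined.
# Output columns: CHR, POS, REF, ALT, AF
awk -F'\t' -v OFS='\t' '$5 > 0 { print $1, $2, $3, $4, $6 / $5 }' ANAC.tab > AF.tab
cat AF.tab
```

Multiallelic records, where AC holds a comma-separated list, are not handled by this sketch and would need to be split first.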
==== HMM Options:
*-a, --aberrant* 'float'[,'float']::
fraction of aberrant cells in query and control. The hallmark of
duplications and contaminations is the BAF value of heterozygous markers
which is dependent on the fraction of aberrant cells. Sensitivity to
smaller fractions of cells can be increased by setting *-a* to a lower value. Note,
however, that this comes at the cost of an increased false discovery rate.
*-b, --BAF-weight* 'float'::
relative contribution from BAF
*-d, --BAF-dev* 'float'[,'float']::
expected BAF deviation in query and control, i.e. the noise observed
in the data.
*-e, --err-prob* 'float'::
uniform error probability
*-l, --LRR-weight* 'float'::
relative contribution from LRR. With noisy data, this option can have a big effect
on the number of calls produced. In truly random noise (such as in simulated data),
the value should be set high (1.0), but in the presence of systematic noise
when LRR are not informative, lower values result in cleaner calls (0.2).
*-L, --LRR-smooth-win* 'int'::
reduce LRR noise by applying a moving average with this window size
*-O, --optimize* 'float'::
iteratively estimate the fraction of aberrant cells, down to the given fraction.
Lowering this value from the default 1.0 to, say, 0.3 can help discover more
events but also increases noise
*-P, --same-prob* 'float'::
the prior probability of the query and the control sample being the same.
Setting to 0 calls both independently, setting to 1 forces the same copy
number state in both.
*-x, --xy-prob* 'float'::
the HMM probability of transition to another copy number state. Increasing this
value leads to smaller and more frequent calls.
[[concat]]
=== bcftools concat '[OPTIONS]' 'FILE1' 'FILE2' [...]
Concatenate or combine VCF/BCF files. All source files must have the same sample
columns appearing in the same order. Can be used, for example, to
concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel
VCF into one. The input files must be sorted by chr and position. The files
must be given in the correct order to produce sorted VCF on output unless
the *-a, --allow-overlaps* option is specified. With the *--naive* option, the files
are concatenated without being recompressed, which is very fast.
*-a, --allow-overlaps*::
First coordinate of the next file can precede last record of the current file.
*-c, --compact-PS*::
Do not output PS tag at each site, only at the start of a new phase set block.
*-d, --rm-dups* 'snps'|'indels'|'both'|'all'|'exact'::
Output duplicate records of specified type present in multiple files only once.
Note that records duplicated within one file are not removed with this option,
for that use *<<norm,bcftools norm -d>>* instead. +
In other words, the default behavior of the program is similar to unix "`cat`" in
that when two files contain a record with the same position, that position will appear
twice on output. With *-d*, every line that finds a matching record in another
file will be printed only once. +
Requires *-a, --allow-overlaps*.
*-D, --remove-duplicates*::
Alias for *-d exact*
*-f, --file-list* 'FILE'::
Read file names from 'FILE', one file name per line.
*-l, --ligate*::
Ligate phased VCFs by matching phase at overlapping haplotypes.
Note that the option is intended for VCFs with perfect overlap: sites
in overlapping regions that are present in one file but missing in the other are dropped.
*--ligate-force*::
Keep all sites and ligate even non-overlapping chunks and chunks with imperfect overlap
*--ligate-warn*::
Drop sites in imperfect overlaps
*--no-version*::
see *<<common_options,Common Options>>*
*-n, --naive*::
Concatenate VCF or BCF files without recompression. This is very fast but requires
that all files are of the same type (all VCF or all BCF) and have the same headers.
This is because all tags and chromosome names in the BCF body rely on the order
of the contig and tag definitions in the header. A header compatibility check
is performed and the program throws an error if it is not safe to use the option.
*--naive-force*::
Same as *--naive*, but header compatibility is not checked. Dangerous, use with caution.
*-o, --output* 'FILE'::
see *<<common_options,Common Options>>*
*-O, --output-type* 'b'|'u'|'z'|'v'[0-9]::
see *<<common_options,Common Options>>*
*-q, --min-PQ* 'INT'::
Break phase set if phasing quality is lower than 'INT'
*-r, --regions* 'chr'|'chr:pos'|'chr:from-to'|'chr:from-'[,...]::
see *<<common_options,Common Options>>*. Requires *-a, --allow-overlaps*.
*-R, --regions-file* 'FILE'::
see *<<common_options,Common Options>>*. Requires *-a, --allow-overlaps*.
*--regions-overlap* '0'|'1'|'2'::
see *<<common_options,Common Options>>*
*--threads* 'INT'::
see *<<common_options,Common Options>>*
*--write-index*::
Automatically index the output file
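
Because the input files must be given in the order in which they should appear on output, it can be convenient to generate the list for *-f, --file-list* explicitly rather than rely on shell globbing, which would typically sort `chr10` before `chr2`. A sketch, assuming hypothetical per-chromosome files named `chr1.vcf.gz` to `chrY.vcf.gz` and a list file called `files.txt`:

```shell
# One file name per line, in the order the chromosomes should appear on output
# (the chrN.vcf.gz naming scheme is an assumption made for this example)
for chr in $(seq 1 22) X Y; do
    printf 'chr%s.vcf.gz\n' "$chr"
done > files.txt

# Show the first few entries and the total number of files listed
head -n 3 files.txt
wc -l < files.txt
```

The list could then be used as *bcftools concat -f files.txt -Oz -o all.vcf.gz*.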
[[consensus]]
=== bcftools consensus '[OPTIONS]' 'FILE'
Create consensus sequence by applying VCF variants to a reference fasta file.
By default, the program will apply all ALT variants to the reference fasta to
obtain the consensus sequence. Using the *--sample* (and, optionally,
*--haplotype*) option will apply genotype (haplotype) calls from FORMAT/GT.
Note that the program does not act as a primitive variant caller and ignores allelic
depth information, such as INFO/AD or FORMAT/AD. For that, consider using the
*setGT* plugin.
*-a, --absent* 'CHAR'::
replace positions absent from VCF with CHAR
*-c, --chain* 'FILE'::
write a chain file for liftover
*-e, --exclude* 'EXPRESSION'::
exclude sites for which 'EXPRESSION' is true. For valid expressions see
*<<expressions,EXPRESSIONS>>*.
*-f, --fasta-ref* 'FILE'::
reference sequence in fasta format
*-H, --haplotype* 'N'|'R'|'A'|'I'|'LR'|'LA'|'SR'|'SA'|'NpIu'::
choose which allele from the FORMAT/GT field to use (the codes are case-insensitive):
'N';;
N={1,2,3,...}, the allele index within the genotype, regardless of phasing
'R';;
the REF allele (in heterozygous genotypes)
'A';;
the ALT allele (in heterozygous genotypes)
'I';;
IUPAC code for all genotypes
'LR, LA';;
the longer allele. If both have the same length, use the REF allele (LR), or the ALT allele (LA)
'SR, SA';;
the shorter allele. If both have the same length, use the REF allele (SR), or the ALT allele (SA)
'NpIu';;
N={1,2,3,...}, the allele index within genotype for phased genotypes and IUPAC code for unphased genotypes.
For example, '1pIu' or '2pIu'
Note that the *-H, --haplotype* option requires the *-s, --samples* option, unless exactly one sample is present in the VCF.
*-i, --include* 'EXPRESSION'::
include only sites for which 'EXPRESSION' is true. For valid expressions see
*<<expressions,EXPRESSIONS>>*.
*-I, --iupac-codes*::
output variants in the form of IUPAC ambiguity codes determined from FORMAT/GT fields. By default all
samples are used and can be subset with *-s, --samples* and *-S, --samples-file*. Use *-s -* to ignore
samples and use only the REF and ALT columns. NOTE: prior to version 1.17 the IUPAC codes were determined solely
from REF,ALT columns and sample genotypes were not considered.
*--mark-del* 'CHAR'::
instead of removing sequence, insert character CHAR for deletions
*--mark-ins* 'uc'|'lc'|'CHAR'::
highlight inserted sequence in uppercase (uc), lowercase (lc), or a provided character CHAR, leaving the rest of the sequence as is
*--mark-snv* 'uc'|'lc'::
highlight substitutions in uppercase (uc), lowercase (lc), or a provided character CHAR, leaving the rest of the sequence as is
*-m, --mask* 'FILE'::
BED file or TAB file with regions to be replaced with N (the default) or as specified by
the next *--mask-with* option. See discussion
of *--regions-file* in *<<common_options,Common Options>>* for file
format details.
*--mask-with* 'CHAR'|'lc'|'uc'::
replace sequence from *--mask* with CHAR, skipping overlapping variants, or change to lowercase (lc) or uppercase (uc)
*-M, --missing* 'CHAR'::
instead of skipping the missing genotypes, output the character CHAR (e.g. "?")
*-o, --output* 'FILE'::
write output to a file
*-s, --samples* 'LIST'::
apply variants of the listed samples. See also the option *-I, --iupac-codes*
*-S, --samples-file* 'FILE'::
apply variants of the samples listed in the file. See also the option *-I, --iupac-codes*
*Examples:*
----
# Apply variants present in sample "NA001", output IUPAC codes for hets
bcftools consensus -I -s NA001 -f in.fa in.vcf.gz > out.fa
# Create consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to". Ignore samples and consider only the REF and ALT columns
samtools faidx ref.fa 8:11870-11890 | bcftools consensus -s - in.vcf.gz -o out.fa
# For more examples see http://samtools.github.io/bcftools/howtos/consensus-sequence.html
----
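
Regions for *-m, --mask* can be given in either a TAB or a BED file. TAB intervals are 1-based and inclusive, whereas BED intervals are 0-based and half-open (see *--regions-file* in *<<common_options,Common Options>>*), so converting between the two only requires shifting the start coordinate. A minimal sketch with `awk`, using invented intervals and the hypothetical file names `lowcov.tab` and `lowcov.bed`:

```shell
# Hypothetical intervals to mask, 1-based inclusive: CHROM, BEG, END
printf '8\t11871\t11880\n8\t11885\t11885\n' > lowcov.tab

# BED is 0-based, half-open: subtract 1 from the start, keep the end as is
awk -F'\t' -v OFS='\t' '{ print $1, $2 - 1, $3 }' lowcov.tab > lowcov.bed
cat lowcov.bed
```

The resulting file could then be applied as *bcftools consensus -f ref.fa -m lowcov.bed in.vcf.gz -o masked.fa*.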