-
Notifications
You must be signed in to change notification settings - Fork 196
Automatic indexing for read mapping and downstream inference
The VG toolkit consists of many different tools, but one of the tasks it is most known for is read mapping. Indeed, VG contains three separate mapping algorithms.
-
vg map
: the original, highly accurate mapping algorithm -
vg giraffe
: the much faster and still accurate haplotype-based mapping algorithm -
vg mpmap
: the splice-aware RNA-seq mapping algorithm that can produce multipath alignments
Each of these mapping tools depends on a sizeable body of research in specialized data structures and algorithms. As a result, they each require a different set of indexes to be built before mapping. Historically, navigating the indexing process has been a pain point for our users. The indexing algorithms and their documentation are spread across several vg
subcommands, and the steps need to be applied in a specific order to produce valid, usable results.
The vg autoindex
subcommand is designed to alleviate the pain involved in this process. Rather than having an interface based on which index you want to produce, it has an interface based on which mapping tool you want to run. The inputs are all common interchange formats like FASTA, VCF, and GFA. Internally, vg autoindex
has the logic of our best practice indexing pipelines built in. Power users might still find need for the individual indexing subcommands, but our goal is to produce indexes for any common use case in a single, easily-understood shell command.
Most invocations for vg autoindex
require very little configuration. The primary arguments are the --workflow
, which indicates which mapper you want to run, and a set of interchange data formats.
# starting from a FASTA and VCF
vg autoindex --workflow map --prefix /path/to/output --ref-fasta reference.fasta --vcf variants.vcf.gz
# starting from a GFA
vg autoindex --workflow map --prefix /path/to/output --gfa graph.gfa
This invocation creates the indexes required for mapping with vg mpmap
as well as indexes that are used for transcript inference using rpvg
.
vg autoindex --workflow mpmap --prefix /path/to/output --ref-fasta reference.fasta --vcf variants.vcf.gz \
--tx-gff transcripts.gff
vg giraffe
is haplotype-based and performs best with phased variants, but indexes can still be constructed with an unphased VCF.
vg autoindex --workflow giraffe --prefix /path/to/output --ref-fasta reference.fasta --vcf variants.vcf.gz
Most computational parameters used in vg autoindex
are determined automatically, but users may need to manually adjust a few of them. vg autoindex
assumes that a fairly large amount of disk space will be available, so the --tmp-dir
should be directed to a large volume. It is reasonable to expect the indexing for a eukaryotic pangenome to require 100s of GB of temporary storage and 10s-100s of GB of permanent storage (directed with --prefix
). The --threads
argument can limit the number of threads used for shared compute environments. Large jobs tend to be inconveniently slow with less than 8 threads. Finally, vg autoindex
provides an interface to restrict memory usage by setting a target maximum memory occupancy (--target-mem
). This target should be interpreted as an approximate limit (in general, it is hard to robustly predict future memory usage), but it can be decreased to encourage greater memory thrift at the expense of reduced ability to utilize available threads.