Variation graphs

This page is intended to explain variation graphs to non-computer scientists and people new to the field. Also see our explainer videos.

How to represent a pangenome

A pangenome is a collection of genome sequences and the homology (similarity due to shared ancestry) among them. There are many potential ways of representing a pangenome. Naively, a pangenome could be stored in a file containing the full haplotype sequences of all of the assemblies.

three haplotypes represented as a matrix, with haplotype labels on the rows, each column being a position, and the DNA base filled in for each cell; sequences are h1: AAATAGAATCCACACCTTTTAACTAAACGGTAGGCTG, h2: AAATAGTATCCAC---------CTAAACGGTACGCTG, and h3: AAATAGTATCCACACCTATTAACTAAACGGTA-GCTG

But at 3 billion base pairs for a human genome, and hundreds of genomes per pangenome, a file like this will quickly become too big to work with efficiently. Additionally, due to the similarity of human genomes, there will be a lot of redundant sequence that is stored multiple times. A more compact representation of the pangenome is to store the sequences common to all genomes only once. Then for each haplotype in the pangenome, store only the sequence at each site of variation.

the same matrix but with identical sequences moved to a new top row and blank space left in the other rows
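
The compaction described above can be sketched in a few lines of Python. This is only an illustration, not how any particular tool stores a pangenome: the function name `split_into_blocks` and the block layout are made up for this example, and it assumes the haplotypes are already aligned to equal length with no column that is a gap in every haplotype.

```python
from itertools import groupby

# The three aligned haplotypes from the example above; '-' marks a gap.
alignment = {
    "h1": "AAATAGAATCCACACCTTTTAACTAAACGGTAGGCTG",
    "h2": "AAATAGTATCCAC---------CTAAACGGTACGCTG",
    "h3": "AAATAGTATCCACACCTATTAACTAAACGGTA-GCTG",
}


def split_into_blocks(alignment):
    """Group alignment columns into runs that are either shared or variable."""
    names = list(alignment)
    columns = zip(*alignment.values())  # one tuple of bases per position
    blocks = []
    for is_shared, run in groupby(columns, key=lambda col: len(set(col)) == 1):
        run = list(run)
        if is_shared:
            # Every haplotype agrees here, so store the sequence once.
            blocks.append("".join(col[0] for col in run))
        else:
            # Store one allele per haplotype, dropping gap characters.
            blocks.append({
                name: "".join(col[i] for col in run).replace("-", "")
                for i, name in enumerate(names)
            })
    return blocks


blocks = split_into_blocks(alignment)
for block in blocks:
    print(block)
```

Each string in `blocks` is sequence common to all three haplotypes, and each dictionary is a site of variation. An empty allele means that the haplotype has a deletion at that site.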

This collapsing of homologous sequences is the basis for creating a variation graph. In a variation graph, a node represents a nucleotide sequence and an edge occurs between sequences that can be connected. Homologous sequences in the pangenome are collapsed into a single node, and variants unique to each genome become separate nodes. Edges occur between nodes that are adjacent in the original sequences.

the matrix as a graph, with homologous sequences now being nodes chained together, with each alternative sequence being a node connected by edges in between
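
As a rough sketch of this construction, the code below turns the shared and variable blocks of the example alignment into nodes, edges, and one walk per haplotype. The block list is written out by hand from the example sequences. The function `build_graph` and the choice to number nodes in order of first appearance are assumptions made for illustration, so the numbering may not match every figure.

```python
# Blocks of the example alignment. A string is sequence shared by all
# haplotypes; a dict gives each haplotype's allele at a variable site
# ("" means the haplotype has a deletion there).
blocks = [
    "AAATAG",
    {"h1": "A", "h2": "T", "h3": "T"},
    "ATCCAC",
    {"h1": "ACCTTTTAA", "h2": "", "h3": "ACCTATTAA"},
    "CTAAACGGTA",
    {"h1": "G", "h2": "C", "h3": ""},
    "GCTG",
]
haplotypes = ["h1", "h2", "h3"]


def build_graph(blocks, haplotypes):
    """Return (nodes, edges, paths) for a list of shared/variable blocks."""
    nodes = {}                            # node id -> sequence
    paths = {h: [] for h in haplotypes}   # haplotype -> list of node ids
    for block in blocks:
        if isinstance(block, str):
            # Shared sequence: one node visited by every haplotype.
            node_id = len(nodes) + 1
            nodes[node_id] = block
            for h in haplotypes:
                paths[h].append(node_id)
        else:
            # Variable site: one node per distinct non-empty allele.
            allele_ids = {}
            for h in haplotypes:
                allele = block[h]
                if not allele:
                    continue  # deletion: this haplotype skips the site
                if allele not in allele_ids:
                    allele_ids[allele] = len(nodes) + 1
                    nodes[allele_ids[allele]] = allele
                paths[h].append(allele_ids[allele])
    # An edge joins every pair of nodes that are adjacent in some haplotype.
    edges = set()
    for walk in paths.values():
        edges.update(zip(walk, walk[1:]))
    return nodes, edges, paths


nodes, edges, paths = build_graph(blocks, haplotypes)
print(len(nodes), "nodes,", len(edges), "edges")
print(paths)
```

Note that the deletions need no nodes of their own. They appear only as edges that skip a site, such as the edge from node 4 to node 7 taken by haplotype 2.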

The graph can be collapsed further if there is homology within variants at one site. For example, haplotypes 1 and 3 carry insertions that differ by a single SNP. The homologous sequences in nodes 5 and 6 can be collapsed to form a nested site of variation, representing the SNP nested within an insertion.

a modified version of the previous graph, with the two separate insertions collapsed into one subchain with a SNP variant site/bubble in the middle
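
One simple way to find this kind of nested homology between two alleles is to peel off their common prefix and common suffix. The sketch below does this for the two insertion alleles from the example. The helper names are hypothetical, and real graph construction tools use more general alignment methods than this.

```python
def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def split_nested(a, b):
    """Split two alleles into (shared prefix, (middle_a, middle_b), shared suffix)."""
    p = common_prefix_length(a, b)
    a_rest, b_rest = a[p:], b[p:]
    # Compare the remainders back to front to find the shared suffix.
    s = common_prefix_length(a_rest[::-1], b_rest[::-1])
    middle_a = a_rest[:len(a_rest) - s]
    middle_b = b_rest[:len(b_rest) - s]
    suffix = a_rest[len(a_rest) - s:]
    return a[:p], (middle_a, middle_b), suffix


# The insertion alleles carried by haplotypes 1 and 3.
prefix, middles, suffix = split_nested("ACCTTTTAA", "ACCTATTAA")
print(prefix, middles, suffix)
```

The shared prefix and suffix become single nodes, and the two one-base middles become the alleles of the nested SNP.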

The original haplotype sequences can be found by concatenating the sequences in nodes. For example, haplotype 1 can be found by taking node 1, node 2, node 4, node 5, node 7, node 8, and node 10. A sequence of nodes like this is known as a walk or a path through the graph. For a path through the graph to be valid, there must exist an edge between each pair of consecutive nodes. For example, there is no valid path walking from node 1 directly to node 4 without taking node 2 or node 3.

the graph with h1's path traced out, showing a node sequence which forms the h1 sequence
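
The two rules above, that every consecutive pair of nodes must share an edge and that a path spells a sequence by concatenation, can be checked with a short sketch. The node sequences and edges below were worked out by hand from the example alignment, using the node numbers quoted for haplotype 1. The names `is_valid_path` and `spell` are made up for this illustration.

```python
# Node sequences and edges of the example graph (before nested collapsing).
nodes = {
    1: "AAATAG", 2: "A", 3: "T", 4: "ATCCAC", 5: "ACCTTTTAA",
    6: "ACCTATTAA", 7: "CTAAACGGTA", 8: "G", 9: "C", 10: "GCTG",
}
edges = {
    (1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (4, 7),
    (5, 7), (6, 7), (7, 8), (7, 9), (7, 10), (8, 10), (9, 10),
}


def is_valid_path(path, edges):
    """A path is valid if every pair of consecutive nodes is joined by an edge."""
    return all(step in edges for step in zip(path, path[1:]))


def spell(path, nodes, edges):
    """Concatenate node sequences along a valid path."""
    if not is_valid_path(path, edges):
        raise ValueError(f"not a valid path: {path}")
    return "".join(nodes[node_id] for node_id in path)


h1_path = [1, 2, 4, 5, 7, 8, 10]
print(spell(h1_path, nodes, edges))

# Jumping from node 1 straight to node 4 skips the SNP site, so it is rejected.
print(is_valid_path([1, 4, 5], edges))
```

Spelling the haplotype 1 path gives back the ungapped haplotype 1 sequence from the start of this page.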

This structure is a variation graph.

Variation graphs

A variation graph is a sequence graph (the nodes and edges) and a collection of haplotype paths through the graph.

The sequence graph model used by the vg toolkit is a bi-directed graph with some extra restrictions. Nodes in sequence graphs have two sides, which are arbitrarily labelled as the left and right node side. Edges connect pairs of node sides. A valid path through the graph must enter and exit each node through opposite node sides. This is intuitive if we consider a node traversal to be a reading of the sequence. We cannot visit a node without reading its sequence, and the sequence must be read either from left to right or from right to left. A right-to-left traversal of a node corresponds to the reverse complement of its sequence. A valid path must specify the orientation of each node in the path. For example, the blue path representing haplotype 1 above would be node 1 traversing forwards, node 2 traversing forwards, etc. We refer to a node and orientation as a visit or a traversal of that node.
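
To make node sides and orientations concrete, here is a minimal sketch of a bi-directed graph in plain Python. It is not the vg data model or API. The representation (an edge as an unordered pair of `(node, side)` tuples, a visit as a `(node, is_reverse)` tuple) and all function names are assumptions chosen for this example. It uses the first three nodes of the haplotype 1 path.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


nodes = {1: "AAATAG", 2: "A", 4: "ATCCAC"}

# Each edge joins two node sides and has no direction of its own.
edges = {
    frozenset({(1, "R"), (2, "L")}),
    frozenset({(2, "R"), (4, "L")}),
}


def entry_side(visit):
    node_id, is_reverse = visit
    return (node_id, "R" if is_reverse else "L")


def exit_side(visit):
    node_id, is_reverse = visit
    return (node_id, "L" if is_reverse else "R")


def is_valid_path(path, edges):
    """Each step must leave one visit's exit side and arrive at the next visit's entry side."""
    return all(
        frozenset({exit_side(a), entry_side(b)}) in edges
        for a, b in zip(path, path[1:])
    )


def spell(path, nodes):
    """Read each node forwards, or as its reverse complement when visited in reverse."""
    return "".join(
        reverse_complement(nodes[n]) if is_reverse else nodes[n]
        for n, is_reverse in path
    )


forward = [(1, False), (2, False), (4, False)]
# The same walk taken backwards: reversed order, every orientation flipped.
backward = [(n, not is_reverse) for n, is_reverse in reversed(forward)]

print(is_valid_path(forward, edges), spell(forward, nodes))
print(is_valid_path(backward, edges), spell(backward, nodes))
```

Walking the same path backwards is also valid, and it spells the reverse complement of the forward sequence. By contrast, a walk such as `[(1, False), (2, True)]` is invalid here, because no edge joins the right side of node 1 to the right side of node 2.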

Variation graphs can also represent complex variants such as duplications and inversions. For more details about variant representation, please read Snarls and chains.

the same graph, with an edge added to connect the end of node 8 (end of the insertion) to the start of node 5 (start of the insertion), and an added inversion between node 12 (end of original graph) and a new node 14; in between is a node 13 which has both ends connected to each of 12's end and 14's start

In this example, the insertion represented by nodes 5-8 can be duplicated by taking the edge between the left side of node 5 and the right side of node 8. Node 13 represents an inversion; a path from node 12 to node 14 can traverse node 13 forwards (left-to-right) or backwards (right-to-left).
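
The sketch below spells out a duplication walk and an inversion walk on a small bi-directed graph. Much of it is assumed for illustration: the sequences of nodes 5 to 8 and 12 are inferred from the example alignment, the sequences of nodes 13 and 14 are invented, and the edge and visit representation is not the vg API.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Nodes 5-8 and 12 are inferred from the example; 13 and 14 are invented.
nodes = {5: "ACCT", 6: "T", 7: "A", 8: "TTAA", 12: "GCTG", 13: "AAC", 14: "TT"}


def edge(a, a_side, b, b_side):
    return frozenset({(a, a_side), (b, b_side)})


edges = {
    edge(5, "R", 6, "L"), edge(5, "R", 7, "L"),      # nested SNP
    edge(6, "R", 8, "L"), edge(7, "R", 8, "L"),
    edge(8, "R", 5, "L"),                            # duplication edge
    edge(12, "R", 13, "L"), edge(13, "R", 14, "L"),  # node 13 read forwards
    edge(12, "R", 13, "R"), edge(13, "L", 14, "L"),  # node 13 read backwards
}


def is_valid_path(path):
    """Visits are (node, is_reverse); each step must follow an edge between sides."""
    for (a, a_rev), (b, b_rev) in zip(path, path[1:]):
        leaving = (a, "L" if a_rev else "R")
        entering = (b, "R" if b_rev else "L")
        if frozenset({leaving, entering}) not in edges:
            return False
    return True


def spell(path):
    return "".join(
        nodes[n].translate(COMPLEMENT)[::-1] if is_reverse else nodes[n]
        for n, is_reverse in path
    )


# Duplication: traverse the insertion, take the edge back to node 5, traverse it again.
duplicated = [(5, False), (6, False), (8, False), (5, False), (7, False), (8, False)]
# Inversion: the same three nodes, with node 13 visited forwards or backwards.
not_inverted = [(12, False), (13, False), (14, False)]
inverted = [(12, False), (13, True), (14, False)]

for path in (duplicated, not_inverted, inverted):
    print(is_valid_path(path), spell(path))
```

In the duplication walk the two copies of the insertion take different branches of the nested SNP, which shows that each pass through a cycle is free to choose its own alleles.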
