Effect of kmer size

Ryan Wick edited this page Apr 29, 2016 · 3 revisions

The structure of an assembly graph is highly dependent on the k-mer size used for assembly. Small k-mers result in shorter contigs with lots of connections, while large k-mers can result in longer contigs with fewer connections.

The ideal k-mer size depends on the read length, the read depth and the sequence complexity. If you have longer reads and/or higher read depth, you can use larger k-mers, which are useful for resolving complex areas of the graph. Conversely, if you have shorter reads and/or lower read depth, you may have to use shorter k-mers.

When assembling 100 bp reads in Velvet, a k-mer of 61 is a good starting point, which you can then adjust up or down as needed. SPAdes conducts assembly multiple times using different k-mers, so you can look at the FASTG files for each assembly (in folders named like K21, K33, etc.) to find the best graph for viewing in Bandage.
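To compare the per-k graphs from a SPAdes run, it can help to list the K folders in numerical order. The following is a minimal sketch, not part of SPAdes or Bandage: the function name `list_kmer_graphs` is hypothetical, and because the exact FASTG file name is not given here, it simply collects any `.fastg` file found inside each `K<number>` folder.

```python
import re
from pathlib import Path


def list_kmer_graphs(spades_dir):
    """Return (k, [FASTG paths]) pairs for each K<number> folder, sorted by k."""
    results = []
    for sub in Path(spades_dir).iterdir():
        match = re.fullmatch(r"K(\d+)", sub.name)
        if sub.is_dir() and match:
            # Assumption: any *.fastg file in the folder is a graph of interest.
            graphs = sorted(sub.glob("*.fastg"))
            results.append((int(match.group(1)), graphs))
    # Sort numerically so that K127 comes after K33, not before it.
    return sorted(results, key=lambda pair: pair[0])
```

Calling `list_kmer_graphs("path/to/spades_output")` (a placeholder path) gives one entry per k-mer size, and each listed file can then be opened in Bandage for comparison.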

If your graph consists of many separate disconnected subgraphs (i.e. there are many small groups of contigs that have no connections to the rest of the graph), then your k-mer size may be too large. Alternatively, if your graph is connected (i.e. all contigs are tied together in a single graph structure) but is very dense and tangled, then your k-mer size may be too small.
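The "many disconnected subgraphs" symptom can be checked by counting connected components. The sketch below assumes you have already extracted a list of contig names and a list of contig-to-contig links from your graph file; the function name and the toy data are hypothetical, and edge direction is ignored.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Iterative depth-first search from each unvisited node.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Toy graph: one 3-contig subgraph, one 2-contig subgraph, one isolated contig.
nodes = ["c1", "c2", "c3", "c4", "c5", "c6"]
edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
components = connected_components(nodes, edges)
print(len(components), "components; largest has",
      max(len(c) for c in components), "contigs")
# prints: 3 components; largest has 3 contigs
```

A large number of small components points towards a k-mer size that is too large, while a single component with many links per contig points towards one that is too small. No cut-off values are suggested here, since what counts as "many" depends on the genome.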

Velvet example

For this example I assembled a Salmonella genome from 100 bp Illumina reads using Velvet. Which graph is best depends on your priorities and which sequences you are interested in, though the 61-mer and 71-mer graphs are both pretty good.

51-mer assembly

This k-mer size is too small, resulting in a complex and tangled graph with 4618 nodes and 6070 edges.

51-mer assembly graph

61-mer assembly

This graph is better than the 51-mer graph – it is much less complex (1357 nodes and 1768 edges) but still has very few dead ends.

61-mer assembly graph

71-mer assembly

While the complexity of the graph has improved (it has 611 nodes and 765 edges), it now shows many more dead ends.

71-mer assembly graph

81-mer assembly

Compared to the 71-mer graph, the complexity has improved slightly (it has 490 nodes and 512 edges), but the graph has broken into many disconnected parts.

81-mer assembly graph

91-mer assembly

This graph has 2386 nodes and 304 edges and mostly consists of disconnected nodes. This k-mer size is definitely too large.

91-mer assembly graph
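The node and edge counts from the five graphs above can be put side by side. The sketch below computes edges per node as a rough indicator of connectivity; this ratio is an illustrative measure chosen here, not something reported by Velvet or Bandage.

```python
# (k-mer size, nodes, edges) as reported for the five Velvet graphs above.
graphs = [
    (51, 4618, 6070),
    (61, 1357, 1768),
    (71, 611, 765),
    (81, 490, 512),
    (91, 2386, 304),
]


def edges_per_node(nodes, edges):
    """Rough connectivity measure: average number of edges per node."""
    return edges / nodes


for k, nodes, edges in graphs:
    print(f"k={k}: {nodes:>4} nodes, {edges:>4} edges, "
          f"{edges_per_node(nodes, edges):.2f} edges per node")
```

The ratio stays near 1.3 for the 51-mer and 61-mer graphs, drops to about 1.25 and 1.04 for the 71-mer and 81-mer graphs, and collapses to about 0.13 for the 91-mer graph, where most nodes have no connections at all.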

SPAdes

SPAdes doesn't use a single k-mer size per assembly but rather a range of k-mer sizes, where each subsequent graph is built on the previous one. As a result, SPAdes graphs are less prone to breaking apart at high k-mers than Velvet graphs. But even in SPAdes, k-mer ranges that go too high can result in poorer assemblies.

A maximum k-mer size of about 80% of the read length may yield good results in SPAdes. For example, if assembling 100 bp reads, a k-mer range of 21,33,55,77 may work well. If assembling somewhat longer reads (such as 300 bp MiSeq reads), you can try going all the way to the SPAdes maximum k-mer of 127.
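The 80% guideline can be written as a small helper. This is only a sketch of the rule of thumb above: the function name is hypothetical, and the candidate ladder of k values is an assumption that extends the 21,33,55,77 example with 99 and 127.

```python
SPADES_MAX_K = 127  # SPAdes maximum k-mer size, as stated above
CANDIDATE_KS = (21, 33, 55, 77, 99, 127)  # assumed ladder of odd k values


def suggest_kmer_range(read_length, fraction=0.8):
    """Return candidate k values no larger than fraction * read_length."""
    limit = min(int(read_length * fraction), SPADES_MAX_K)
    return [k for k in CANDIDATE_KS if k <= limit]


for read_length in (100, 300):
    ks = suggest_kmer_range(read_length)
    print(f"{read_length} bp reads: -k {','.join(map(str, ks))}")
# prints:
# 100 bp reads: -k 21,33,55,77
# 300 bp reads: -k 21,33,55,77,99,127
```

The resulting comma-separated list is in the form accepted by the SPAdes `-k` option. Treat it as a starting point and compare the resulting graphs in Bandage before settling on a range.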