Skip to content

Subtleties of pairwise alignments

Frédéric Mahé edited this page Nov 27, 2022 · 4 revisions

As pointed by Géraldine Pascal and Frédéric Escudié (from Inra Toulouse, France), when using d = 2 or greater d values, there are borderline cases where the number of differences d as identified by the pairwise alignment is not as expected.

Here is an example (I = insertion, D = deletion, S = substitution):

A) G*ACGCCG      GACGCCG
   .I....D.  or  .SSSS..
B) GCACGC*G      GCACGCG

When compared to sequence A, sequence B has an insertion and a deletion (d = 2). When the two indels are close-by (3 or 4 nucleotides), the pairwise alignment considers that it is less costly (i.e. "optimal") to consider that the two sequences have 4 mismatches (d = 4). The matchs/mismatches, gap opening and gap extension costs used in swarm are the same than in other pairwise alignment softwares. These parameters were established empirically and acknowledge the fact that indels are less frequent than substitutions in natural sequences. This is also the case in sequences produced by Illumina: sequencing errors are dominantly substitutions, and insertions/deletions are far less frequent.

So, three or four mismatches are preferred over two (close-by) indels events in pairwise alignment. Is it a problem? and how to change that?

First, these borderline cases are limited in scope (only close-by indels), and should not have a significant impact on global clustering results. In many cases, intermediate amplicons will link the amplicons A and B, eliminating the problem. However, If you feel adventurous or know what you are doing, you can change the cost of gap openings by setting --gap-opening-penalty 0.

Tweaking the pairwise alignment parameters is not the easiest way to go. I recommend working at d = 1 (where the pairwise alignment problem disappears) and to use the --fastidious option. The fastidious option will graft small clusters onto big ones by postulating the existence of intermediate amplicons, mimicking in practice a clustering with d = 2, but with a better resolution. Borderline cases like described above should disappear when doing so.