'Cleverly' align sets of redundant sequences #64

AndreaGuarracino · 2023-01-25T17:50:48Z

In our applications, it can happen that SPOA is given sets that have many duplicate sequences. For example, this multi-FASTA

smoothxg_into_spoa_pad311_621639_in_1884956ms.zip

has 9280 sequences, of which 2416 are unique.

Is there a way to tweak SPOA so that it only works on the 2416 unique sequences, but weighs them properly with respect to their frequencies in the non-deduplicated set? I suspect it could be done, at least theoretically. We would need this feature when using SPOA as a submodule in other projects. The aim is to avoid redundant work while keeping consensus sequences that make sense.

rvaser · 2023-01-27T11:33:25Z

Hi Andrea,
finding which sequences are identical sounds to me like a preprocessing step that is not directly tied to the SPOA library. If you know which sequences are duplicates (either from prior knowledge or determined with an algorithm), you can add each of them only once to the POA graph with an increased weight (either coverage or the per-base sum of quality values).

Best regards,
Robert

AndreaGuarracino · 2023-01-31T02:36:04Z

Ah, I hadn't noticed that I can already set the weight of the sequence every time I add a new alignment (https://github.com/rvaser/spoa/blob/master/include/spoa/graph.hpp#L137). So the only thing that needed to be "cleverized" was me! Thank you for your prompt reply! Keep pushing in POA (random ping #31).

AndreaGuarracino closed this as completed Jan 31, 2023
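
For readers landing here later, the preprocessing step discussed above can be sketched roughly as follows. This is only a minimal illustration in Python, not code from SPOA or smoothxg: the helper names `read_fasta` and `collapse_duplicates` are made up for this example, and the idea that each count is then passed as the weight when the sequence's alignment is added to the graph is an assumption based on the `graph.hpp` link in the comment above.

```python
from collections import Counter


def read_fasta(text):
    """Return the sequences of a multi-FASTA string, in file order.

    Hypothetical helper: headers are ignored and records with an empty
    sequence are skipped.
    """
    sequences, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                sequences.append("".join(chunks))
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        sequences.append("".join(chunks))
    return sequences


def collapse_duplicates(sequences):
    """Return (sequence, count) pairs, one per unique sequence.

    Pairs come out in first-seen order, since Counter keeps insertion order.
    """
    return list(Counter(sequences).items())


fasta = ">a\nACGT\n>b\nACGT\n>c\nAC\nGA\n>d\nACGT\n"
weighted = collapse_duplicates(read_fasta(fasta))

# Assumption: each unique sequence is aligned and added to the POA graph
# once, with its count used as the weight of that alignment, instead of
# being aligned `count` times.
for sequence, count in weighted:
    print(sequence, count)
```

Only exact duplicates are collapsed here; the counts sum to the size of the original set, so the total weight in the graph matches what the non-deduplicated input would have contributed in coverage terms.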
