\documentclass[officiallayout]{tktla}
%\documentclass[officiallayout,a4frame]{tktla}
\usepackage[utf8]{inputenc}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage[
backend=biber,
bibstyle=ieee,
citestyle=numeric-comp,
sortlocale=en_US,
natbib=true,
url=false,
doi=false,
isbn=false,
eprint=false
]{biblatex}
\usepackage{software-biblatex}
\usepackage{pdfpages}
\usepackage[hidelinks]{hyperref}
% Math environments and symbols
\usepackage{amsmath}
\usepackage{amsfonts}
% Always place floats inside their respective sections
\usepackage[section]{placeins}
% Independence symbol
\newcommand\indept{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
%% Hunting those pesky unicode characters
%%\DeclareUnicodeCharacter{0301}{*************************************}
% For thesis papers section
\usepackage{geometry}
\def \dvWHITE{white}
\def \dvBLACK{black}
\def \dvBLUE{blue}
\def \dvGREEN{green}
\def \dvheight{231pt}
% Creates black box with the text given as first parameter in white
\newcommand\note[3] {\marginpar{\vspace{#2}\colorbox{#3}{\parbox[c][\dvheight][t]{34.8pt}{\vspace{0.3cm}\color{white}\centering\Huge{\textbf{#1}}}}}}
% Footnotes without numbering
\let\svthefootnote\thefootnote
\addbibresource{maklin2022.bib}
\title{Probabilistic Methods for \\ High-Resolution Metagenomics}
\author{Tommi M\"aklin} \authorcontact{tommi.maklin@helsinki.fi\par
\url{https://maklin.fi/}} \pubtime{October}{2022} \reportno{12}
\isbnpaperback{978-951-51-8694-2} \isbnpdf{978-951-51-8695-9}
\issn{1238-8645} \issnonline{2814-4031} \printhouse{Unigrafia}
\pubpages{86+94 pages} % --- remember to update this!
% For monographs, the number of the last page of the list of references
% For article-based theses, the number of the last page of the list of
% references of the preamble part + the total number of the pages of
% the original articles and interleaf pages.
\supervisorlist{Antti Honkela, University of Helsinki, Finland \\ \hspace{8pt} Jukka Corander, University of Helsinki, Finland}
\preexaminera{Ashlee Earl, Broad Institute of MIT and Harvard, USA}
\preexaminerb{Tommi Vatanen, University of Helsinki, Finland}
\opponent{Leo Lahti, University of Turku, Finland}
\custos{Antti Honkela, University of Helsinki, Finland}
\generalterms{Algorithms, Experimentation}
\additionalkeywords{genomic epidemiology, plate sweeps, probabilistic modeling, pathogen surveillance, taxonomic profiling, taxonomic binning, metagenomics}
% Computing Reviews 1998 style
%\crcshort{A.0, C.0.0}
%\crclong{
%\item[A.0] Example Category
%\item[C.0.0] Another Example
%}
% Computing Reviews 2012 style
\crclong{
\item Mathematics of computing $\rightarrow$ Probability and statistics $\rightarrow$ Statistical computing
\item Computing applications $\rightarrow$ Biosciences
}
\permissionnotice{
Doctoral dissertation, to be presented for public examination with
the permission of the Faculty of Science of the University of
Helsinki in Auditorium CK112, Exactum building, Kumpula campus on the 28th of October 2022 at 12 o'clock.
}
\newtheorem{theorem}{Theorem}[chapter]
\newenvironment{proof}{\noindent\textbf{Proof.} }{$\Box$}
\begin{document}
\frontmatter
\maketitle
\begin{abstract}
Metagenomics is the analysis of DNA sequencing data from samples
obtained directly from the environment and containing several
different organisms at once. Common tasks in metagenomics are
taxonomic profiling, where the goal is to identify the organisms
present in the sample and assign relative abundances to them, and
taxonomic binning, where the sequencing data from the sample is
divided into bins that correspond to some sensible taxonomic
units. This thesis introduces methods for performing these two tasks
at a high resolution capable of distinguishing between lineages of
bacterial species. The first of these methods is mSWEEP, which
solves the profiling task by utilizing a collection of grouped
bacterial reference sequences, pseudoalignment, and a probabilistic
model. The second method, mGEMS, builds upon mSWEEP to solve the
binning task using an assignment rule derived from the fundamentals
of the probabilistic model used by mSWEEP. Both methods are
accompanied by efficient implementations that utilize fast
variational inference and pseudoalignment to fit the model in a
reasonable time, rendering them applicable to large-scale datasets.
Both mSWEEP and mGEMS have been developed for application in either
the traditional whole community metagenomics context, where the
direct-from-environment samples are analysed, or in the plate sweep
metagenomics context, where the sample has been plated once on a
selective medium. While the latter is not metagenomics in the
traditional sense, this thesis advocates for its use when high depth
sequencing data is required from some species and the other
organisms are not of interest. Regardless of the type of
metagenomics data used, the ultimate goal of both mSWEEP and mGEMS
is to enable performing standard genomic epidemiological analyses
directly from data containing several strains of the same bacteria,
skipping the typically used isolation steps required to separate
them. Due to the implied cost savings from reducing the number of
cultures that need to be performed as well as the better capture of
variation in the samples through using metagenomics data, mSWEEP and
mGEMS enable performing entirely novel types of analyses in the
field of genomic epidemiology.
\end{abstract}
\begin{acknowledgements}
First I would like to express my sincere thanks to my supervisors
Antti Honkela and Jukka Corander. Without your guidance and
occasional prodding, none of the experiences I have been fortunate
enough to collect during this journey would have been
possible. You have been extraordinarily helpful when needed and
necessarily stern when required.
Like many great things in life, the research in this thesis is the
result of collective work. Accordingly, I want to thank all of the
incredible individuals who made the bigger picture manifest by
collaborating and coauthoring with me. Without all of you, both the
results and the road there would have been a much rougher ride.
In addition to my collaborators, I wish to recognize the impact of
my colleagues in Antti's group who I have had the pleasure of
working alongside for the past six years and who have provided excellent
company for extensive lunch \& coffee breaks and certain
extracurricular activities. Although our academically measurable
contributions were limited, the impact of a supportive environment
is heartfelt and I owe you my gratitude. I also want to thank all the
other brilliant people who I've met during my doctoral studies and
other activities at the University of Helsinki, and who I've had the
pleasure of chatting, advocating, and simply existing with. The
University of Helsinki, and the Academy of Finland, deserve my
additional gratitude for funding my studies through various
projects.
Finally, I wish to thank all my friends who have been there for
me during these past years, and my family for their unwavering
support.
\begin{flushright}
Helsinki,\\October 2022\\
Tommi M\"aklin
\end{flushright}
\end{acknowledgements}
\tableofcontents
\chapter*{List of original publications\markboth{List of original publications}{}}
\subsection*{Publication I \textemdash{ } High-resolution sweep metagenomics using fast probabilistic inference}
By \underline{Tommi M\"aklin}, Teemu Kallonen, Sophia David, Christine
J Boinett, Ben Pascoe, Guillaume M\'eric, David M Aanensen, Edward J
Feil, Stephen Baker, Julian Parkhill, Samuel K Sheppard, Jukka
Corander, and Antti Honkela. Published in \textit{Wellcome Open
Research} (2021), 5:14. \\doi:
\href{https://doi.org/10.12688/wellcomeopenres.15639.2}{10.12688/wellcomeopenres.15639.2}.
\subsection*{Publication II \textemdash{ } Bacterial genomic epidemiology with mixed samples}
By \underline{Tommi M\"aklin}, Teemu Kallonen, Jarno Alanko, \O rjan
Samuelsen, Kristin Hegstad, Veli M\"akinen, Jukka Corander, Eva Heinz,
and Antti Honkela. Published in \textit{Microbial Genomics} (2021)
7:11. \\doi:
\href{https://doi.org/10.1099/mgen.0.000691}{10.1099/mgen.0.000691}.
\subsection*{Publication III \textemdash{ } Strong pathogen competition in neonatal gut colonization}
By \underline{Tommi M\"aklin}, Harry A Thorpe, Anna K P\"ontinen,
Rebecca A Gladstone, Yan Shao, Maiju Pesonen, Alan McNally, P\aa l J
Johnsen, \O rjan Samuelsen, Trevor D Lawley, Antti Honkela, and Jukka
Corander. Submitted; preprint available from \textit{bioRxiv}
(2022). \\doi:
\href{https://doi.org/10.1101/2022.06.19.496579}{10.1101/2022.06.19.496579}.
\mainmatter
\chapter{Introduction}
\sloppy
Public health research focusing on bacterial pathogens has been
transformed by analysis of the contents of bacterial genomes obtained
by whole-genome sequencing (WGS) since 2010
\citep{armstrong2019pathogen}. In this time frame, the price of
sequencing has decreased tremendously \citep{dnaseqcost,
goodwin2016coming}, enabling adoption of sequencing as a standard
tool in the infectious disease, evolutionary, and genomic epidemiology
toolkits \citep{tang2017infection, grad2014epidemiologic,
kwong2015whole}. Many of the standard analyses in these fields
require data from pure bacterial cultures, created by isolating a
bacterium from an initial mixed culture, which often contains several
distinct bacteria and even other micro-organisms. Isolating all of
these presents substantial economic barriers to more widespread
adoption of WGS as a routine tool since the cost and turnaround time
of the library preparation and DNA extraction steps, performed once
per isolated organism, are approaching those of sequencing
itself \citep{rossen2018practical}.
Whole community metagenomics, where DNA is extracted and
sequenced directly from the original environmental sample, presents a
potential cost-effective alternative to the isolate sequencing
approach. Unlike isolate sequencing, whole community metagenomics requires
only a single library preparation and DNA extraction step and no
cultivation steps since the sample is sequenced directly. However,
direct sequencing may require significantly higher sequencing depths
due to presence of host DNA \citep{pereira2019impact,
mcardle2020sensitivity} and contamination
\citep{mcardle2020sensitivity, salter2014reagent}. In addition,
low-biomass samples are challenging to analyse reproducibly with
metagenomics. Due to these factors, whole community metagenomics is difficult
to apply when only a subset of the diversity is of interest but the
planned analyses require high sequencing depths, which is
typical in genomic epidemiological studies.

Genomic epidemiology is, generally speaking, the study of the spread of
bacterial pathogens based on WGS data. Sequencing
the genomes of bacterial pathogens during an outbreak allows for
comparing accumulated mutations in their genomes
\citep{tang2017infection}, elucidating their short-term evolutionary
history and enabling case linking when combined with appropriate
metadata \citep{grad2014epidemiologic, hill2021progress}. Similarly,
long-term routine surveillance aids in hastening the detection of
outbreaks \citep{eyre2012pilot, gardy2018towards}, identifying
potential high-risk clones \citep{aanensen2016whole}, or reservoirs
for antimicrobial resistance \citep{weingarten2018genomic,
coipan2020genomic}. Many of these analyses require assembling the
genomes of the bacteria from the sequencing reads, which has led to
the dominance of the isolate sequencing approach and a lack of studies
attempting to solve the epidemiological problems with metagenomics.

When choosing between whole community metagenomics and isolate sequencing, a
middle ground can be found in plate sweep metagenomics
\citep{maklin_high-resolution_2021}, sometimes also
called limited-diversity metagenomics \citep{cocker_drivers_2022}. In
this approach the initial culture from a sample is swept and DNA
extracted from it is sequenced en masse rather than preparing several
isolates from it. Since selective culture media are available for most
clinically relevant bacteria \citep{lagier2015current}, plate sweep
metagenomics simultaneously both reduces the number of library
preparation and DNA extraction steps by using only a single culture,
and solves the host DNA overabundance and sequencing depth issues in
whole community metagenomics through selective media enrichment. Incorporating an
enrichment step has also been found to increase the sensitivity to
low-abundance organisms that might be missed in direct sequencing
\citep{whelan2020culture}.
While both plate sweep metagenomics and whole community metagenomics have
technically been possible for many years with the latter appearing
around 2004 \citep{tyson2004community, venter2004environmental}, the
development of computational methods has largely focused on analysing
sequencing reads from a single organism at a time. Although many
methods for analysis of metagenomic data at the level of identifying
strains (in this thesis a strain is the biological organism
corresponding to a single cell colony) or lineages (a collection of
strains that descend from the same strain and have maintained similar
genetic sequences) have been developed \citep{breitwieser2019review},
these do not typically perform well when applied to data containing
multiple lineages of the same species at once
\citep{sczyrba2017critical}, which will be referred to as
lineage-level variation in the rest of this thesis. Methods diverging from
the traditional relative abundance estimation (taxonomic profiling)
\citep{truong2017microbial} or metagenomic sequence read demixing
(taxonomic binning) \citep{van2022strainge} context do successfully
tackle lineage-level variation but do not easily translate to replacing
assembly-based analyses such as SNP calling or phylogenetic inference.
This thesis presents two computational methods that enable
taxonomic profiling and taxonomic binning from either whole community metagenomic
or plate sweep metagenomic short-read sequencing data. While the plate
sweep metagenomics approach was the focus of both methods during their
initial publication, further research has shown that they also perform reliably
when applied to whole community metagenomics data. Using either of the two
approaches to reduce the costs associated with data collection, the
methods presented here enable performing routine genomic
epidemiological analyses when significant lineage-level variation is
present in the collected sequencing data.
The first of the two methods, called mSWEEP, consists of a
probabilistic model for estimating the relative abundances of lineages
of a bacterial species in a set of sequencing reads
\citep{maklin_high-resolution_2021}. mSWEEP leverages pseudoalignment
\citep{bray2016near} of the reads against a set of reference sequences
that have been grouped together into lineages and outputs estimates of
the lineage-level abundances. The second method, mGEMS, processes the
output from mSWEEP to construct an assignment rule for assigning each
read to one or more bins corresponding to a reference lineage
\citep{maklin_bacterial_2021}. Both methods explicitly account for a
fundamental characteristic of sequencing data containing multiple
lineages of the same species: each read can, and often does, belong to
several of these lineages at the same time. The combination of mSWEEP and mGEMS enables
effective computational quantification of metagenomic data at a high
resolution within the species, and enables downstream processing of
mixed samples with results often comparable to using isolate data.
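
As a rough illustration of why a probabilistic treatment suits reads
that are compatible with several lineages, the setting can be sketched
as a generic finite mixture model. The notation below ($\theta_k$,
$z_n$, $\gamma_{n,k}$, $K$, $N$) is introduced here for illustration
only and is an assumption of this sketch, not necessarily the exact
formulation used by mSWEEP or mGEMS. Suppose there are $K$ reference
lineages with relative abundances $\theta_1, \dots, \theta_K$, where
$\theta_k \geq 0$ and $\sum_{k=1}^{K} \theta_k = 1$, and that each of
the $N$ reads $r_n$ originates from a single unobserved lineage
$z_n$. The likelihood of the reads is then
\begin{equation*}
  p(r_1, \dots, r_N \mid \theta) = \prod_{n=1}^{N} \sum_{k=1}^{K} \theta_k \, p(r_n \mid z_n = k),
\end{equation*}
and the posterior probability that read $r_n$ originates from lineage
$k$ is
\begin{equation*}
  \gamma_{n,k} = \frac{\theta_k \, p(r_n \mid z_n = k)}{\sum_{j=1}^{K} \theta_j \, p(r_n \mid z_n = j)}.
\end{equation*}
In such a sketch, profiling corresponds to estimating $\theta$, while
binning corresponds to turning the probabilities $\gamma_{n,k}$ into
read-to-lineage assignments. A read that is equally compatible with
two lineages, and incompatible with the rest, receives probabilities
proportional to the abundances of those two lineages, so the ambiguity
is retained instead of being resolved arbitrarily.
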
Even though both mSWEEP and mGEMS were originally designed with
applications in plate sweep metagenomics in mind, Publication III
\citep{maklin_strong_2022} demonstrates applicability of both methods
to whole community metagenomics data. The data analysed in Publication
III was collected from a cohort of UK neonates
\citep{shao2019stunted} and the samples were submitted for whole community metagenomics sequencing. Results from this data show strong
competition between bacterial species and strains during the initial
colonization of the newborn gut microbiome. More importantly from a
methods perspective, this analysis shows that mSWEEP and mGEMS provide
so far unprecedented levels of resolution in the analysis of
metagenomic sequencing data.

Together, Publications I--III represent foundational
methodological steps in both opening up high-resolution exploration of
bacterial diversity as well as making such analyses more accessible to
resource-constrained laboratories.
\section{Three approaches to sequencing bacterial DNA}
\label{three-approaches-to-metagenomics}
%- Something about sequeuncing reads and sequence assembly? 16S sequencing?
Preparing bacterial DNA for sequencing is often done after a culture
step that enriches the number of bacterial cells from a target group
of micro-organisms. Culturing is performed by plating a sample and
incubating it for a period of time that allows the bacteria to
multiply \citep{sanders2012aseptic}. After incubation, visible
colonies may be isolated and further propagated on their own plates
\citep{sanders2012aseptic}, or the entire plate may be prepared for DNA
extraction to produce plate sweep metagenomic data. Alternatively, in
whole community metagenomics, the whole culture procedure is skipped, and DNA is
extracted directly from the sample, with the extraction procedure
depending on the sample type \citep{bachmann2018advances}. When it
comes to the end-result \textemdash{ } the sequencing reads
\textemdash{ } all three approaches have their own characteristics
that affect the available downstream analyses.
Whole community metagenomics, where all or most of the DNA in a sample is
extracted (Figure \ref{fig:microbiome-sampling-methods}a), has emerged
as a tool for analysing the full breadth of variety in various
microbiomes \citep{shao2019stunted, ghensi2020strong,
bertrand2019hybrid, danko2021global, whelan2020culture}. Exploring
this diversity comes at a price, however, since the produced
sequencing reads are split across the numerous organisms possibly
present, resulting in a need to sequence the sample more deeply to
capture the less abundant organisms \citep{whelan2020culture,
vollmers2017comparing, quince2017shotgun}. Combined with other
issues related to host DNA abundance \citep{whelan2020culture, ivy2018direct, gu2019clinical}, the shortcomings of whole community metagenomics have
so far hindered its adoption in genomic epidemiology.
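
To make the depth requirement concrete, a standard back-of-the-envelope
estimate of expected coverage can be used. The symbols and numbers
below are illustrative assumptions and are not taken from any of the
cited studies. If a sample is sequenced to $N$ reads of length $L$, and
a fraction $\theta_k$ of the reads originates from an organism with
genome length $G_k$, the expected coverage of that organism is
approximately
\begin{equation*}
  c_k \approx \frac{\theta_k N L}{G_k}.
\end{equation*}
For example, with $N = 10^7$ reads of length $L = 150$, an organism
with a genome of $G_k = 5 \times 10^6$ bases that accounts for $1\%$ of
the reads has an expected coverage of
$c_k \approx 0.01 \times 10^7 \times 150 / (5 \times 10^6) = 3$,
whereas the same organism making up half of the reads would reach
$c_k \approx 150$.
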

Plate sweep metagenomics offers a middle ground between the direct
sequencing of whole community metagenomics and isolate studies by incorporating a
single culture step \citep{maklin_high-resolution_2021}. In this
approach, the sample is cultured on an appropriate selective medium
and the entire contents of the plate are subjected to DNA extraction
and sequencing after a suitable incubation period (Figure
\ref{fig:microbiome-sampling-methods}b). The inclusion of a culturing
step allows for generating large numbers of sequencing reads from the
bacteria that thrive in the chosen medium, circumventing both the
sequencing depth and host DNA issues in whole community metagenomics while
improving the sensitivity to bacteria found in low abundance in the
original sample \citep{whelan2020culture,
tonkin-hill_pneumococcal_2022, zhang2022using}. Furthermore, focusing the sequencing
efforts on the relevant bacteria enables application of bioinformatics
tools that require a high sequencing depth provided that the reads
from different organisms can be computationally separated. Developing
a tool to solve the aforementioned deconvolution problem is one of the
key contributions of this thesis.
\addtocounter{footnote}{-1}\let\thefootnote\svthefootnote
\begin{figure}[!t]
\centering
\includegraphics[width=\textwidth,keepaspectratio]{img/sampling/microbiome_sampling_methods.pdf}
\caption{Different approaches to sequencing bacterial DNA. Panel
\textbf{a)} depicts the whole community metagenomics approach, where
sequence data is produced directly from the sample. Panel
\textbf{b)} depicts the plate sweep metagenomics approach, where
the sample is plated on a selective medium and DNA extracted and
    sequenced from the whole plate after an incubation
    period. Panel \textbf{c)} depicts the whole-genome sequencing
    approach, where a subset of visible colonies on the incubated
    culture is picked, and DNA from them is prepared for
sequencing.}
\label{fig:microbiome-sampling-methods}
\end{figure}
In the third approach, whole-genome sequencing of isolates (Figure
\ref{fig:microbiome-sampling-methods}c), visible colonies from the
initial culture are picked and transferred to new plates. After
letting the transferred colonies grow, the resulting culture will
consist only of the descendants of the original colony, allowing for
massive numbers of sequencing reads to be generated from the isolated
organisms. Since visible colonies on the initial culture are typically
assumed to contain clones of the same organism, this approach
effectively gets rid of most of the variation found in both the sample
and the initial culture. While the whole-genome sequencing approach is
excellent for generating high-coverage and high-quality data from a
single bacterial strain, in practice the number of colonies that can
be isolated is often constrained by laboratory resources and nearly
always restricted to rapidly growing colony-forming phenotypes.
\vfill
\pagebreak
\noindent\let\thefootnote\relax\footnote{Figure \ref
{fig:microbiome-sampling-methods} source: Adapted from
\cite{miansari_fenugreek-sprouts} and \cite{niaid_escherichia-coli}.}
In genomic epidemiological analyses, the whole-genome sequencing
approach has been dominant due to its strengths in producing highly
accurate data suitable for SNP calling and for differentiating between
organisms \citep{efsa2019whole}. Producing the same results using
whole community or plate sweep metagenomics has some obvious benefits
in both increasing the number of samples that can be processed as well
as in capturing more of the diversity in the samples, but the existing
metagenome-analysis tools have not been able to reach the required
level of resolution \citep{sczyrba2017critical,
mcintyre2017comprehensive, maklin_bacterial_2021}. Here, the issue is
tackled through methodological advances that open up more
widespread use of metagenomic sequencing data in genomic epidemiology.
\pagebreak
\section{Analysing metagenomic sequence data}
Metagenomic sequencing data analysis presents several challenges to
bioinformaticians. Firstly, the increased diversity of species
requires much larger computational resources to analyse
\citep{yang2021review}. Secondly, the possible presence of
lineage-level variation complicates analyses that attempt to separate
the reads to distinct taxonomic units because the differences between
strains within a species can be minimal
\citep{meyer2022critical}. Consequently, the bulk of method
development has focused on operating with the assumption that only a
single strain from each species is present in the same sample
\citep{breitwieser2019review}. This section will briefly cover
some of the previous approaches and describe how mSWEEP and mGEMS fit
into the metagenomics toolkit.

Metagenome assemblers are among the more commonly used tools for
analysing whole community metagenomics data. Similarly to genome assemblers,
metagenome assemblers aim to produce a set of contigs, contiguous
sequences reconstructed from overlapping sequencing
reads. Since reads from metagenomic sequencing originate from several
organisms, metagenome assemblers are often paired with metagenome
binners that attempt to assign the metagenome-assembled contigs to
bins that correspond to a taxonomic unit. These units are typically
assumed distinct enough that they are not mistaken for sequencing
error or minor genetic variation. When the assumption is fulfilled,
metagenome assemblers produce contigs that are adequately accurate for
several types of analyses, and have consequently been adopted among
the standard tools in a variety of microbiome studies
\citep{bertrand2019hybrid, somerville2019long, stewart2019compendium}.
Metagenome binners are closely related to another type of analysis,
where the aim is to assign (relative) abundances to the taxonomic
units that were identified in the sequencing reads, called taxonomic
profiling. These two approaches sometimes go hand-in-hand since the
abundance of a taxonomic unit can naively be defined as the number of
reads that align to contigs which have been assigned to the unit. When
the units are closely related species or strains, more sophisticated
methods are necessary since both the reads and short contigs may
plausibly belong to several taxa, which has led to the development of
dedicated taxonomic profilers that do not attempt to bin the contigs
or the sequencing reads \citep{beghini2021integrating,
truong2017microbial, maklin_high-resolution_2021, van2022strainge}.
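
The naive definition above can be written out explicitly. The notation
is introduced here for illustration only. Let $c_k$ denote the number
of reads aligning to contigs assigned to taxonomic unit $k$, out of
$K$ units in total. The naive relative abundance estimate is then
\begin{equation*}
  \hat{\theta}_k = \frac{c_k}{\sum_{j=1}^{K} c_j}.
\end{equation*}
This estimate behaves as intended only when each read is counted
towards exactly one unit. If a read aligns equally well to contigs
from two closely related units, it must be discarded, counted twice,
or assigned arbitrarily, and each of these choices distorts
$\hat{\theta}_k$. This is the ambiguity that dedicated profilers are
designed to handle.
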
\vfill
\pagebreak
Recently, a third category of methods for tracking the
presence/absence of a specific strain across several samples has
emerged \citep{van2022strainge, truong2017microbial,
nayfach2016integrated}. These methods aim to infer similarity and
shared strains and provide an attractive tool for transmission
analysis when the genomes of the individual strains are not
required or cannot be assembled due to low sequencing depth. Ideally,
the tools for tracking strains would be combined with those for
extracting the contig or read bins, together enabling both a wide
analysis covering all samples and strains and a focused analysis of
the strains that are abundant enough to assemble their genomes.
All of the above can be further divided into reference-based methods
that leverage reference data in their analysis, and reference-free
methods that perform the analysis solely based on the sequencing
reads. While reference-free methods are able to handle data containing
previously unknown bacteria that do not have any available genome
assemblies, reference-based methods provide an easily interpretable
context for the results and typically reach a higher resolution
\citep{hiraoka2016metagenomics, thomas2012metagenomics}. If detailed
quantification is only required for some subset of organisms in the
sample, reference-based methods frequently also provide means to
filter out the reads belonging to the uninteresting organisms.
The lineage-level methods from this thesis take the reference-based
approach and specifically leverage bespoke reference
collections. Tailoring the collections to fit the presumed contents of
the samples allows both mSWEEP and mGEMS to perform at a resolution
that is mainly limited by the quality and variety of the available
assemblies. Unlike many existing reference-based methods, mSWEEP
and mGEMS do not attempt to identify the individual reference
sequences, but rather incorporate a clustering of the reference
assemblies into biologically interpretable lineages that reflect the
phylogeny. Since differences between lineages are generally
more pronounced than differences between individual sequences, the identification task becomes significantly easier
\citep{sankar2016bayesian} and extends to handling cases where the
sequencing reads originate from a previously unknown sequence which
nevertheless belongs to a known reference lineage. Trading an imprecise
identification of the exact sequence for a precise identification of
the lineage is especially useful in clinical settings, where the
diversity of potential pathogenic bacteria has been thoroughly studied
using whole-genome sequencing and the clinically relevant lineages are
often well-known.
\vfill
\section{Metagenomics data in genomic epidemiology}
Aside from metagenomics tools, the second significant aspect of this thesis
has to do with their application in genomic epidemiology, where an
important goal is to trace the transmission of pathogenic bacteria
using genomic sequence data. Genome-informed analyses have in recent
years greatly expanded the ability of researchers to investigate
outbreaks, identify epidemiologically relevant genetic elements, and
detect emerging public health concerns \citep{tang2017infection,
van2019status, grad2014epidemiologic, kwong2015whole}. Due to the
high level of accuracy required, such analyses have been performed
using isolate WGS data, which has a relatively high economic cost
and slow turnaround time \citep{rossen2018practical}, rendering the
research more reactive in nature. One of the goals of mSWEEP and mGEMS
is to enable partially replacing the use of isolate data with
metagenomics data, decreasing both the cost and turnaround time of the
existing genomic epidemiology pipelines.
In addition to improving both cost- and time-effectiveness,
incorporating metagenomics data into genomic epidemiology
increases the sensitivity to
genetic diversity that might be obscured by the use of isolate data in
routine surveillance. As an example, a recent study into within-host
diversity of the common respiratory pathogen \textit{Streptococcus
pneumoniae} utilized an analogue of plate sweep metagenomics and
found low-frequency co-colonization by lineages corresponding to
epidemic serotypes alongside lineages of known carriage serotypes
\citep{tonkin-hill_pneumococcal_2022}. This finding helped explain the
previously unknown source of the epidemic serotypes in outbreaks of
disease, which could not be fully explained by isolate sequencing
data. Since naturally occurring variation is common in many species of
clinical interest \citep{paterson2015capturing, zlitni2020strain,
dixit2018within, mosavie2019sampling, tonkin-hill_pneumococcal_2022},
similar findings in other fields are likely with more widespread use
of whole community and plate sweep metagenomics data.
Another related aspect in favour of using more metagenomics-oriented
approaches is practicality: sequencing several
organisms at once is easier than performing the several steps
required to isolate an organism for DNA extraction. Direct sequencing
of the samples combined with nearly equally accurate analyses should,
in principle, make implementing routine surveillance significantly
more accessible to locations and laboratories lacking in funding and
resources. This in turn combined with data sharing practices across
borders has the potential to vastly increase the capabilities of
proactive surveillance. Furthermore, sequencing the whole sample and
publicly archiving the reads has the benefit of preserving DNA from
the full variety of organisms in the sample and making it available
for future analyses with different goals from the original studies.
In conclusion, the field of genomic epidemiology that was established
with the emergence of rapid and scalable WGS in the early
2010s can be seen as entering a transformative period with both data
generation and more powerful computational tools becoming increasingly
available and accessible. The development of methods such as mSWEEP
and mGEMS will help accelerate this transformation and
enable entirely novel types of analyses and discoveries through the
inclusion of metagenomic data.
\section{Contributions}
This thesis comprises three publications covering mSWEEP
\citep{maklin_high-resolution_2021} (Publication I), mGEMS
\citep{maklin_bacterial_2021} (Publication II), and a third article
(Publication III) demonstrating their application to whole-genome
shotgun metagenomic sequencing data
\citep{maklin_strong_2022}. Publications I-II are accompanied by
software implementations \citep{maklin_mSWEEP,
maklin_mGEMS}. Publication III is more applied in nature, exploring
in more detail the types of analyses enabled by Publications I-II.
\subsection*{Publication I \textemdash{ } High-resolution sweep metagenomics using fast probabilistic inference}
By \underline{Tommi M\"aklin}, Teemu Kallonen, Sophia David, Christine J
Boinett, Ben Pascoe, Guillaume M\'eric, David M Aanensen, Edward J Feil,
Stephen Baker, Julian Parkhill, Samuel K Sheppard, Jukka Corander, and
Antti Honkela. Published in \textit{Wellcome Open Research} (2021),
5:14, doi: \href{https://doi.org/10.12688/wellcomeopenres.15639.2}{10.12688/wellcomeopenres.15639.2}.
Publication I \citep{maklin_high-resolution_2021} presented and
benchmarked the mSWEEP method for taxonomic profiling of sequencing
data containing multiple strains from the same bacterial species. The author
contributed to conceptualization of the study, formal analysis
and investigation of the data, developing the methodology and software
implementations, validation and visualization of the results, and
writing and editing both the original draft and the revised
manuscript.
Software implementation of the ideas presented in Publication I is available
from GitHub at
\href{https://github.com/PROBIC/mSWEEP}{https://github.com/PROBIC/mSWEEP}
(latest version). The latest version at the time of writing is
archived and available in Zenodo \citep{maklin_mSWEEP}.
\subsection*{Publication II \textemdash{ } Bacterial genomic epidemiology with mixed samples}
By \underline{Tommi M\"aklin}, Teemu Kallonen, Jarno Alanko, \O rjan
Samuelsen, Kristin Hegstad, Veli M\"akinen, Jukka Corander, Eva Heinz,
and Antti Honkela. Published in \textit{Microbial Genomics} (2021)
7:11, doi: \href{https://doi.org/10.1099/mgen.0.000691}{10.1099/mgen.0.000691}.
Publication II \citep{maklin_bacterial_2021} continued to build upon
mSWEEP by developing an algorithm for binning sequencing reads at the
lineage-level of mSWEEP analyses. This approach, and the accompanying software
implementation, are both called mGEMS. The author contributed to Publication II by
taking part in conceiving the study, developing the mGEMS pipeline,
designing both the synthetic and the \textit{in vitro} experiments,
developing the mGEMS assignment algorithm, running the experiments,
creating the visualizations, interpreting the results, and writing
and editing the main manuscript and the final published version.
Software implementation of the ideas presented in Publication II is
available from GitHub at
\href{https://github.com/PROBIC/mGEMS}{https://github.com/PROBIC/mGEMS} (latest version).
The latest version at the time of writing is archived and available in
Zenodo \citep{maklin_mGEMS}.
\vfill
\pagebreak
\subsection*{Publication III \textemdash{ } Strong pathogen competition in neonatal gut colonization}
By \underline{Tommi M\"aklin}, Harry A Thorpe, Anna K P\"ontinen, Rebecca
A Gladstone, Yan Shao, Maiju Pesonen, Alan McNally, P\aa l J Johnsen,
\O rjan Samuelsen, Trevor D Lawley, Antti Honkela, and Jukka
Corander. Submitted; preprint available from \textit{bioRxiv} (2022),
doi: \href{https://doi.org/10.1101/2022.06.19.496579}{10.1101/2022.06.19.496579}.
Publication III \citep{maklin_strong_2022} provides an example of
applying mSWEEP and mGEMS to whole community metagenomics
sequencing data and explores the dynamics of pathogen competition and
colonization in the gut microbiome of babies in their first three
weeks of life. The author contributed to Publication III by running the
mSWEEP/mGEMS pipeline on all data used in Publication III, updating the
reference databases for the investigated species, performing the
analysis of the mSWEEP/mGEMS results for the samples containing
\textit{E. coli}, and aiding the co-authors in analysing the other
species. Additional contributions included creating the visualisations,
interpreting the results, and writing the publication.
\section{Structure}
The rest of the thesis is structured into three chapters that describe
the contents of Publications I-III and how they contribute to the
topics presented in the introduction chapter. The first of the three
chapters (Chapter \ref{mixture-modelling-of-sequence-data}) describes
the basic ideas behind the mSWEEP and mGEMS methods and provides
historical context for the parts of the methods that have their
origins within analysis of RNA sequencing data. The second chapter
(Chapter \ref{high-resolution-metagenomics}) describes the
experimental results from Publications I-III in more detail, focusing
more on the applied part rather than the theoretical foundations. The
third chapter (Chapter \ref{section:metagenomic-epidemiology}) is more
speculative in nature, covering both the demonstrated applications
from Publications I-III as well as exploring potential future avenues
for use of the developed methods. The three chapters are followed by a
concluding chapter (Chapter \ref{conclusions-and-future-directions})
which in the physical copy of the thesis is further followed by
reprints of the three included original publications.
\chapter{Mixture modelling of sequence data}
\label{mixture-modelling-of-sequence-data}
Mixture models are a family of probabilistic models which model
sampling from an overall population as a mixture of sampling from
several distinct subpopulations. Each subpopulation is typically
assumed to have its own distribution, which can be from the same or a
different distribution family, and mixing parameters that determine
the percentages of data each subpopulation contributes to the overall
population.
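
As a generic illustration, with notation introduced only for this
example and not necessarily matching the notation used for the mSWEEP
model later in this chapter, a mixture model with $K$ subpopulations
assigns an observation $x$ the density
\[
  p(x \mid \theta) = \sum_{k=1}^{K} \theta_k \, p_k(x),
  \qquad \theta_k \geq 0, \quad \sum_{k=1}^{K} \theta_k = 1,
\]
where $p_k$ is the distribution of subpopulation $k$ and $\theta_k$ is
its mixing proportion. For instance, with $K = 2$,
$\theta = (0.7, 0.3)$, $p_1(x) = 0.2$ and $p_2(x) = 0.5$ for some
observed $x$, the mixture density is
$0.7 \cdot 0.2 + 0.3 \cdot 0.5 = 0.29$, and the probability that $x$
originated from the first subpopulation is $0.14 / 0.29 \approx 0.48$.
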
In sequencing data analysis, a key area of application for mixture
models has been in RNA-Seq, where identifying the expression levels
(relative contributions) of protein isoforms in some set of RNA
sequencing reads is one of the main problems
\citep{garber2011computational, wang2009rna}. Specifically, mixture
models are useful in cases where the sequencing reads do not uniquely
identify the isoform but could plausibly be the product of several
genes. This is in contrast to the microarray technology that preceded
RNA-Seq, where the technology itself allows for unique identification
of the expressed isoform and applications of probabilistic models
focused more on obtaining uncertainty estimates
\citep{rattray2006propagating, liu2007including}.
The ability of mixture models to differentiate between expression of
isoforms with similar nucleotide contents makes them ideal for
analysis of short-read sequencing data from bacterial strains. Since
the strains within a species typically share a large percentage of
their genome (although the exact values vary greatly by species
\citep{doolittle2006genomics, van2020diversity}), sequencing reads
from one will match with a large number of genomes from the same
species. Indeed, the models from RNA-Seq have been adapted almost
directly to identify bacterial strains from sequencing data
\citep{sankar2016bayesian} with great effect but not without some
caveats related to more general applicability across different
bacterial species. The work in this thesis extends the previous work
in the field \citep{sankar2016bayesian} by introducing a more general
formulation of the model that generalizes well to arbitrary bacterial
species and allows for assigning sequencing reads to the bacterial
strains in addition to identifying their relative contributions in a
set of sequencing reads.
\section{mSWEEP and mGEMS}
The mSWEEP method is a tool for estimating the relative abundances of
lineages of bacterial species in a set of sequencing reads. The method
consists of two parts: preparing and clustering a reference genome
assembly collection, and estimating the relative abundances using
pseudoalignment \citep{bray2016near} and probabilistic modelling. In
the preparation part, a reference collection consisting of genome
assemblies for some predefined set of bacterial species is constructed
and prepared for analysis by clustering the assemblies into
biologically sensible lineages. In the analysis part, short-read
sequencing data are pseudoaligned against the reference collection and
the alignments are used alongside the lineage clustering as the input to
the mSWEEP probabilistic model. Using the results of the pseudoalignment,
mSWEEP estimates the relative abundances of each lineage in the
reference collection using a mixture model and variational
inference. The outputs from mSWEEP are the relative abundances of the
lineages defined in the reference collection and a probability matrix
describing the fit of each sequencing read to each reference lineage.
Accompanying mSWEEP is the mGEMS pipeline, which is a method for
assigning each read in a sample to one or more of the reference
lineages, or to none if the read does not pseudoalign against any
reference sequence. mGEMS utilizes the relative abundance estimates and the
probability matrix from mSWEEP to assign the lineage membership of
each read. Importantly, mGEMS allows for multi-lineage membership,
since many reads can plausibly originate from several strains within
the same species.
Although both mSWEEP and mGEMS are novel methods that have been
published in Publications I-II, the roots of mSWEEP
in particular lie in the mixture modelling context from RNA-Seq and the
subsequent BIB method \citep{sankar2016bayesian}. These roots will be
examined in more detail in the next section, which explains how they
relate to the approach used in mSWEEP for bacterial data. The
differences between reads from bacteria and RNA-Seq necessitate some
changes to the probabilistic models employed in RNA-Seq, which
eventually produce the model used in mSWEEP.
\subsection{Relationship to RNA-Seq methods}
%% MCMC katz 2010
%% EM algorithm li 2010
%% maximum likelihood wang 2010
%% importance sampling jiang 2009
In RNA-Seq, mixture models were proposed as a solution to the isoform
expression level estimation problem around 2010 with several methods
appearing around the same time \citep{katz2010analysis, li2010rna,
jiang2009statistical, wang2010isoform}. In these methods, the model
is defined through latent indicator variables that denote the source
isoform for each sequencing read, and the parameters of interest (the
expression levels) are the proportions of reads assigned to each
isoform. The proportions are inferred using either a
likelihood function that assesses the fit of each read to the
reference isoforms based on sequence alignment
\citep{katz2010analysis, li2010rna}, or by assuming a Poisson
distribution on the numbers of reads that are compatible with each
reference \citep{jiang2009statistical, wang2010isoform}. Estimating
the parameters themselves was performed using a variety of algorithms
ranging from Markov chain Monte Carlo (MCMC) sampling
\citep{katz2010analysis} and importance sampling
\citep{jiang2009statistical} to maximum likelihood estimation
\citep{wang2010isoform} and the EM algorithm \citep{li2010rna}.
From the perspective of this thesis, a significant development of the
methods appeared in 2012 with the introduction of BitSeq
\citep{glaus2012identifying}. BitSeq extended the previous models by
being the first of the methods to perform Bayesian inference on the
relative isoform expression levels and by deriving update equations for a
collapsed Gibbs sampler to implement MCMC sampling
over the posterior distribution defined by the model. A further
development of BitSeq appeared in 2015 with the introduction of
BitSeqVB \citep{hensman2015fast}, where the sampling approach was
supplemented by a collapsed variational Bayes approach that is
significantly faster in fitting the model than the collapsed Gibbs
sampler in BitSeq.
\subsection{Applying RNA-Seq methods to data from bacteria}
Solving the RNA-Seq isoform expression estimation problem had an
unforeseen consequence: the mixture models used can be almost
directly applied to estimating the relative abundances of different
strains of bacteria in a set of DNA sequencing reads. In 2016, the
similarity of the two problems was noted in the BIB method
\citep{sankar2016bayesian}, which solved the analogous bacterial strain
relative abundance problem leveraging the work from BitSeq and
BitSeqVB. In BIB, the reference isoforms are simply replaced by the
genomes of reference bacterial strains, turning the expression level
estimates into relative abundances of these strains. However, because
the strains within a bacterial species are more alike
than the isoforms BitSeq was developed to handle, BIB incorporated a
step where the reference sequences were made more differentiable by
clustering them into lineages. Each lineage was represented in the
reference collection by a reference sequence randomly sampled from all
those belonging to the lineage, and the representative sequences were
further trimmed down to contain only the core genome of the
species. The core genome refers to genomic sequences that are shared
by all, or nearly all, members of a species. In some cases it may be
preferable to define the core genome for subunits within the species,
such as lineages, particularly if the species definition is not
based on, or does not conform to, the genetic sequences.
The relative abundance estimation method from this thesis, mSWEEP,
builds upon the work in BIB by using both the core genome and the
accessory genome (accessory meaning the genome contents that are not
contained in the core genome) of the reference sequence assemblies,
and by removing the need to select a representative sequence from each
lineage. Instead of selecting a representative sequence, mSWEEP uses
all available assemblies from each lineage as the reference sequences,
which gets around the problem of having to define an adequate sequence
to represent the whole lineage. Furthermore, using all available
sequences from each lineage provides better coverage of the variation
in the now-included accessory genomes and allows applying the method
to species that do not have as stable core genomes as
\textit{Staphylococcus aureus}, which was used as one of the example
organisms in BIB \citep{sankar2016bayesian}.
In order to make the alignment against a much larger reference
sequence collection feasible, mSWEEP additionally replaces
the location-based alignment used in BIB with pseudoalignment
\citep{bray2016near}, which reports only whether
the read aligns somewhere within a reference sequence (1) or not at
all (0), and massively speeds up the alignment part. This also allows
for simplifying the likelihood function used in the mixture model by
using just the pseudoalignment count within each lineage as the
observations. In this sense, the mixture model used in mSWEEP can be
seen as a descendant of both Poisson RNA-Seq models
\citep{jiang2009statistical, wang2010isoform}, which used the
alignment counts, and the models leveraging location-based information
about the alignments \citep{katz2010analysis, li2010rna,
glaus2012identifying} through its relation to BitSeq via
BIB. Pseudoalignment-based identification of the relative abundances
of reference sequences is also implemented in the metakallisto
method \citep{schaeffer2017pseudoalignment}, but the inclusion of the
probabilistic model from mSWEEP is necessary for high-resolution
accuracy, as demonstrated in Publication I.
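
The simplification of the observations can be sketched as follows,
using symbols that are assumptions made for this illustration rather
than the notation of Publication I. Let $a_{n,j} \in \{0, 1\}$ indicate
whether read $n$ pseudoaligns to reference sequence $j$, and let
$\mathcal{C}_k$ denote the set of reference sequences clustered into
lineage $k$. The observation for read $n$ and lineage $k$ is then the
count
\[
  c_{n,k} = \sum_{j \in \mathcal{C}_k} a_{n,j},
  \qquad 0 \leq c_{n,k} \leq |\mathcal{C}_k| .
\]
For example, if lineage $k$ contains three reference sequences and
read $n$ pseudoaligns to two of them, then $c_{n,k} = 2$. How these
counts enter the likelihood function is not specified by this sketch.
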
\subsection{Differences between RNA-Seq and bacterial data}
\label{section:bacterial-data}
Sequencing data and reference sequences from bacteria have some unique
characteristics that distinguish them from data originating from
humans or other more complex organisms. Mainly, the generation time
for bacterial organisms is much shorter, measured in hours or even
minutes depending on the environmental conditions (for example in the
lab or in the wild) and species \citep{gibson2018distribution}. This
has implications for analyses that incorporate the use of reference
data from previously sequenced organisms. First, major changes in the
genomic contents happen within human-observable time-frames and are
reflected in sequence data obtained from what is assumed to be the
same strain, although the rate at which these changes accumulate is
highly variable
\citep{gibson2019investigating}. Second, bacterial genomes can
undergo major horizontal gene transfer events even across large
evolutionary distances, resulting in major genomic differences
\citep{arnold2022horizontal}. Together these factors imply that
reference sequences for any set of bacteria are almost certainly at
least somewhat different from what would be obtained from sequencing
descendants of the organisms corresponding to the original reference
sequence.
As noted in the BIB publication, the problems introduced by quicker
evolution of the bacterial organisms can be solved by replacing the
individual reference sequences with lineages within a bacterial
species as the unit for relative abundance estimation
\citep{sankar2016bayesian}. When estimation is performed for suitably
clustered sequences, the problem becomes significantly easier since
the short-term genetic variation, at least presumably, is more
contained within the lineages, provided that they are biologically
sensible. Since mSWEEP allows representing the lineages through
several reference sequences, they become easier to distinguish because
the differences between lineages are larger than the differences
between strains within the same lineage which could potentially
coexist in a sample. However, selecting the clustering algorithm for
identifying the lineages needs to be performed carefully, since the
estimation will be reliant on the signals that are contained within
the lineages.
\section{Significance of reference databases}
In addition to the lineage definitions, the reference collection of
available genome assemblies for one or several bacterial species of
interest lies at the very core of mSWEEP. Since the method
estimates the relative abundances of the lineages based on information
provided by pseudoalignment of the reads against the reference
sequences, the accuracy of the results is naturally constrained by the
quality of the reference collection. The included sequences can be
tailored to the problem at hand since mSWEEP does not place
constraints on the kind of assemblies that are used. This bespoke
approach to reference building is particularly useful when isolate
sequencing data is available from the same or closely related
organisms assumed to be present in the analysed sequencing reads.
Because constructing the reference collection requires significant
user effort, many metagenomics methods rely
on prebuilt references covering multiple species that have a high
availability of sequence assemblies. For mSWEEP, supplying similar
prebuilt references for a wide variety of species is not currently
feasible due to the computational requirements of the pseudoalignment
step, which is why study- or species-specific collections are used
instead. Nevertheless, the databases used in Publications I-III are
included with the publications and can be reused in future
analyses. Extending them with further isolate data is nonetheless
highly encouraged.
Another significant step in building the reference collection is
deciding on the desired level of detail in the lineage
definitions. While the fit of the reference sequences to the
sequencing reads ultimately determines the relative abundance
estimates, adjusting the level of detail of the lineage definitions
can enable identification in cases where the reference
sequences are not exactly from a comparable source to the sample
reads. Publications I-III employ several different approaches to the
lineage definitions, making use of multilocus sequence typing (MLST)
\citep{enright1999multilocus} and several clustering algorithms
\citep{lees2019fast, cheng2013hierarchical, corander2006bayesian}.
\subsection{Clustering bacterial sequences}
Various methods for clustering bacterial genomes have been
developed. One of the most commonly used of these is MLST, where
sequence clusters are defined based on variation observed at
housekeeping loci \citep{enright1999multilocus}, with each combination
of alleles at these loci corresponding to a unique sequence type. For many biological
applications, the sequence types defined by MLST correspond to
taxonomic units that have observable differences in phenotypes such as
antimicrobial resistance \citep{kallonen2017systematic,
shaik2017comparative}. This has subsequently led to widespread
adoption of the method among microbiologists. The
downside of MLST is that it only offers a limited resolution by
considering a small fraction of the variation present in a genome.
The PopPUNK method \citep{lees2019fast} provides an alternative to
MLST that uses nucleotide distances and a Gaussian
mixture model or DBSCAN \citep{ester1996density} to define the
lineages. In practice, the lineages that PopPUNK identifies typically
correspond to clonal complexes \citep{lees2019fast} which are sequence
clusters containing the sequences assigned to a central multilocus
sequence type (ST) and closely related single or double locus variants
of the central ST. The main advantage of using the clonal complex
analogues provided by PopPUNK is the ability to assign arbitrary
reference sequences to lineages while mostly conforming to the MLST
complexes. Additionally, using PopPUNK allows for including the
accessory genome in defining the lineages if desired, making PopPUNK
an ideal choice for defining the reference lineages for mSWEEP.
\subsection{Sequence alignment}
\label{section:sequence-alignment}
The reference collection is used in mSWEEP as the target for
pseudoalignment. Contrary to the location-based alignment method
employed in BIB \citep{sankar2016bayesian}, pseudoalignment only
reports a 0 when the read being pseudoaligned does not align anywhere
within a reference sequence, and a 1 when the read does align
somewhere. In the original mSWEEP publication, the kallisto method
\citep{bray2016near}, which introduced the pseudoalignment concept,
was used to pseudoalign the reads, but Publication II introduced a
more scalable method, called Themisto, that replaces kallisto in the
pipeline. In addition to its scalability, Themisto also provides an
exact version of the kallisto pseudoalignment algorithm
\citep{maklin_bacterial_2021}.
Pseudoaligning the reads has the advantage of being much quicker to
compute, enabling more extensive reference collections to be employed
by mSWEEP. Although the information loss from
binarizing the alignments does require some adjustments to the
likelihood function in mSWEEP when compared to BIB, the added
reference coverage more than makes up for any potential losses in
accuracy. The next section will cover this mixture model formulation
and the changes introduced by mSWEEP in more detail, as well as the
theory behind the mGEMS algorithm for assigning the sequencing reads
themselves to bins corresponding to the reference lineages.
\section{A probabilistic model for sequences from mixed sources}
\label{section:model}