forked from gtonkinhill/panstripe
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
282 lines (184 loc) · 12.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "panstripe"
output:
github_document:
includes:
in_header: inst/vignette-supp/header.md
toc: true
toc_depth: 3
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "inst/vignette-supp/",
echo=TRUE,
warning=FALSE,
message=FALSE,
tidy=TRUE
)
```
## Installation
`panstripe` is currently available on GitHub. It can be installed with `devtools`
```{r, eval=FALSE}
install.packages('remotes')
remotes::install_github("gtonkinhill/panstripe")
```
If you would like to also build the vignette with your installation run:
```{r, eval=FALSE}
remotes::install_github("gtonkinhill/panstripe", build_vignettes = TRUE)
```
## Quick Start
Panstripe takes as input a phylogeny and gene presence/absence matrix in Rtab format as is produced by [Panaroo](https://gtonkinhill.github.io/panaroo/#/), [Roary](https://github.com/sanger-pathogens/Roary), [PIRATE](https://github.com/SionBayliss/PIRATE) and other prokaryotic pangenome tools.
```{r, warning=FALSE, message=FALSE}
library(panstripe)
library(ape)
library(patchwork)
set.seed(1234)
###
# NOTE: here we load example files from the panstripe package. You should replace these variables with the relevant paths to the files you are using.
phylo.file.name <- system.file("extdata", "tree.newick", package = "panstripe")
rtab.file.name <- system.file("extdata", "gene_presence_absence.Rtab", package = "panstripe")
###
# Load files
pa <- read_rtab(rtab.file.name)
tree <- read.tree(phylo.file.name)
# Run panstripe
fit <- panstripe(pa, tree)
fit$summary
# Plot results
plot_pangenome_params(fit)
plot_pangenome_cumulative(fit)
```
A significant p-value for the `tip` term indicates that there is a different rate of gene exchange at the tips of the phylogeny compared with the internal branches. This is usually driven by annotation errors or highly mobile elements that do not persist long enough to be observed in multiple genomes.
A significant p-value for the `core` term indicates that there is a significant association between the core genome branch length and the number of gene exchange events.
The `depth` term is less interesting but indicates when there is a difference in our ability to detect older gene exchange events.
## Citation
To cite panstripe please use: Tonkin-Hill, G. et al. Robust analysis of prokaryotic pangenome gene gain and loss rates with Panstripe. bioRxiv (2022) doi:10.1101/2022.04.23.489244
## Comparing Pangenomes
As `panstripe` uses a GLM framework it is straightforward to compare the slope terms between datasets. The `compare_pangenomes` function considers the interaction term between the data sets and the `core`, `tip` and `depth` terms.
> **IMPORTANT:** *It is important that the phylogenies for the pangenomes being compared are on the same scale. This can be achieved most easily by using time scaled phylogenies. Alternatively, it is possible to build a single phylogeny and partition it into clades or use SNP scaled phylogenies.*
Here, we simulate two pangenomes with different gene exhange rates, fit the model using `panstripe` and run the `compare_pangenomes` function.
```{r}
# Simulate a fast gene gain/loss rate with error
sim_fast <- simulate_pan(rate = 1e-3, ngenomes = 200)
sim_slow <- simulate_pan(rate = 5e-4, ngenomes = 200)
# Run panstripe
fit_fast <- panstripe(sim_fast$pa, sim_fast$tree)
fit_slow <- panstripe(sim_slow$pa, sim_slow$tree)
# Compare the fits
result <- compare_pangenomes(fit_fast, fit_slow)
result$summary
```
A significant p-value for the `tip` term indicates that the two pangenomes differ in the rates of gene presence and absence assigned to the tips of the phylogeny. This is usually driven by either differences in the annotation error rates between data sets or differences in the gain and loss of highly mobile elements that do not persist long enough to be observed in multiple genomes.
A significant p-value for the `core` term indicates that two data sets have different rates of gene gain and loss.
The `depth` term is less interesting but indicates when there is a difference in our ability to detect older gene exchange events between the two pangenomes.
The `dispersion` parameter indicates if there is a significant difference in the dispersion of the two pangenomes. This suggests that the relationship between the rate of gene exchange and the size of each event differs in the two pangenomes. Here, the p-value is obtained using a Likelihood Ratio Test.
```{r}
## Plot the results
plot_pangenome_params(list(fast=fit_fast, slow=fit_slow), legend = FALSE) +
plot_pangenome_cumulative(list(fast=fit_fast, slow=fit_slow)) +
plot_layout(nrow = 1)
```
## Open vs Closed
The definition of what constitutes an open or closed pangenome is somewhat ambiguous. We prefer to consider whether there is evidence for a temporal signal in the pattern of gene gain and loss. This prevents annotation errors leading to misleading results.
After fitting a `panstripe` model the significance of the temporal signal can be assessed by considering the coefficient and p-value of the 'core' term in the model. The uncertainty of this estimate can also be investigated by looking at the bootstrap confidence intervals of the core term.
Let's simulate a 'closed' pangenome
```{r}
sim_closed <- simulate_pan(rate = 0, ngenomes = 100)
fit_closed <- panstripe(sim_closed$pa, sim_closed$tree)
fit_closed$summary
```
The p-value indicates that there is not a significant association between core branch lengths and gene gain/loss. This is typical of species that undergo very little to no recombination such as *M. tuberculosis*.
## Rate vs Size
While comparing the `core` parameter of the model identifies differences in the association between branch lengths and gene gain and loss it does not indicate whether this is driven by higher rates of recombination or simply larger recombination events involving more genes.
The `panstripe` model allows these two scenarios to be investigated by allowing the dispersion parameter in the model to be different for each pangenome.
Here, we simulate two data sets with the same recombination rate but where each recombination event differs in the number of genes involved.
```{r}
sim_large <- simulate_pan(rate = 1e-3, ngenomes = 100, mean_trans_size = 4)
sim_small <- simulate_pan(rate = 1e-3, ngenomes = 100, mean_trans_size = 3)
fit_large <- panstripe(sim_large$pa, sim_large$tree)
fit_small <- panstripe(sim_small$pa, sim_small$tree)
# Compare the fits
result <- compare_pangenomes(fit_large, fit_small)
result$summary
```
## Output
The `panstripe` function generates a list with the following attributes
#### summary
A table indicating a subset of the inferred parameters of the GLM. The p-values and bootstrap confidence intervals can be used to determine whether each term in the model is significantly associated with gene gain and loss.
- **core** indicates whether the branch lengths in the phylogeny are associated with gene gain and loss.
- **tip** indicates associations with genes observed to occur on the tips of the phylogeny. These are usually driven by a combination of annotation errors and depending upon the temporal sampling density also highly mobile elements that are not observed in multiple genomes.
- **depth** indicates whether the rate of gene gain and loss changes significantly with the depth of a branch. Typically, our ability to detect gene exchange events reduces for older ancestral branches.
- **p** the inferred index parameter of the underlying Tweedie distribution used in the GLM
- **phi** the inferred dispersion parameter of the GLM
#### model
The output of fitting the GLM model. This object can be used to predict how many gene gains and losses we would expect to observe given the parameters of a particular branch.
#### data
A table with the data used to fit the model. The `acc` column indicates the inferred number of gene gain/loss events for a branch; the `core` column is the branch length taken from the phylogeny; the `istip` column indicates whether the branch occurs at the tip of the phylogeny and the `depth` column indicates the distance from the root node to the branch.
#### ci_samples
A 'boot' object, generated by the [boot](https://cran.r-project.org/web/packages/boot/boot.pdf) package. Can be used to investigate the uncertainty in the parameter estimates.
#### tree/pa
The original data provided to the `panstripe` function.
## Alternative models
The default Panstripe model assumes a Compound Poisson (Tweedie) distribution as implemented in the [Tweedie](https://cran.r-project.org/web/packages/tweedie/) R package. In some cases there may be insufficient data or the data may not reliably fit the Tweedie distribution. This is often the case when there is nearly no gene exchange events inferred to have occurred at the internal branches of the phylogeny.
To help account for these issues and to add flexibility to the package it is also possible to fit alternative distributions in place of the Tweedie model. The most common alternative that we suggest is to use a Gaussian distribution. This is usually more robust and is less likely to run into convergence issues in the model fit. In our tests both of these distributions generally give very similar results.
Panstripe can be run using an alternative distribution as
```{r}
fit_gaussian <- panstripe(pa, tree, family='gaussian')
```
## Plots
Panstripe includes a number of useful plotting functions to help with interpretation of the output of [panaroo](https://gtonkinhill.github.io/panaroo/#/).
### Pangenome fit
The inferred parameters of the Panstripe model fit can be plotted as
```{r}
plot_pangenome_params(fit_fast)
```
We also suggest plotting the cumulative number of gene gain and loss events against the core branch length. As Panstripe models each branch individually and accounts for the depth of the branch it is not possible to add the Panstripe model fit to this plot.
```{r}
plot_pangenome_cumulative(fit_fast)
```
These two plots can easily be combined using the patchwork R package as
```{r}
plot_pangenome_params(fit_fast) +
plot_pangenome_cumulative(fit_fast) +
plot_layout(nrow = 1)
```
The functions can also take a named list as input allowing for easy comparisons between data sets.
```{r}
plot_pangenome_params(list(fast=fit_fast, slow=fit_slow), legend = FALSE) +
plot_pangenome_cumulative(list(fast=fit_fast, slow=fit_slow)) +
plot_layout(nrow = 1)
```
A plot of the residuals of the regression can also be generated using
```{r}
plot_residuals(fit_fast)
```
### Tree with presence/absence
A plot of the phylogeny and the corresponding gene presence/absence matrix can be made using the `plot_tree_pa` function. This function also takes an optional vector of gene names to include in the plot.
```{r}
#look at only those genes that vary
variable_genes <- colnames(pa)[apply(pa, 2, sd)>0]
plot_tree_pa(tree = tree, pa = pa, genes = variable_genes, label_genes = FALSE, cols = 'black')
```
### Inferred ancestral states
The `plot_gain_loss` function allows for the visualisation of the fitted gene gain and loss events on the given phylogeny. The enrichment for events at the tips of a tree is often driven by a combination of highly mobile elements and annotation errors.
```{r}
plot_gain_loss(fit)
```
### tSNE
The tSNE dimension reduction technique can be used to investigate evidence for clusters within the pangenome.
```{r}
plot_tsne(pa)
```
The [Mandrake](https://github.com/johnlees/mandrake) method can also be used as an alternative to tSNE.
### Accumulation curves
While we do not recommend the use of accumulation curves as they do not account for population structure, sampling bias or annotation errors we have included a function to plot them to make it easier for users to compare methods.
```{r}
plot_acc(list(fast=sim_fast$pa, slow=sim_slow$pa))
```
### Etymology
The name panstripe is an adaptation of 'pinstripe', the name of a [long-nosed potoroo](https://en.wikipedia.org/wiki/Long-nosed_potoroo) who was a villain in the playstation game [Crash Bandicoot](https://crashbandicoot.fandom.com/wiki/Pinstripe_Potoroo). The name was chosen as the program relies on the output of panaroo (named after the potoroo).