---
bibliography: references.bib
biblio-style: apalike
highlight_bw: yes
output:
  bookdown::pdf_document2:
    toc: false
    includes:
      in_header: header.tex
    dev: "cairo_pdf"
    latex_engine: xelatex
    fig_caption: yes
geometry: margin=0.45in
link-citations: yes
---

**Supplemental File A of**

\begingroup\Large
**MicrobiotaProcess: A comprehensive R package for deep mining microbiome**
\endgroup

**Shuangbin Xu, Li Zhan, Wenli Tang, Qianwen Wang, Zehan Dai, Lang Zhou, Tingze Feng, Meijun Chen, Tianzhi Wu, Erqiang Hu and Guangchuang Yu\* **

^\*^correspondence: Guangchuang Yu \<gcyu1@smu.edu.cn\>

\renewcommand{\figurename}{Fig.}
\newcommand{\beginsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{SA.\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{SA.\arabic{figure}}%
}

\beginsupplement

```{r, echo=FALSE, message=FALSE, results='hide'}
require(kableExtra)
options(knitr.table.format = "latex")
knitr::opts_chunk$set(fig.pos= "!ht")
#knitr::opts_chunk$set(out.width="90%")
knitr::opts_chunk$set(fig.align="center", warning = FALSE, message = FALSE)
usepackage_latex("float")
usepackage_latex("makecell")
usepackage_latex("booktabs")

Biocpkg <- function (pkg){
    sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}

CRANpkg <- function(pkg){
    cran <- "https://CRAN.R-project.org/package"
    fmt <- "[%s](%s=%s)"
    sprintf(fmt, pkg, cran, pkg)
}

knitr::knit_hooks$set(crop = knitr::hook_pdfcrop)
knitr::opts_chunk$set(crop = TRUE)
```

```{r setup, echo = FALSE, results = 'hide', message = FALSE}
suppressPackageStartupMessages({
  library(SummarizedExperiment) # SummarizedExperiment container, a container contains one or more assays.
  library(MicrobiotaProcess) # an R tidy framework for microbiome or other related ecology data analysis.
  library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics.
  library(coin) # Conditional Inference Procedures in a Permutation Test Framework.
  library(ggnewscale) # Multiple Fill and Colour Scales in 'ggplot2'.
  library(forcats) # Helpers for reordering factor levels and tools for modifying factor levels.
  library(ggtree) # visualizing phylogenetic tree and heterogeneous associated data based on grammar of graphics.
  library(ggtreeExtra) # plot associated data on the external layer based on grammar of graphics.
  library(clusterProfiler)
  library(enrichplot)
  library(MicrobiomeProfiler)
  library(curatedMetagenomicData)
  library(randomForest)
})
```

# Installation 

To install the `MicrobiotaProcess` package, enter the following commands in R:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
   install.packages("BiocManager")
BiocManager::install("MicrobiotaProcess")
```

To reproduce the analyses in this document, several additional packages also need to be installed.

```r
cranpkgs <- c("aplot", "ggpp", "igraph",  
              "broom", "forcats", 'pROC',
              "ggrepel", "ggVennDiagram", 
              "patchwork", "shadowtext",
              "ggupset", "ggnewscale", 
              "GUniFrac", "matrixStats")

for (i in cranpkgs){
    if (!requireNamespace(i, quietly = TRUE)){
        install.packages(i)
    }
}

Biocpkgs <- c("SummarizedExperiment", "clusterProfiler", 
              "edgeR", "enrichplot", "tidybulk", "curatedMetagenomicData",
              "ggtree", "ggtreeExtra", "MicrobiomeProfiler")

for (i in Biocpkgs){
    if (!requireNamespace(i, quietly = TRUE)){
        BiocManager::install(i)
    }
}
```

# Analysis of the 16S rDNA dataset of 43 pediatric CD stool samples from iHMP {#chapter2}

Here, we re-analyzed the 16S rDNA dataset of 43 pediatric IBD stool samples, which was obtained from the Integrative Human Microbiome Project Consortium (iHMP) [@iHMP]. 

## Importing the output of dada2 {#chapter2.1}

The datasets were downloaded from the web^[<https://www.microbiomeanalyst.ca/MicrobiomeAnalyst/resources/data/ibd_data.zip>]. They contain `ibd_asv_table.txt` (the feature table, with features in rows and samples in columns), `ibd_meta.csv` (the sample metadata), and `ibd_taxa.txt` (the taxonomic annotation of the features). In this section, we use *mp_import_dada2* of *MicrobiotaProcess* to import the dataset, which returns an *MPSE* object.

```{r, message=FALSE, warning=FALSE, fold.plot=FALSE, fold.output=FALSE, LoadData}
library(MicrobiotaProcess)
otuda <- read.table("./data/IBD_data/ibd_asv_table.txt", header=T, 
                    check.names=F, comment.char="", row.names=1, sep="\t")
# building the output format of removeBimeraDenovo of dada2
otuda <- data.frame(t(otuda), check.names=F)
sampleda <- read.csv("./data/IBD_data/ibd_meta.csv", row.names=1, comment.char="")
taxda <- read.table("./data/IBD_data/ibd_taxa.txt", header=T, 
                    row.names=1, check.names=F, comment.char="")
# the feature names should be the same with rownames of taxda. 
taxda <- taxda[match(colnames(otuda), rownames(taxda)),]
ref.tree <- treeio::read.tree('./data/IBD_data/ibd_repseq.tree')
mpse <- mp_import_dada2(seqtab = otuda, taxatab = taxda, sampleda = sampleda)
# View the read depth of the samples and the prevalence of the OTUs, for example:
# mpse %>% mp_extract_assay(.abundance=Abundance) %>% rowSums() %>% sort %>% head(100)
# mpse %>% mp_extract_assay(.abundance=Abundance) %>% colSums() %>% sort %>% head()
# Or
# head(sort(rowSums(assay(mpse, "Abundance"))), 100)
# head(sort(colSums(assay(mpse, "Abundance"))))
# In this example, some OTUs have a very low frequency across the samples,
# and some taxonomic assignments are implausible; for example, chloroplasts
# are unlikely to be present in the intestine. Such features can be removed.
mpse2 <- mpse %>% 
         dplyr::filter(!Phylum %in% c("p__un_k__Bacteria", "p__Chloroflexi") & 
                       !Class %in% "c__Chloroplast" & 
                       !Family %in% "f__mitochondria" 
         ) %>% 
         mp_filter_taxa(.abundance = Abundance, min.abun = 1, min.prop = 0.1)
otutree(mpse2) <- ref.tree
mpse2
```
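
As a rough illustration of the prevalence filter applied in the chunk above, the following base R sketch keeps a feature only when it reaches a minimum abundance in a minimum fraction of samples. The toy matrix and the helper `keep_prevalent` are hypothetical, and the exact comparison rules used internally by *mp_filter_taxa* are an assumption here, not a description of its implementation.

```r
# Toy count matrix: rows are features, columns are samples (hypothetical data).
counts <- matrix(
  c(10, 0, 3, 7,
     0, 0, 0, 1,
     5, 2, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), paste0("S", 1:4))
)

# Keep features whose abundance is at least `min.abun` in at least
# `min.prop` of the samples (assumed semantics, for illustration only).
keep_prevalent <- function(x, min.abun = 1, min.prop = 0.1) {
  prevalence <- rowMeans(x >= min.abun)  # fraction of samples with the feature
  x[prevalence >= min.prop, , drop = FALSE]
}

filtered <- keep_prevalent(counts, min.abun = 1, min.prop = 0.5)
rownames(filtered)
```

In this toy table, the feature detected in only one of the four samples is dropped when at least half of the samples are required.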

```{r, echo = FALSE}
if (!dir.exists(file.path("./TmpRef"))){
    dir.create("./TmpRef")
}
saveRDS(mpse, file="./TmpRef/mpse.rds")
```

## Other import functions {#chapter2.2}

*MicrobiotaProcess* also provides other functions (Table \ref{tab:importFuns}) to parse the output of upstream pipelines. In addition, some common R objects can be converted to an *MPSE* object, such as *phyloseq* [@phyloseq], *SummarizedExperiment* [@SE], *TreeSummarizedExperiment* [@TSE] and *biom* [@biom] (the output of *read_biom* from *biomformat*); see Section \ref{chapter3.1}. 

```{r, echo=FALSE, results='asis', importFuns}

x <- "MicrobiotaProcess\tmp_import_qiime2\tImport function to load the output of qiime2\n
MicrobiotaProcess\tmp_import_qiime\tImport function to read the now legacy-format QIIME OTU table (tsv format)\n
MicrobiotaProcess\tmp_import_metaphlan\tImport function to read the output of MetaPhlAn\n
"

xx <- base::strsplit(x, "\n\n")[[1]]
y <- base::strsplit(xx, "\t") %>% do.call("rbind", .)
y <- as.data.frame(y)
colnames(y) <- c("Package", "Import Function", "Description")

suppressPackageStartupMessages(require(kableExtra))
caption = "List of import functions provided by MicrobiotaProcess"
knitr::kable(y, caption = caption, booktabs = T) %>%
  collapse_rows(columns = 1, latex_hline = "major", valign ="middle") %>%
  kable_styling(latex_options = c("striped", "hold_position")) #%>% landscape
```

## Alpha diversity analysis {#chapter2.3}

### Rarefaction visualization {#chapter2.3.1}

Rarefaction is a random subsampling technique used to compensate for the effect of sample size (sequencing depth) on the number of units observed in a sample. *MicrobiotaProcess* provides *mp_cal_rarecurve* and *mp_plot_rarecurve* to calculate and plot the rarefaction curves.
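
To make the idea concrete, the sketch below computes a small rarefaction curve in base R: reads are repeatedly subsampled without replacement at increasing depths, and the mean number of distinct features is recorded. The counts and the helper `observed_at_depth` are hypothetical; this is a minimal illustration of the principle, not the procedure implemented in *mp_cal_rarecurve*.

```r
set.seed(1)  # subsampling is random; fix the seed for reproducibility

# Hypothetical counts of five features in one sample (100 reads in total).
sample_counts <- c(ASV1 = 50, ASV2 = 30, ASV3 = 15, ASV4 = 4, ASV5 = 1)

# Expand the counts to one entry per read.
reads <- rep(names(sample_counts), times = sample_counts)

# Mean number of distinct features seen in `n_rep` random subsamples
# of `depth` reads drawn without replacement.
observed_at_depth <- function(reads, depth, n_rep = 20) {
  mean(replicate(n_rep, length(unique(sample(reads, size = depth)))))
}

depths <- c(5, 10, 25, 50, 100)
rare_curve <- data.frame(
  depth   = depths,
  observe = vapply(depths, function(d) observed_at_depth(reads, d), numeric(1))
)
rare_curve
```

At the full depth every read is drawn, so the curve ends at the total richness of the sample; a curve that flattens before that point suggests that deeper sequencing would reveal few additional features.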

(ref:rarecurve1) **This example shows *mp_cal_rarecurve* and *mp_plot_rarecurve* provided by *MicrobiotaProcess* to calculate and visualize the rarefaction curve.** The horizontal coordinate represents the sequencing depth of the samples, and the vertical coordinate shows the alpha diversity index (such as Observe, Chao1 and ACE). *mp_plot_rarecurve* provides three types of visualization: (A) the rarefaction curve for each sample; (B) the rarefaction curve for each sample colored by group (by specifying the *.group* argument of *mp_plot_rarecurve*); (C) the rarefaction curve for each group with the standard error of the mean (by specifying the *.group* argument and *plot.group=TRUE* in *mp_plot_rarecurve*).

```{r class.source = 'fold-show', fold.output=FALSE, fold.plot=FALSE, fig.width=7, fig.height=6, fig.align="center",warning=FALSE, message=FALSE, fig.cap='(ref:rarecurve1)', rarecurve1}
library(MicrobiotaProcess)
library(patchwork)
cols <- c('#fcc751ff', '#00c7bfff')
mpse2 %<>% 
    mp_rrarefy(.abundance=Abundance) %>%
    mp_cal_rarecurve(.abundance=RareAbundance, chunks=500)

p_rare <- mpse2 %>%
          mp_plot_rarecurve(
            .rare = RareAbundanceRarecurve, 
            .alpha = c(Observe, Chao1, ACE),
          ) +
          theme(
            legend.key.width = unit(0.3, "cm"),
            legend.key.height = unit(0.3, "cm"),
            legend.spacing.y = unit(0.01,"cm"),
            legend.text = element_text(size=4)
          )

prare1 <- mpse2 %>%
          mp_plot_rarecurve(
            .rare = RareAbundanceRarecurve, 
            .alpha = c(Observe, Chao1, ACE),
            .group = Group
          ) +
          scale_fill_manual(values = cols)+
          scale_color_manual(values = cols)+
          theme_bw()+
          theme(
            axis.text=element_text(size=8), panel.grid=element_blank(),
            strip.background = element_rect(colour=NA,fill="grey"),
            strip.text.x = element_text(face="bold")
          ) 

prare2 <- mpse2 %>%
          mp_plot_rarecurve(
            .rare = RareAbundanceRarecurve,
            .alpha = c(Observe, Chao1, ACE),
            .group = Group,
            plot.group = TRUE
          ) +
          scale_color_manual(values = cols)+
          scale_fill_manual(values = cols) +
          theme_bw()+
          theme(
            axis.text=element_text(size=8), panel.grid=element_blank(),
            strip.background = element_rect(colour=NA,fill="grey"),
            strip.text.x = element_text(face="bold")
          )
(p_rare / prare1 / prare2) + patchwork::plot_annotation(tag_levels="A")
```

\newpage

```{r echo = FALSE}
prare1 <- prare1 + 
          theme(
            legend.position=c(0.1, 0.8), 
            legend.key.width = unit(.3, "cm"), 
            legend.key.height = unit(.3, 'cm'), 
            legend.text = element_text(size = 6)
          )
ggsave(prare1, file = "./Figures/rarecurve.svg", device="svg", width=7, height=3)
```

<!-- Since the curves in each sample were near saturation, the sequencing data were great enough with very few new species undetected -->

### Calculation and differential analysis of alpha diversity {#chapter2.3.2}

Alpha diversity measures the richness and evenness of microbial communities. *MicrobiotaProcess* provides *mp_cal_alpha* to calculate alpha diversity indices. Six common diversity measures (*Observe*, *Chao1*, *ACE*, *Shannon*, *Simpson*, *Pielou*) are supported. In addition, *MicrobiotaProcess* provides *mp_cal_pd_metric* to calculate phylogenetic community structure metrics, such as PD (Faith's Phylogenetic Diversity), NRI (Net Relatedness Index), NTI (Nearest Taxon Index), IAC (Relative deviation from null expectation of phylogenetically balanced abundances), PAE (Phylogenetic evenness of the abundance distribution scaled by branch lengths), HAED (Entropic measure of diversity of evolutionary distinctiveness among individuals) and EAED (Equitability of HAED) [@PhylogeneticMetric2; @PhylogeneticMetric]. These phylogenetic metrics can help to explore the process of microbiota community assembly [@PhylogeneticMetric]. The results can be visualized with *mp_plot_alpha*. The following example shows how to use *mp_cal_alpha* and *mp_plot_alpha* of *MicrobiotaProcess* to analyze the alpha diversity of the community. By default, the rarefied abundance (*RareAbundance*) is used to calculate the alpha diversity indices; users can set *force=TRUE* in *mp_cal_alpha* to calculate alpha diversity when the abundance cannot be rarefied (see Section \ref{chapter3.3.1}).
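
For readers unfamiliar with the non-phylogenetic indices, the following base R sketch computes richness, Shannon, Simpson and Pielou for two hypothetical samples using their textbook definitions. The helper `alpha_indices` is illustrative only, and the choice of the Gini-Simpson form (one minus the sum of squared proportions) is an assumption about which variant is reported.

```r
# Hypothetical rarefied counts for two samples (rows) and four features.
counts <- rbind(
  even   = c(25, 25, 25, 25),
  uneven = c(97,  1,  1,  1)
)

alpha_indices <- function(x) {
  x <- x[x > 0]                         # ignore absent features
  p <- x / sum(x)                       # relative abundance
  shannon <- -sum(p * log(p))           # Shannon entropy (natural log)
  c(
    Observe = length(x),                # richness: number of observed features
    Shannon = shannon,
    Simpson = 1 - sum(p^2),             # Gini-Simpson form (assumed variant)
    Pielou  = shannon / log(length(x))  # evenness: Shannon / log(richness)
  )
}

alpha_tbl <- t(apply(counts, 1, alpha_indices))
round(alpha_tbl, 3)
```

Both samples have the same richness, but the perfectly even sample reaches the maximum Pielou evenness of 1, whereas the sample dominated by one feature scores much lower on Shannon, Simpson and Pielou.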

(ref:AlphaBox) **The raincloud plot of the alpha diversity indices.** The horizontal coordinate represents each group (by the *.group* argument of *mp_plot_alpha*), and the vertical coordinate represents the alpha diversity index. 

```{r fig.width=8, fig.height=3, fig.align="center", warning=FALSE, message=FALSE, fig.cap='(ref:AlphaBox)', AlphaBox}
library(MicrobiotaProcess)
mpse2 %<>% mp_cal_alpha(.abundance = RareAbundance)
p_alpha <- mpse2 %>% 
       mp_plot_alpha(
          .alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou),
          .group = Group,
       ) +
       scale_fill_manual(values=cols) +
       scale_color_manual(values=cols) +
       theme(
         legend.position="none",
         strip.background = element_rect(colour=NA, fill="grey")
       )
p_alpha
```

```{r echo = FALSE}
ggsave(p_alpha, file = "./Figures/alpha.svg", device = "svg", width = 7, height = 3.8)
```

(ref:AlphaBox2) The raincloud plot of phylogenetic diversity index. The horizontal coordinate represents each group (by *.group* argument of *mp_plot_alpha*), the vertical coordinate represents the phylogenetic diversity index.

```{r, fig.width=8, fig.height=3, fig.align="center", warning=FALSE, message=FALSE, fig.cap='(ref:AlphaBox2)', AlphaBox2}
mpse2 %<>% mp_cal_pd_metric(.abundance = RareAbundance, metric = all)
p.pd_alpha <- mpse2 %>%
        mp_plot_alpha(
           .alpha = c("PAE", "NRI", "NTI", "PD", "HAED", "EAED", "IAC"),
           .group = Group,
        ) +
        scale_fill_manual(values=cols)+
        scale_color_manual(values=cols) +
        theme(legend.position="none",
              strip.background = element_rect(colour=NA, fill="grey"))
p.pd_alpha    
```

```{r echo = FALSE}
p.alpha2 <- mpse2 %>%
            mp_plot_alpha(
              .alpha = c(Observe, Shannon, Pielou, NRI, PD, HAED),
              .group = Group,
            ) +
            scale_fill_manual(values = cols) +
            scale_color_manual(values = cols) +
            theme(
              legend.position = "none", 
              strip.background = element_rect(colour=NA, fill="grey"),
              panel.border=element_rect(linewidth=.2), 
              axis.ticks=element_line(linewidth=.2)
            )

p.alpha2$layers[[2]]$aes_params$size <- .8
p.alpha2$layers[[2]]$aes_params$stroke <- .1
p.alpha2$layers[[3]]$aes_params$size <- .4
p.alpha2$layers[[4]]$aes_params$size <- .2
ggsave(p.alpha2, file = './Figures/alpha2.svg', device = 'svg', width = 7, height = 3.8)
```

\newpage

<!--

These metrics might provide new clues for the diagnosis of disease. We used each the significant differential metric (100 random samples) to calculate the AUC, then found the vast majority of AUC scores are greater than 0.65.

(ref:PdAUC) The box plot of the AUC score based on the significant differential alpha diversity metrics.

```{r, fig.width = 5.6, fig.height = 4, fig.align = 'center', warning = FALSE, message = FALSE, fig.cap = '(ref:PdAUC)', PdAUC}
generate_random_sample <- function(dat, prob=2/3){
    dat %>%
    dplyr::group_split(Group) %>%
    lapply(function(x)x[sample(nrow(x), size=prob * nrow(x)),]) %>%
    dplyr::bind_rows()
}

mpse2 %>%
mp_extract_sample %>%
select(Sample, Observe, IAC, NRI, HAED, PD, Group) %>%
tibble::column_to_rownames(var='Sample') -> dat

pd.roc.res <- replicate(100,
             pROC::roc(formula=Group ~ Observe+IAC+NRI+HAED+PD,
                   data = generate_random_sample(dat),
                   levels=c('Control', 'CD'),
                   quiet = TRUE,
               ) %>%
             lapply(function(x)as.numeric(x$auc)) %>%
             unlist()
       )

pd.roc.p <- pd.roc.res %>%
magrittr::set_colnames(value=paste0('Sample', seq_len(100))) %>%
tibble::as_tibble(rownames='Metrics') %>%
tidyr::pivot_longer(cols=!Metrics, names_to='Sample', values_to='AUC') %>%
ggplot(aes(x=AUC, y=Metrics, fill=Metrics)) +
geom_boxplot() +
geom_jitter(height=.3, color='grey') +
theme_bw() +
theme(legend.position='none') +
ylab(NULL)
pd.roc.p
```

-->

## Taxonomy composition analysis {#chapter2.4}

### Statistics and visualization of specific levels {#chapter2.4.1}

*MicrobiotaProcess* provides *mp_cal_abundance* and *mp_plot_abundance* to calculate and visualize the composition of microbial communities. After *mp_cal_abundance* has been run, the abundance at a specific taxonomic level can be extracted with *mp_extract_abundance* (see Section \ref{chapter2.5.4}). 
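
The underlying calculation can be sketched in base R as follows: feature counts are summed within each taxon of the chosen level and then converted to percentages per sample. The count matrix and the class labels are hypothetical, and this is only an illustration of the arithmetic, not the code path used by *mp_cal_abundance*.

```r
# Hypothetical feature-by-sample counts and a class label for each feature.
counts <- matrix(
  c(40, 10,
    10, 30,
    50, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2"))
)
feature_class <- c(ASV1 = "c__Bacilli", ASV2 = "c__Bacilli", ASV3 = "c__Clostridia")

# Sum the counts of all features that belong to the same class.
class_counts <- rowsum(counts, group = feature_class[rownames(counts)])

# Convert each sample (column) to percentages.
class_rel <- sweep(class_counts, 2, colSums(class_counts), "/") * 100
class_rel
```

Each column of `class_rel` sums to 100, which is the quantity plotted on a relative abundance (%) axis.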

(ref:Abundance1) **The relative abundance of each sample in *class* level**

```{r fig.width=8.5, fig.height=4, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:Abundance1)", Abundance1}
library(ggplot2)
library(MicrobiotaProcess)
# The relative abundance of all taxonomy for samples will be calculated
mpse2 %<>% mp_cal_abundance(.abundance = RareAbundance)
# The relative abundance of all taxonomy for group will be calculated
mpse2 %<>% mp_cal_abundance(.abundance = RareAbundance, .group = Group)
# The 30 most abundant taxonomy will be visualized. 
pclass <- mpse2 %>%
      mp_plot_abundance(
         .abundance = RareAbundance,
         .group = Group,
         taxa.class = Class, 
         topn = 30
      ) +
      xlab(NULL) +
      ylab("relative abundance (%)") +
      theme(
         legend.key.width = unit(0.3, "cm"), 
         legend.key.height = unit(0.3, "cm"), 
         legend.text = element_text(size=6)
      )
pclass
```

```{r, echo = FALSE}
ggsave(pclass, filename = "./Figures/AbundanceBar.svg", device = "svg", height=4, width = 8.5)
```

The relative abundance of different groups can also be visualized by providing the *.group* argument and setting *plot.group=TRUE* in *mp_plot_abundance*. To view the raw abundance (count or others) of taxa, set the *relative* parameter of *mp_plot_abundance* to *FALSE*.

(ref:Abundance2) **This example shows how to display the abundance (count or other) of each sample and the relative abundance of each group**. The abundance (rarefied counts) of each sample (A) and the relative abundance of each group (B). These results suggest that *Gammaproteobacteria* might be more abundant in the *CD* group than in the *control* group.

```{r fig.width=12, fig.height=6, fig.cap='(ref:Abundance2)', warning=FALSE, message=FALSE, fig.align="center", Abundance2}
# Show the abundance in different groups.
fclass <- mpse2 %>%
          mp_plot_abundance(
             .abundance = RareAbundance,
             .group = Group,
             taxa.class = Class,
             topn = 30,
             plot.group = TRUE
          ) +
          xlab(NULL) +
          ylab("relative abundance (%)") +
          theme(legend.position = "none")

pclass2 <- mpse2 %>%
          mp_plot_abundance(
             .abundance = RareAbundance, 
             .group = Group,
             relative = FALSE,
             taxa.class = Class,
             topn = 30
          ) +
          xlab(NULL) +
          ylab("count reads") +
          theme(
             legend.key.width = unit(0.3, "cm"),
             legend.key.height = unit(0.3, "cm"),
             legend.text = element_text(size=6)
          )          

aplot::plot_list(pclass2, fclass, widths=c(10, 1), tag_levels = "A")
```

The abundance of features can also be visualized as a heatmap with *mp_plot_abundance* by setting `geom="heatmap"`.

(ref:AbundanceHeatmap) **The heatmap of abundance for each sample at *class* level.** The color (continuous) of the heatmap represents the abundance of the different classes, shown as relative abundance (A) and as rarefied counts (B). The color of the bar plot represents the group of each sample, the horizontal coordinate represents the samples, and the vertical coordinate represents the different classes.

```{r, fig.width = 12, fig.height = 4, fig.align = "center", fig.cap = '(ref:AbundanceHeatmap)', warning=FALSE, message=FALSE, AbundanceHeatmap}
hclass1 <- mpse2 %>%
          mp_plot_abundance(
             .abundance = RareAbundance,
             .group = Group,
             taxa.class = Class,
             topn = 30,
             geom = "heatmap"
          ) %>%
          set_scale_theme(
            x = list(scale_fill_viridis_c(option = "H"),
                     theme(
                       axis.text.x = element_text(size = 6),
                       axis.text.y = element_text(size = 7),
                       legend.title = element_text(size = 7),
                       legend.text = element_text(size = 5),
                       legend.key.width = unit(0.3, "cm"),
                       legend.key.height = unit(0.3, "cm")
                     )
                ),
            aes_var = RelRareAbundance
          ) %>%
          set_scale_theme(
            x = list(scale_fill_manual(values = cols),
                     theme(
                       legend.key.height = unit(0.3, "cm"),
                       legend.key.width = unit(0.3, "cm"),
                       legend.spacing.y = unit(0.02, "cm"),
                       legend.text = element_text(size = 7),
                       legend.title = element_text(size = 9)
                     )
                ),
            aes_var = Group
          )

hclass2 <- mpse2 %>%
          mp_plot_abundance(
             .abundance = RareAbundance,
             .group = Group,
             taxa.class = Class,
             topn = 30, 
             geom = 'heatmap',
             relative = FALSE
          ) %>%
          set_scale_theme(
            x = list(scale_fill_viridis_c(option = "H"),
                     theme(
                       axis.text.x = element_text(size = 6),
                       axis.text.y = element_text(size = 7),
                       legend.title = element_text(size = 7),
                       legend.text = element_text(size = 5),
                       legend.key.width = unit(0.3, "cm"),
                       legend.key.height = unit(0.3, "cm")
                     )
                ),
            aes_var = RareAbundance
          ) %>%
          set_scale_theme(
            x = list(scale_fill_manual(values = cols),
                     theme(
                       legend.key.height = unit(0.3, "cm"),
                       legend.key.width = unit(0.3, "cm"),
                       legend.spacing.y = unit(0.02, "cm"),
                       legend.text = element_text(size = 7),
                       legend.title = element_text(size = 9)
                     )
                ),
            aes_var = Group
          )

p <- aplot::plot_list(hclass1, hclass2, nrow = 1, tag_levels = "A")
p
```

```{r, echo = FALSE}
ggsave(hclass1, filename = "./Figures/Abunheat.svg", device = "svg", width = 7, height = 5)
```

### Venn or UpSet plot {#chapter2.4.2}

Venn diagrams and UpSet plots give an overview of the features that are shared by, or unique to, each group. *MicrobiotaProcess* provides *mp_cal_venn* (*mp_plot_venn*) and *mp_cal_upset* (*mp_plot_upset*) to perform the analysis.
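
The set logic behind such plots can be sketched in base R. In this hypothetical example a feature is assigned to a group when it is detected in at least one sample of that group; this membership rule is an assumption made for illustration and may differ from the rule used by *mp_cal_venn* and *mp_cal_upset*.

```r
# Hypothetical feature-by-sample counts and the group of each sample.
counts <- matrix(
  c(5, 3, 0, 0,
    0, 0, 4, 2,
    1, 0, 6, 0,
    0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), paste0("S", 1:4))
)
group <- c(S1 = "CD", S2 = "CD", S3 = "Control", S4 = "Control")

# A feature belongs to a group's set if it is detected in any of its samples.
feature_sets <- lapply(split(names(group), group), function(samples) {
  rownames(counts)[rowSums(counts[, samples, drop = FALSE]) > 0]
})

shared       <- intersect(feature_sets$CD, feature_sets$Control)
only_cd      <- setdiff(feature_sets$CD, feature_sets$Control)
only_control <- setdiff(feature_sets$Control, feature_sets$CD)
```

The lengths of `shared`, `only_cd` and `only_control` correspond to the three regions of a two-set Venn diagram, or to the bars of an UpSet plot.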

(ref:VennUpsetPlot) **The Venn diagram and UpSet plot for groups at the OTU/ASV level**

```{r, fig.width=5, fig.height=5, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:VennUpsetPlot)", VennUpsetPlot}
mpse2 %<>% 
    mp_cal_venn(
      .abundance = RareAbundance,
      .group = Group
    ) 

venn_p <- mpse2 %>% 
    mp_plot_venn(
      .group = Group,
      set_size = 2.5,
      label_size = 2,
      edge_size = 2.5
    ) +
    scale_colour_manual(values = cols) +
    scale_fill_viridis_c(guide = guide_colorbar(barwidth=.3, barheight=2)) +
    theme(
      legend.title = element_text(size = 8), 
      legend.text = element_text(size = 6) 
    )

mpse2 %<>%
    mp_cal_upset(
      .abundance = RareAbundance,
      .group = Group
    )

upset_p <- mpse2 %>%
    mp_plot_upset(
      .group = Group
    ) +
    theme_bw() +
    theme(
      plot.background = element_blank(),
      panel.border = element_blank(),
      panel.grid = element_blank(),
      axis.line.x.bottom = element_line(linewidth = .5),
      axis.line.y.left = element_line(linewidth = .5)
    ) +
    ggupset::theme_combmatrix(
      combmatrix.label.extra_spacing = 40
    )

library(ggpp)
p.up.venn <- upset_p + 
             ggpp::annotate(
               "plot_npc", 
               npcx = "right", 
               npcy = "top", 
               label = venn_p,
               vp.width = 0.6,
               vp.height = 0.4
             )
p.up.venn
```

```{r echo = FALSE}
ggsave(p.up.venn, filename = "./Figures/upset.svg", device = "svg", width = 5, height = 5)
```

## Beta diversity analysis {#chapter2.5}

### PCA analysis {#chapter2.5.1}

`PCA` (Principal Component Analysis) and `PCoA` (Principal Coordinate Analysis) are general statistical procedures for comparing the dissimilarity of samples. `PCoA` can be based on phylogenetic or count-based distance metrics, such as `Bray-Curtis`, `Jaccard`, `Unweighted-UniFrac` and `weighted-UniFrac`. *MicrobiotaProcess* provides *mp_cal_dist*, *mp_cal_pca*, *mp_cal_pcoa*, *mp_cal_dca*, *mp_cal_nmds*, *mp_cal_cca*, *mp_cal_rda*, *mp_adonis*, *mp_anosim*, *mp_mrpp*, *mp_envfit* and *mp_mantel* for the related analyses.
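
As a minimal sketch of the two steps used for the PCA below, the following base R code applies the Hellinger transformation (the square root of the within-sample proportions) to a hypothetical count table and then runs a PCA with `prcomp`. The data are invented, and the centring and scaling settings are assumptions rather than the settings used by *mp_cal_pca*.

```r
# Hypothetical sample-by-feature count matrix.
counts <- matrix(
  c(16,  9,  0,
     4,  4,  1,
     0, 25, 25,
     1,  0,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("ASV", 1:3))
)

# Hellinger transformation: square root of the within-sample proportions.
hellinger <- sqrt(counts / rowSums(counts))

# PCA on the transformed table (features centred, not scaled; assumed settings).
pca <- prcomp(hellinger, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component
head(pca$x[, 1:2])
```

After the transformation the squared values of each sample sum to 1, which reduces the influence of highly abundant features on the Euclidean geometry that PCA relies on.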

(ref:PCAplot) **The PCA plot of the community**. Each point represents one sample, and the size of the point represents the number of observed OTUs in the sample. The color of the point represents the group of the sample. The plot is based on the first and second components (A) and on the first and third components (B).

```{r , fig.width=10, fig.height=4, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:PCAplot)", PCAplot}
library(MicrobiotaProcess)
library(patchwork)
# hellinger transform
mpse2 %<>% 
    mp_decostand(
        .abundance = Abundance, 
        method = "hellinger"
    )

mpse2 %<>% mp_cal_pca(.abundance = hellinger)
# Visualizing the result
pcaplot1 <- mpse2 %>%
            mp_plot_ord(
              .ord = pca, 
              .group = Group,
              .starshape = Group,
              .size = Observe
            ) +
            scale_fill_manual(values = cols) +
            scale_size_continuous(
              range = c(1, 3), 
              guide = guide_legend(override.aes = list(starshape = 15))
            ) +
            theme(
              legend.key.width = unit(0.3, "cm"),
              legend.key.height = unit(0.3, "cm"),
              legend.text = element_text(size = 6),
              legend.title = element_text(size = 7)
            )
# .dim = c(1, 3) to show the first and third principal components.
pcaplot2 <- mpse2 %>%
            mp_plot_ord(
              .ord = pca, 
              .dim = c(1, 3), 
              .group = Group,
              .starshape = Group,
              .size = Observe
            ) +
            scale_fill_manual(values = cols) + 
            scale_size_continuous(
              range = c(1, 3),
              guide = guide_legend(override.aes = list(starshape = 15))
            ) +
            theme(
              legend.key.width = unit(0.3, "cm"),
              legend.key.height = unit(0.3, "cm"),
              legend.text = element_text(size = 6),
              legend.title = element_text(size = 7)
            )

(pcaplot1 | pcaplot2) + plot_annotation(tag_levels = "A")
```

### PCoA analysis {#chapter2.5.2}

(ref:PCoAplot) **The PCoA plot based on Bray-Curtis distance**. 

```{r, fig.width=10, fig.height=4, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:PCoAplot)", PCoAplot}
# distmethod
# "unifrac",  "wunifrac", "manhattan", "euclidean", "canberra", "bray", "kulczynski" ...(vegdist, dist)
mpse2 %<>%
    mp_cal_dist(
      .abundance = hellinger,
      distmethod = "bray"
    )

# PCoA analysis
mpse2 %<>% 
    mp_cal_pcoa(
      .abundance = hellinger,
      distmethod = "bray"
    )
pcoaplot1 <- mpse2 %>%
             mp_plot_ord(
               .ord = pcoa,
               .group = Group,
               .starshape = Group,
               .color = Group,
               .size = Observe,
               ellipse = TRUE,
               show.legend = FALSE
            ) +
            scale_color_manual(
               values = cols 
            ) +
            scale_fill_manual(values = cols) +
            scale_size_continuous(
               range = c(1, 3),
               guide = guide_legend(override.aes = list(starshape = 15))
            ) +
            theme(
               legend.key.width = unit(0.3, "cm"),
               legend.key.height = unit(0.3, "cm"),
               legend.text = element_text(size=6),
               legend.title = element_text(size=7)
            )
# first and third principal co-ordinates
pcoaplot2 <- mpse2 %>%
             mp_plot_ord(
               .ord = pcoa, 
               .group = Group,
               .starshape = Group,
               .color = Group,
               .size = Observe,
               ellipse = TRUE,
               .dim = c(1, 3),
               show.legend = FALSE
             ) +
             scale_color_manual(
               values = cols
             ) +
             scale_fill_manual(
               values = cols
             ) +
             scale_size_continuous(
               range = c(1, 3),
               guide = guide_legend(override.aes = list(starshape = 15))
             ) +
             theme(
               legend.key.width = unit(0.3, "cm"),
               legend.key.height = unit(0.3, "cm"),
               legend.text = element_text(size = 6),
               legend.title = element_text(size = 7)
             )
(pcoaplot1 | pcoaplot2) + plot_annotation(tag_levels = "A")
```

```{r, echo = FALSE}
ggsave(pcoaplot1, filename="./Figures/pcoa.svg", width = 5, height = 4, device = "svg")
```
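
For reference, the Bray-Curtis dissimilarity and the classical PCoA can be sketched in base R as follows. The abundance table and the helper `bray_curtis` are hypothetical; the sketch uses `cmdscale` for the ordination and is not meant to reproduce the implementation of *mp_cal_dist* or *mp_cal_pcoa*.

```r
# Hypothetical sample-by-feature abundance matrix.
abund <- matrix(
  c(10,  0,  5,
     8,  2,  5,
     0, 12,  3,
     1, 10,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("ASV", 1:3))
)

# Bray-Curtis dissimilarity between two samples: sum|x - y| / sum(x + y).
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}

bray <- bray_curtis(abund)

# Classical PCoA: metric multidimensional scaling of the distance matrix.
pcoa <- cmdscale(bray, k = 2, eig = TRUE)
pcoa$points
```

The rows of `pcoa$points` are the sample coordinates that an ordination plot displays, and `pcoa$eig` holds the eigenvalues from which the proportion of variation per axis can be derived.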

The distances between the samples can also be visualized with `mp_plot_dist` as a heatmap or a boxplot.

(ref:distplot) **The distance heatmap and the boxplot of the distances between samples**. The size and color of the heatmap represent the distance between each pair of samples, and the color of the bar plot represents the group of each sample (A). The boxplot represents the pairwise distances of samples within and between the groups (B); it shows that the dissimilarity between the *control* and *CD* samples is significant, which is consistent with the result of the Permutational Multivariate Analysis of Variance in Section \ref{chapter2.5.3}.

```{r, fig.width = 10, fig.height = 6, fig.align="center", message = FALSE, fig.cap = '(ref:distplot)', distplot}
pdist1 <- mpse2 %>%
          mp_plot_dist(
            .distmethod = bray, 
            .group = Group
          ) %>%
          set_scale_theme(
            x = scale_fill_manual(
                  values=cols,
                  guide = guide_legend(
                             keywidth = 0.5,
                             keyheight = 0.5,
                             label.theme=element_text(size=6)
                    )
                ),
            aes_var = Group
          ) %>%
          set_scale_theme(
            x = list(scale_size_continuous(range = c(1, 3)),
                     scale_color_viridis_c(option = "H"),
                     theme(
                       legend.key.width = unit(0.3, "cm"),
                       legend.text = element_text(size = 6),
                       legend.title = element_text(size = 7)
                     )
                ),
            aes_var = bray
          )

pdist2 <- mpse2 %>%
          mp_plot_dist(
            .distmethod = bray,
            .group = Group,
            group.test = TRUE
          ) +
          scale_color_manual(
            values = c("orange", "#00A08A", "deepskyblue")
          ) +
          scale_fill_manual(
            values = c("orange", "#00A08A", "deepskyblue")
          )
aplot::plot_list(pdist1, pdist2, widths = c(3, 1), nrow=1, tag_levels = "A")
```
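
For orientation, the Bray-Curtis dissimilarity displayed above can be reproduced by hand. The following is a minimal base R sketch on a made-up count matrix; the matrix `toy` and the helper `bray_curtis` are hypothetical names used only for illustration and are not part of *MicrobiotaProcess*.

```r
# Toy abundance matrix: rows are samples, columns are features (made-up values).
toy <- matrix(
  c(10, 0, 5,
     8, 2, 5,
     0, 9, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("OTU_a", "OTU_b", "OTU_c"))
)

# Bray-Curtis dissimilarity between two non-negative abundance vectors:
# sum(|x - y|) / sum(x + y); 0 = identical, 1 = no shared features.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Pairwise dissimilarity matrix for all samples.
n <- nrow(toy)
bc <- matrix(0, n, n, dimnames = list(rownames(toy), rownames(toy)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- bray_curtis(toy[i, ], toy[j, ])
  }
}

# Convert to a "dist" object, the form expected by clustering functions.
bc_dist <- as.dist(bc)
bc_dist
```

In this toy example S1 and S2 share most of their counts and therefore have a small dissimilarity, whereas S3 is far from both.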

\newpage

### Permutational Multivariate Analysis of Variance {#chapter2.5.3}

We can also perform the Permutational Multivariate Analysis of Variance (PERMANOVA) using *mp_adonis*, which wraps the *adonis* function of *vegan* [@vegan].

```{r} 
mpse2 %<>% mp_adonis(.abundance = hellinger, distmethod = "bray", 
            .formula = ~Group, permutations = 9999, action = "add")
mpse2 %>% mp_extract_internal_attr(name=adonis) %>% mp_fortify()
```

From the result, we found that the *pvalue* of the *adonis* analysis for `Group` is smaller than 0.05, meaning that the dissimilarity of samples between the groups is significant, which is consistent with the result in section \ref{chapter2.5.2}.
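
The general idea behind such a permutation test can be sketched in base R. The example below is only an illustration under stated assumptions: the data are simulated, the helper `pseudo_f` is a hypothetical name, and a Euclidean distance is used for brevity. It is not the implementation used by *vegan* or *mp_adonis*.

```r
set.seed(1)

# Made-up data: 12 samples in two groups, 4 features.
group <- factor(rep(c("A", "B"), each = 6))
abund <- rbind(
  matrix(rpois(24, lambda = 20), nrow = 6),
  matrix(rpois(24, lambda = 5), nrow = 6)
)
d <- as.matrix(dist(abund))  # Euclidean here; any distance matrix can be used

# Pseudo-F statistic computed from a distance matrix and a grouping factor.
pseudo_f <- function(d, g) {
  n <- nrow(d)
  a <- nlevels(g)
  ss_total <- sum(d[upper.tri(d)]^2) / n
  ss_within <- sum(vapply(levels(g), function(l) {
    idx <- which(g == l)
    sub <- d[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]^2) / length(idx)
  }, numeric(1)))
  ss_between <- ss_total - ss_within
  (ss_between / (a - 1)) / (ss_within / (n - a))
}

f_obs <- pseudo_f(d, group)

# Null distribution: shuffle the group labels and recompute the statistic.
n_perm <- 999
f_perm <- replicate(n_perm, pseudo_f(d, sample(group)))

# Permutation p-value (the observed statistic counts as one permutation).
p_value <- (sum(f_perm >= f_obs) + 1) / (n_perm + 1)
c(F = f_obs, p = p_value)
```

A small p-value indicates that the observed between-group separation is rarely matched when the group labels are assigned at random.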

### Hierarchical cluster analysis of samples {#chapter2.5.4}

Beta diversity metrics assess the differences between microbial communities. Besides `PCA` or `PCoA`, they can also be visualized with hierarchical clustering based on ggplot2 [@ggplot2], ggtree [@ggtree2017] and ggtreeExtra [@ggtreeExtra].

(ref:HClustplot) **The hierarchical clustering plot of samples based on the Bray-Curtis distance calculated from the abundance of OTUs/ASVs, together with the relative abundance of phyla in each sample**

```{r fig.width=7, fig.height=5, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:HClustplot)", HClustplot}
library(ggplot2)
library(MicrobiotaProcess)
library(ggtree)
library(ggtreeExtra)
mpse2 %<>% 
    mp_cal_clust(.abundance = hellinger, distmethod = "bray", action = "add")
hcsample <- mpse2 %>% mp_extract_internal_attr(name=SampleClust)
# rectangular layout + relative abundance of phyla
phy.tb <- mpse2 %>%
          mp_extract_abundance(
            taxa.class = Phylum, 
            topn = 30
          ) %>%
          tidyr::unnest(cols=RareAbundanceBySample) %>% 
          dplyr::rename(Phyla="label")
cplot1 <- ggtree(hcsample, layout = "rectangular") +
          geom_treescale(fontsize = 2) +
          geom_tippoint(mapping=aes(color=Group)) +
          geom_fruit(
            data = phy.tb,
            geom = geom_col,
            mapping = aes(x = RelRareAbundanceBySample, y = Sample, fill = Phyla),
            orientation = "y",
            offset = 0.08,
            pwidth = 3,
            width = .6,
            axis.params = list(
              axis = "x",
              title = "The relative abundance of phyla (%)",
              title.size = 3,
              title.height = 0.04,
              text.size = 2,
              vjust = 1
            )
          ) +
          geom_tiplab(as_ylab = TRUE) +
          scale_color_manual(
            values = cols, 
            guide = guide_legend(
              keywidth = .5, 
              keyheight = .5, 
              title.theme = element_text(size = 8),
              label.theme = element_text(size = 6)
            )
          ) +
          scale_fill_manual(
            values=c(colorRampPalette(RColorBrewer::brewer.pal(8,"Set2"))(6)),
            guide = guide_legend(
              keywidth = .5,
              keyheight = .5,
              title.theme = element_text(size = 8),
              label.theme = element_text(size = 6)
            )
          ) +
          scale_x_continuous(expand = c(0, 0.01)) 
cplot1
```

```{r, echo = FALSE}
ggsave(cplot1, filename = "./Figures/sample.clust.svg", device = "svg", width = 7, height = 5)
```
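
The clustering step itself can be illustrated with base R alone. This is a minimal sketch on made-up relative abundances; the object names are hypothetical, and the average linkage chosen here is an assumption that may differ from the linkage used by the package.

```r
# Toy relative-abundance table: rows are samples (made-up values).
toy <- rbind(
  S1 = c(0.60, 0.30, 0.10),
  S2 = c(0.55, 0.35, 0.10),
  S3 = c(0.10, 0.20, 0.70),
  S4 = c(0.15, 0.15, 0.70)
)

# Any dissimilarity of class "dist" can be clustered; Euclidean is used for brevity.
d <- dist(toy)

# Agglomerative clustering with average linkage (UPGMA).
hc <- hclust(d, method = "average")

# Cut the dendrogram into two clusters and inspect the membership.
membership <- cutree(hc, k = 2)
membership
```

The resulting `hclust` object is the kind of tree structure that dendrogram-plotting tools take as input.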

\newpage

## Biomarker discovery {#chapter2.6}

This package provides `mp_diff_analysis` to detect biomarkers. The result obtained with `action = "get"` can be visualized by *ggdiffbox*, *ggdiffclade*, *ggeffectsize* and *ggdifftaxbar*; the result obtained with `action = "add"` can be visualized by *mp_plot_diff_res* and *mp_plot_diff_cladogram*, or displayed manually using `ggtree` [@ggtree2017] and `ggtreeExtra` [@ggtreeExtra].

(ref:DeCladogram) **The cladogram of significant differential taxa between CD group and Control group.** The highlighted clades represent the differential taxa enriched in the corresponding group. We found that the species from Proteobacteria were enriched in the CD group, whereas the species (OTU_42) from Actinobacteria were enriched in the Control group.

```{r warning=FALSE, message=FALSE, fig.width = 10, fig.height=10, fig.align='center', fig.cap='(ref:DeCladogram)', DeCladogram}
# for the kruskal_test and wilcox_test
library(coin)
library(MicrobiotaProcess)

# get the result (diffAnalysisClass) of the differential analysis with action = 'get'.
deres <- mpse2 %>%
         mp_diff_analysis(
            .abundance = RareAbundance,
            .group = Group,
            first.test.method = "kruskal_test",
            filter.p = "pvalue",
            first.test.alpha = 0.05,
            strict = TRUE,
            second.test.method = "wilcox_test",
            second.test.alpha = 0.05,
            subcl.min = 3,
            subcl.test = TRUE,
            ml.method = "lda",
            ldascore = 3,
            action = "get"
         )
# The result of the differential analysis is added to the taxatree with action = 'add'
mpse2 <- mpse2 %>%
         mp_diff_analysis(
            .abundance = RareAbundance,
            .group = Group,
            first.test.method = "kruskal_test",
            filter.p = "pvalue",
            first.test.alpha = 0.05,
            strict = TRUE,
            second.test.method = "wilcox_test",
            second.test.alpha = 0.05,
            subcl.min = 3,
            subcl.test = TRUE,
            ml.method = "lda",
            ldascore = 3,
            action = "add"
         )

p.clado <- mpse2 %>% 
   mp_plot_diff_cladogram(
     taxa.class = Order, 
     removeUnknown = TRUE,
     as.tiplab = TRUE, 
     tip.annot = TRUE, 
     label.size = 2.2
   ) + 
   scale_fill_diff_cladogram(values=cols)
p.clado
```

\newpage

```{r echo=FALSE}
ggsave(p.clado, filename="./Figures/diff_cladogram.cd.svg", device="svg", width = 10, height = 10)
```
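
The two-step testing logic configured above (an overall test followed by a pairwise test) can be sketched with base R. This is a simplified illustration on simulated values: it uses `kruskal.test` and `wilcox.test` from the *stats* package instead of the *coin* functions, it omits the subclass and LDA steps, and all object names are hypothetical.

```r
set.seed(42)

# Made-up relative abundance of one taxon in two groups of 10 samples.
group <- factor(rep(c("CD", "Control"), each = 10))
abundance <- c(rnorm(10, mean = 8, sd = 1), rnorm(10, mean = 3, sd = 1))

# Step 1: overall test across the groups (Kruskal-Wallis rank-sum test).
kw <- kruskal.test(abundance ~ group)

# Step 2: pairwise test between the groups (Wilcoxon rank-sum test),
# carried out only when the first step passes the alpha threshold.
alpha <- 0.05
wx <- NULL
enriched_in <- NA_character_
if (kw$p.value < alpha) {
  wx <- wilcox.test(abundance ~ group)
  # The group with the higher median is reported as the enriched group.
  enriched_in <- names(which.max(tapply(abundance, group, median)))
}
c(kruskal_p = kw$p.value, wilcox_p = wx$p.value)
enriched_in
```

Taxa that fail the first test are never passed to the second one, which keeps the number of pairwise comparisons small.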

### Visualization of differential results by `ggdiffclade` {#chapter2.6.1}

The color of a discriminative taxon indicates the group in which the taxon is more abundant. The point size shows the negative logarithm (base 10) of the *pvalue*: a bigger point indicates a more significant difference (lower *pvalue*). The *pvalue* was calculated in the first-step test (the default is *kruskal.test*).
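
As a small illustration of this transformation, the base R sketch below converts made-up p-values into the scores that would drive the point size. The values and object names are hypothetical; an FDR-adjusted value can be transformed in the same way.

```r
# Made-up raw p-values for five taxa.
pvalue <- c(taxon1 = 0.0002, taxon2 = 0.004, taxon3 = 0.03,
            taxon4 = 0.2, taxon5 = 0.8)

# Benjamini-Hochberg adjusted p-values (false discovery rate).
fdr <- p.adjust(pvalue, method = "BH")

# Negative base-10 logarithm: a larger score means a smaller p-value.
point_score <- -log10(pvalue)

round(cbind(pvalue, fdr, point_score), 3)
```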

(ref:DiffClade) **The taxa tree clade plot of the differential analysis result (with action = 'get')**.

```{r eval = FALSE, fig.width=12, fig.height=12, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:DiffClade)", DiffClade}
diffclade_p <- ggdiffclade(
                   obj=deres, 
                   alpha=0.3, 
                   linewd=0.15,
                   skpointsize=0.6, 
                   layout="radial",
                   taxlevel=3, 
                   removeUnkown = TRUE,
                   reduce = FALSE # This argument is to remove the branch of unknown taxonomy.
               ) +
               scale_fill_manual(
                   values = cols
               ) +
               guides(color = guide_legend(
                                  keywidth = 0.1, 
                                  keyheight = 0.2,
                                  order = 3,
                                  ncol=1)
               ) +
               theme(
                   panel.background = element_rect(fill=NA),
                   legend.position = "right", 
                   plot.margin = ggplot2::margin(0,0,0,0),
                   legend.key.width = unit(0.2, "cm"),
                   legend.key.height = unit(0.2, "cm"),
                   legend.spacing.y = unit(0.02, "cm"), 
                   legend.title = element_text(size=7),
                   legend.text = element_text(size=6), 
                   legend.box.spacing = unit(0.02,"cm")
               )
diffclade_p
```

We can also visualize the result (default, with `action = 'add'`) via `ggtree` [@ggtree2017] and `ggtreeExtra` [@ggtreeExtra].

(ref:DiffAllplot) **The taxa tree of the community with the relative abundance of each OTU/ASV in each sample and the LDA of the differential OTUs/ASVs**. The taxa tree was built with the taxa of all samples. The highlighted clades of the taxa tree represent the phyla. The external point layer represents the relative abundance of each OTU in each sample. The external bar plot represents the LDA of the differential OTUs. The colored points represent the differential taxa, and the size of each colored point represents the *pvalue* or *fdr*.

```{r, fig.width = 10, fig.height=10, fig.align="center", fig.cap = '(ref:DiffAllplot)', DiffAllplot}
taxa.tree <- mpse2 %>% mp_extract_tree(type='taxatree')
p1 <- ggtree(
        taxa.tree,
        layout="radial",
        size = 0.3
      ) +
      geom_point(
        data = td_filter(!isTip),
        fill="white",
        size=1,
        shape=21
      )
# display the high light of phylum clade.
p2 <- p1 +
      geom_hilight(
        data = td_filter(nodeClass == "Phylum"),
        mapping = aes(node = node, fill = label)
      )
# display the relative abundance of features(OTU)
p3 <- p2 +
      ggnewscale::new_scale("fill") +
      geom_fruit(
         data = td_unnest(RareAbundanceBySample),
         geom = geom_star,
         mapping = aes(
                       x = fct_reorder(Sample, Group, .fun=min),
                       size = RelRareAbundanceBySample,
                       fill = Group,
                       subset = RelRareAbundanceBySample > 0
                   ),
         starshape = 13,
         starstroke = 0.01,
         offset = 0.04,
         pwidth = 1.5,
         grid.params = list(vline = TRUE, size = 0.001, color="snow2", linetype = 1)
      ) +
      scale_size_continuous(
         name="Relative Abundance (%)",
         range = c(0.5, 3),
         guide = guide_legend(override.aes = list(starstroke = 0.25))
      ) +
      scale_fill_manual(values=cols)
# display the tip labels of taxa tree
p4 <- p3 + geom_tiplab(size=2, offset=12.8)
# display the LDA of significant OTU.
p5 <- p4 +
      ggnewscale::new_scale("fill") +
      geom_fruit(
         geom = geom_col,
         mapping = aes(
                       x = LDAmean,
                       fill = Sign_Group,
                       subset = !is.na(LDAmean)
                       ),
         orientation = "y",
         offset = 0.5,
         pwidth = 1,
         axis.params = list(axis = "x",
                            title = "Log10(LDA)",
                            title.height = 0.005,
                            title.size = 2,
                            text.size = 1.8,
                            vjust = 1),
         grid.params = list(linetype = 3)
      )

# display the significant (FDR) taxonomy after kruskal.test (default)
p6 <- p5 +
      ggnewscale::new_scale("size") +
      geom_point(
         data=td_filter(!is.na(Sign_Group)),
         mapping = aes(size = -log10(fdr),
                       fill = Sign_Group
                       ),
         stroke = 0.01,
         shape = 21
      ) +
      scale_size_continuous(range=c(1, 3), guide = guide_legend(override.aes = list(stroke = .25))) +
      scale_fill_manual(values=cols)

p6 <- p6 + theme(
           legend.key.height = unit(0.3, "cm"),
           legend.key.width = unit(0.3, "cm"),
           legend.spacing.y = unit(0.02, "cm"),
           legend.text = element_text(size = 7),
           legend.title = element_text(size = 9)
          )
p6
```

To reduce the coding burden, we also developed *mp_plot_diff_res* to visualize the result of the differential analysis (*mp_diff_analysis*).

(ref:DiffAllplot2) See also Fig. \ref{fig:DiffAllplot}

```{r, eval = FALSE, fig.width = 10, fig.height=10, fig.align="center", fig.cap = '(ref:DiffAllplot2)', DiffAllplot2}
library(ggplot2)
pp <- mpse2 %>% 
    mp_plot_diff_res() +
    scale_fill_manual(
      values = cols
    ) +
    scale_fill_manual(
      aesthetics = "fill_new",
      values = cols
    )
pp
```

\newpage

```{r, echo = FALSE}
ggsave(p6, filename="./Figures/Diffres.svg", device="svg", width = 12, height = 12)
```

### Visualization of differential results (with action = "get") by `ggdiffbox` {#chapter2.6.2}

The left panel represents the relative abundance or abundance (according to the `standard_method`) of the biomarkers, and the right panel represents the confidence interval of the effect size (LDA or MDA) of the biomarkers.
A wider confidence interval indicates that the effect size of the biomarker fluctuates more, owing to the influence of the sampling times.

(ref:DiffBoxplot) **The boxplot and the LDA score of the differential taxa.** The left panel represents the relative abundance of the differential taxa, and the right panel represents the LDA effect size (95% confidence interval) of the differential taxa.

```{r fig.width=6, fig.height=8, fig.align="center", warning=FALSE, message=FALSE, fig.cap="(ref:DiffBoxplot)", DiffBoxplot}
diffbox <- ggdiffbox(obj=deres, box_notch=FALSE, 
                     colorlist=cols, l_xlabtext="relative abundance")
diffbox
```

\newpage

### Visualization of differential results (with action = "get") by `ggdifftaxbar` {#chapter2.6.3}

`ggdifftaxbar` can visualize the abundance of each biomarker in every sample of the groups; the mean and median abundance of the groups or subgroups are also shown. The `output` parameter specifies the output directory.

```{r}
ggdifftaxbar(obj=deres, xtextsize=1.5, 
             output="IBD_biomarkder_barplot",
             colorlist=cols)
```

## Significant differential clades for the diagnosis of some related diseases {#chapter2.7}

*MicrobiotaProcess* provides *mp_balance_clade* to calculate the balance of the clades of a phylogenetic tree from the abundance of the tips (summarized by the geometric mean, mean or median). We can then use *mp_diff_analysis* to identify the significantly differential clades.
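
To make the notion of a balance concrete, the following base R sketch computes a log-ratio of geometric means for the two child clades of a single internal node. It is a simplified illustration on made-up counts: the helper `geo_mean` and the objects `up` and `down` are hypothetical, and the exact formula used by *mp_balance_clade* (for example any scaling factor) may differ.

```r
# Made-up counts of the tips under the two child clades of one internal node
# (rows are samples, columns are tips).
up   <- rbind(S1 = c(12, 30, 7), S2 = c(0, 4, 1))
down <- rbind(S1 = c(3, 5),      S2 = c(20, 40))

# Geometric mean with a pseudocount so that zero counts can be log-transformed.
geo_mean <- function(x, pseudonum = 1) exp(mean(log(x + pseudonum)))

# Balance of the node per sample: log-ratio of the two clade summaries.
balance <- log(apply(up, 1, geo_mean) / apply(down, 1, geo_mean))
balance
```

A positive balance means that the "up" clade dominates in that sample, and a negative balance means that the "down" clade dominates.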

(ref:SignBalanceCladogram) **The cladogram of significant differential clades between the CD and Control group.** The external heatmap represents the differential clades (up and down). The external point layer represents the relative abundance of each OTU on each sample. The external bar plot represents the mean LDA of the differential OTUs.

```{r, fig.width = 14, fig.height = 14, fig.align = 'center', fig.cap = '(ref:SignBalanceCladogram)', warning=FALSE, message=FALSE, SignBalanceCladogram}
library(ggplot2)
library(ggsci)
library(ggtree)
library(forcats)

mpse3 <- mpse2 %>% dplyr::filter(Class != 'c__un_p__Proteobacteria')
mpse3 %>%
    mp_balance_clade(
      .abundance = Abundance,
      force = TRUE,
      relative = FALSE,
      pseudonum = 1,
      balance_fun='geometric.mean'
    ) -> mpse.balance.node

mpse.balance.node %<>%
    mp_diff_analysis(
      .abundance = Abundance,
      force = TRUE,
      relative = FALSE,
      .group = Group,
      fc.method = 'compare_mean'
    ) 

mpse.balance.node %>%
    mp_extract_feature() %>%
    dplyr::filter(!is.na(Sign_Group)) -> ba.node.sign

ba.node.sign %>%
    dplyr::filter(node %in% c(434, 426, 343, 388)) %>%
    tidyr::unnest(Balance_offspring) %>%
    tidyr::unnest(offspringTiplabel) %>%
    select(offspringTiplabel, node) %>%
    dplyr::mutate_at('node', as.character) %>%
    dplyr::rename(BalanceNode = 'node') -> Hight.BalanceNode

p1 <- mpse3 %>% mp_extract_otutree() %>%
      ggtree(
        layout = 'circular', 
        size = .25, 
        color = '#bed0d1'
      ) %<+% Hight.BalanceNode +
      geom_tiplab(
        data = td_filter(!is.na(BalanceNode)),
        size = 1.2,
        mapping = aes(color=BalanceNode),
        align = TRUE,
        linesize = .5,
        linetype = 3,
        offset = 1.45
      ) +
      scale_color_npg(guide=guide_legend(override.aes=list(size = 2.6))) +
      geom_tiplab(
        data = td_filter(is.na(BalanceNode)),
        size = 1.2,
        align = TRUE,
        linesize = .05,
        linetype = 3,
        offset = .9
      ) +
      geom_point(
        data = td_filter(node %in% ba.node.sign$node),
        size = .3,
        color = 'red'
      ) +
      ggrepel::geom_text_repel(
        data = td_filter(node %in% ba.node.sign$node),
        mapping = aes(label = node),
        bg.color = 'white',
        size = 2,
        segment.size = .1,
        min.segment.length = 0,
        max.overlaps = 24
      )

ba.node.sign2 <- ba.node.sign %>%
                 tidyr::unnest(Balance_offspring) %>%
                 tidyr::unnest(offspringTiplabel)

bla.sign.da <- ba.node.sign %>%
    select(OTU, AbundanceBySample) %>%
    tidyr::unnest(AbundanceBySample) %>%
    select(OTU, Sample, Abundance, Group) %>%
    tidyr::pivot_wider(id_cols=c('Sample', 'Group'), values_from=Abundance, names_from=OTU) %>%
    dplyr::mutate_at('Group', as.factor) 

otu.sign.da <- mpse3 %>% mp_extract_feature() %>%
    filter(!is.na(Sign_Group)) %>%
    tidyr::unnest(RareAbundanceBySample) %>%
    select(OTU, RelRareAbundanceBySample, Sample, Group) %>%
    tidyr::pivot_wider(id_cols=c('Sample', 'Group'), names_from='OTU', values_from=RelRareAbundanceBySample) %>%
    dplyr::mutate_at('Group', as.factor)

p2 <- p1 +
    geom_fruit(
      data = ba.node.sign2,
      geom = geom_tile,
      mapping = aes(
        x = OTU,
        y = offspringTiplabel,
        fill = Clade
      ),
      axis.params = list(axis='none', text.angle=-45, vjust=1, hjust=0, text.size=2),
      grid.params = list(),
      pwidth = .5,
      offset = .01
    ) +
    scale_fill_manual(values = c('#00D617', '#E6A519')) +
    scale_y_continuous(limits=c(-1, NA))

p3 <- p2 + 
   ggnewscale::new_scale_fill() +
   geom_fruit(
     data = td_filter(RelRareAbundanceBySample > 0, .f=td_unnest(RareAbundanceBySample)),
     geom = geom_star,
     mapping = aes(
       x = fct_reorder(Sample, Group, .fun=min),
       fill = Group,
       size = RelRareAbundanceBySample
     ),
     offset = .15,
     pwidth = 1.5,
     starshape = 13,
     starstroke = .05,
     grid.params = list(vline=TRUE, size = 0.1, color="snow2", linetype = 1)
   ) +
   scale_fill_manual(values = cols) +
   scale_size_continuous(
     name = 'Relative Abundance(%)',
     range = c(.5, 4),
     guide = guide_legend(override.aes = list(starstroke = .5))
   )

sign.otu <- mpse3 %>%
    mp_extract_feature() %>%
    filter(!is.na(Sign_Group)) %>%
    select(OTU, LDAmean, Sign_Group) %>%
    dplyr::left_join(
      mpse3 %>% mp_extract_taxonomy(),
      by = 'OTU'
    )

p4 <- p3 %<+% sign.otu +
   ggnewscale::new_scale_fill() +
   geom_fruit(
      data = td_filter(!is.na(Sign_Group)),
      geom = geom_tile,
      mapping = aes(fill=Phylum),
      width = .1,
      offset = .1
   ) +
   ggnewscale::new_scale_fill() +
   geom_fruit(
      data = td_filter(!is.na(Sign_Group)),
      geom = geom_col,
      mapping = aes(x = LDAmean, fill = Sign_Group),
      orientation = "y",
      offset = 0.05,
      pwidth = 1,
      axis.params = list(axis = "x",
                         title = "Log10(LDA)",
                         title.height = 0.005,
                         title.size = 2,
                         text.size = 1.8,
                         vjust = 1),
      grid.params = list(linetype = 3) ,
      show.legend = FALSE
   ) +
   scale_fill_manual(values = cols) +
   theme(
     legend.key.width = unit(.3, 'cm'),
     legend.key.height = unit(.3, 'cm'),
     legend.text = element_text(size=6),
     legend.title = element_text(size=8),
     legend.margin = ggplot2::margin(-.25, 0, 0, 0, 'cm')
   )
p4
```

```{r echo=FALSE}
saveRDS(ba.node.sign, './TmpRef/balance_diff_nodes.rds')
saveRDS(mpse.balance.node, './TmpRef/balance_input_CD.rds')
```

\newpage

We found that some differential clades contain closely related species that were not detected in the previous differential analysis, such as OTU_454/OTU_97 (both belonging to Clostridiaceae SMB53) and OTU_152/OTU_233 (both belonging to Lachnospira). This suggests that the phylogenetic transform can improve the detection of differential signals by accumulating small, consistent differences at a broader resolution.

(ref:SignalClades) **The balance scores of significantly differential clades and the relative abundance of their original OTUs.** (A) The relative abundance, the taxonomy information and the compositional clade annotation of the original OTUs. (B) The balance scores of significantly differential clades.

```{r, warning = FALSE, message = FALSE, fig.width = 9.5, fig.height=7, fig.align = 'center', fig.cap = '(ref:SignalClades)', SignalClades}
no.sig.OTUs.da <- mpse.balance.node %>% mp_extract_feature() %>%
    dplyr::filter(!is.na(Sign_Group)) %>%
    select(OTU, node, Balance_offspring) %>%
    tidyr::unnest(Balance_offspring) %>%
    dplyr::filter(node %in% c(434, 426, 343, 388)) %>%
    tidyr::unnest(offspringTiplabel) %>%
    dplyr::arrange(node)

no.sig.OTUs <- no.sig.OTUs.da %>% dplyr::pull(offspringTiplabel)

no.sig.otu.genus <- mpse2 %>%
    mp_extract_taxonomy %>%
    dplyr::filter(OTU %in% no.sig.OTUs) %>%
    select(OTU, Genus) %>%
    dplyr::mutate(Genus=gsub("g__Clostridium_f__Clostridiaceae", "g__Clostridium", Genus))

theme_annot <- function(){
    th <- list(
        labs(x=NULL, y=NULL),
        theme_bw(),
        theme(
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid = element_blank(),
          panel.border = element_blank(),
          legend.key.height = unit(.3, 'cm'),
          legend.key.width = unit(.3, "cm"),
          legend.text = element_text(size=7),
          legend.title = element_text(size=9)
        )
    )
    return(th)
}

mpse2 %>%
    filter(OTU %in% no.sig.OTUs) %>%
    as_tibble() %>%
    ggplot(
      aes(
        y = fct_reorder(Sample, Group, .fun = min),
        x = fct_relevel(OTU, no.sig.OTUs),
        fill = RelRareAbundanceBySample,
        size = RelRareAbundanceBySample
      )
    ) +
    geom_tile(color='grey', size=.5, fill=NA) +
    geom_point(
      data = td_filter(RelRareAbundanceBySample!=0),
      shape=21
    ) +
    scale_fill_gradient2() +
    theme_bw() +
    theme(axis.text.x=element_text(angle=45, hjust=1), panel.grid=element_blank()) +
    labs(x=NULL, y=NULL, size="RelAbun", fill='RelAbun') -> f1

mpse2 %>%
    mp_extract_sample() %>%
    ggplot(aes(y=Sample, fill=Group, x="Group")) +
    geom_tile() +
    scale_fill_manual(values = cols) +
    theme_annot() +
    labs(x=NULL, y=NULL) -> f2

no.sig.OTUs.da %>%
    ggplot(aes(x=fct_reorder(offspringTiplabel, node, .fun=min),
               y='BalanceNode',
               fill=as.character(node))) +
    geom_tile() +
    scale_fill_uchicago() +
    theme_annot() +
    labs(x=NULL, y=NULL, fill='BalanceNode') -> f3

no.sig.otu.genus %>% ggplot(aes(x=OTU,y="Genus",fill=Genus)) +
    geom_tile() +
    labs(fill = 'Genus') +
    coord_cartesian(expand=FALSE) +
    theme_annot() +
    scale_fill_npg() -> f4

ff <- f1 %>%
      aplot::insert_right(f2, width = .1) %>%
      aplot::insert_top(f3, height = .03) %>%
      aplot::insert_top(f4, height = .028)

f.box <- mpse.balance.node %>%
    dplyr::filter(node %in% c(434, 426, 343, 388)) %>%
    as_tibble() %>%
    tidyr::unnest(Balance_offspring) %>%
    dplyr::filter(Clade == 'up') %>%
    ggplot(aes(y = Group, x = Abundance, fill = Group)) +
    geom_boxplot(orientation = 'y') +
    geom_jitter(color = 'grey', height = .2) +
    facet_wrap(pseudolabel~., ncol = 1, strip.position = 'top', scales = 'free') +
    scale_fill_manual(values = cols) +
    ggsignif::geom_signif(comparisons = list(c('CD', 'Control')), orientation = 'y') +
    scale_y_discrete(position = 'right') +
    ylab(NULL) +
    xlab('Balance Score') +
    theme_bw() +
    theme(
      legend.position = 'none',
      panel.grid = element_blank(),
      strip.background = element_rect(fill='grey', color=NA),
      strip.text = element_text(face='bold')
    )

aplot::plot_list(ff, f.box, tag_levels = "A", widths=c(4.5, 5))
```

\newpage

## Performing differential analysis among multiple groups {#chapter2.8}

Differential analysis among multiple groups is performed with *mp_diff_analysis* in the same way as between two groups. The following example uses a dataset from a colorectal cancer study [@Zeller_2014], which was obtained with curatedMetagenomicData. The samples were stools from the CRC (colorectal cancer), the Adenoma (colorectal adenoma) and the Control individuals. Using *mp_diff_analysis*, we found that *Fusobacterium gonidiaformans*, *Porphyromonas asaccharolytica*, *Parvimonas micra*, *Peptostreptococcus stomatis* and *Escherichia coli* were significantly enriched in CRC, and *Ruminococcus lactaris* was significantly enriched in Adenoma, whereas *Bifidobacterium longum*, *Bifidobacterium catenulatum*, *Blautia wexlerae* and *Anaerostipes hadrus* were significantly decreased in CRC and Adenoma.

(ref:MultipleGroupTree) **The cladogram of significant differential taxa.** The highlighted clades represent the differential taxa enriched in the corresponding group.

```{r, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 10, fig.align = "center", fig.cap='(ref:MultipleGroupTree)', MultipleGroupTree}
ExperimentHub::setExperimentHubOption('LOCAL', TRUE)
xx <- curatedMetagenomicData('ZellerG_2014.relative_abundance', dryrun=FALSE)
xx[[1]] %>% as.mpse -> mpse.crc.ZellerG_2014

mpse.crc.ZellerG_2014 %<>% mp_diff_analysis(
    .abundance = Abundance,
    .group = disease,
    force = TRUE,
    relative = FALSE,
    first.test.alpha = 0.05,
    filter.p = "pvalue"
)

p.cladogram <- mpse.crc.ZellerG_2014 %>%
     mp_plot_diff_cladogram(
       .group = disease,
       .size = pvalue,
       taxa.class = Genus,
       hilight.alpha = .3,
       bg.tree.size = .15,
       bg.point.stroke = .1,
       bg.point.size = 1.5,
       label.size = 2.6,
       tip.annot = FALSE,
       as.tiplab = FALSE
     ) +
     scale_fill_diff_cladogram(
       values = c('red', 'orange', 'deepskyblue'),
     ) +
     scale_size_continuous(
       range = c(1, 4)
     )

p.cladogram
```

\newpage

## Interoperable with the existing computing ecosystem {#chapter2.9}

Because the *MPSE* class of *MicrobiotaProcess* inherits from the *SummarizedExperiment* class [@SE], the methods defined for *SummarizedExperiment* can also be applied to *MPSE* objects. For example, *tidybulk* [@tidybulk] provides a tidy R framework for modular transcriptomic data analysis. Its *test_differential_abundance* function performs differential transcription testing using edgeR quasi-likelihood, edgeR likelihood-ratio (LR), limma-voom, limma-voom-with-quality-weights or DESeq2, and it is also compatible with *MPSE*.
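
The mechanism that makes this possible is ordinary S4 inheritance. The sketch below demonstrates it with two toy classes; `ToyContainer`, `ToyExtended` and `n_features` are hypothetical names invented for this illustration and do not exist in *SummarizedExperiment* or *MicrobiotaProcess*.

```r
library(methods)

# Hypothetical parent class standing in for a general container.
setClass("ToyContainer", slots = c(assay = "matrix"))

# Hypothetical child class that extends the parent with one extra slot.
setClass("ToyExtended", contains = "ToyContainer",
         slots = c(note = "character"))

# A generic with a method written only for the parent class.
setGeneric("n_features", function(x) standardGeneric("n_features"))
setMethod("n_features", "ToyContainer", function(x) nrow(x@assay))

obj <- new("ToyExtended", assay = matrix(1:6, nrow = 3), note = "demo")

# The parent-class method is dispatched for the child object.
n_features(obj)
is(obj, "ToyContainer")
```

No method was defined for `ToyExtended`, yet the call succeeds because S4 dispatch falls back to the method of the parent class.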

(ref:tidybulkPlot) **The differential OTUs identified with the edgeR quasi-likelihood method through tidybulk.** (A) The relative abundance heatmap of the differential OTUs. (B) The hierarchical clustering of samples based on the relative abundance of the differential OTUs. (C) The hierarchical clustering of OTUs based on the relative abundance of all OTUs; the differential OTUs are labeled with their names. The clustering of the differential OTUs in the heatmap is consistent with their positions in the background of all OTUs (C).

```{r message=FALSE, warning = FALSE, fig.width = 14.5, fig.height = 7, fig.cap='(ref:tidybulkPlot)', tidybulkPlot}
library(tidybulk)
library(edgeR)
library(aplot)
library(shadowtext)
library(ggrepel)
mpse2 %<>% test_differential_abundance(.abundance = Abundance, .formula = ~Group)
# extract the different OTUs from the MPSE class
res <- mpse2 %>% dplyr::filter(FDR <= .05 & abs(logFC) >= 2)
pp <- res %>%
      mp_plot_abundance(
        .abundance = RareAbundance,
        force = TRUE,
        relative = TRUE,
        feature.dist = "bray",
        geom = "heatmap",
        topn = "all",
        .group = Group
      ) 
pp[[1]] <- pp[[1]] +
           scale_fill_viridis_c(
             option='A', 
             na.value = 0, 
             trans = 'log10'
           ) +
           guides(
             fill = guide_colorbar(
               title = expression(log[10]("relative abundance")),
               title.position = "right",
               title.theme = element_text(angle=-90, size=9, vjust=.5, hjust=.5),
               label.theme = element_text(angle=-90, size=7, vjust=.5, hjust=.5),
               barwidth = unit(.3, 'cm'),  
               barheight = unit(5, 'cm')
             )
           ) +
           theme(
             axis.text.x = element_blank(),
             axis.text.y = element_text(size = 6),
           )

pp[[2]] <- pp[[2]] +
           scale_fill_manual(values = cols) +
           theme(
             legend.key.height = unit(0.3, "cm"),
             legend.key.width = unit(0.3, "cm"),
             legend.spacing.y = unit(0.02, "cm"),
             legend.text = element_text(size = 7),
             legend.title = element_text(size = 9)
           )

f <- res %>%
     mp_extract_taxonomy() %>%
     ggplot() +
     geom_text(
       mapping = aes(y=OTU, x=0, label=Genus, color=Phylum),
       hjust = 0,
       size = 2
     ) +
     scale_x_continuous(expand=c(0, 0, 0, 0.1)) +
     theme_bw() +
     theme(
       legend.text = element_text(size = 5),
       legend.title = element_text(size = 7),
       legend.key.width = unit(0.3, "cm"),
       legend.key.height = unit(0.3, "cm"),
       panel.background = element_blank(),
       panel.grid = element_blank(),
       axis.text = element_blank(),
       axis.ticks = element_blank(),
       panel.border = element_blank()
     ) +
     labs(x = NULL, y = NULL)
pp <- pp %>% insert_right(f, width = 0.4)
sample.tree <- res %>%
      select(-bray) %>%  # remove the existing bray column, because it was calculated with all OTUs
      mp_cal_clust(.abundance = RelRareAbundanceBySample, distmethod = "bray") %>%
      ggtree(layout = igraph::layout_with_kk, color = "#afb7b8") +
      geom_nodepoint(color = "#afb7b8", size = .5) +
      geom_tippoint(aes(fill = Group), shape = 21, size=3) +
      geom_text_repel(
        data = td_filter(isTip),
        mapping = aes(label = label),
        size = 2,
        max.overlaps = 30,
        colour = "black",
        bg.colour = "white"
      ) +
      scale_fill_manual(
        values = cols,
        guide = guide_legend(
           title.theme = element_text(size = 7),
           label.theme = element_text(size = 5),
        )
      )
p <- mpse2 %>%
      mp_cal_dist(
         .abundance = RelRareAbundanceBySample,
         distmethod = "bray",
         cal.feature.dist = TRUE
      ) %>%
      hclust() %>%
      ggtree(layout = igraph::layout_with_kk, color = "#bed0d1") +
      geom_nodepoint(color = "#bed0d1", size = .5)
# The data.frame contained results of test_differential_abundance
otu.tab <- mpse2 %>% mp_extract_feature()
p <- p %<+% otu.tab +
     geom_tippoint(
       mapping = aes(fill = logFC, size = -log10(FDR)),
       shape = 21,
       color = "grey"
     ) +
     scale_fill_viridis_c(
       option="C",
       guide = guide_colorbar(
          title.theme = element_text(size = 7),
          label.theme = element_text(size = 5),
          barheight = unit(1.5, "cm"),
          barwidth = unit(.3, "cm")
       )
     ) +
     scale_size_continuous(
       range = c(.5, 6),
       guide = guide_legend(
          keywidth = .3,
          keyheight = .3,
          label.theme = element_text(size = 5),
          title.theme = element_text(size = 7)
       )
     ) +
     geom_text_repel(
       data = td_filter(FDR <= .05 & abs(logFC) >= 2),
       mapping = aes(x = x, y = y, label = label),
       size = 2,
       min.segment.length = 0.1,
       segment.size = .25,
       segment.colour = 'grey18',
       colour = "black",
       bg.colour = 'white'
       #max.overlaps = 60,
     )
design <- "
  12
  13
  13
"
px <- plot_list(pp, sample.tree, p, design = design, tag_levels = "A")  
px
```

\newpage

We compared the differential analysis results of *edgeR* [@edgeR] and *MicrobiotaProcess*, and found that *edgeR* reported more differential OTUs than *MicrobiotaProcess*. We think this is because the low-abundance OTUs were not removed before the analysis with *tidybulk*. Such filtering is generally required in standard whole-transcriptome workflows; however, if it were applied to the microbiome data, many low-abundance OTUs would be removed. Without this filtering step, *edgeR* [@edgeR] identified more differential OTUs.

(ref:CompareDEotus) **Comparison of the differential analysis results of edgeR and MicrobiotaProcess**

```{r, warning = FALSE, fig.width = 4, fig.height = 2, fig.align = "center", fig.cap='(ref:CompareDEotus)', CompareDEotus}
DE.method <- list(
    EdgeR = mpse2 %>% 
            mp_extract_feature %>% 
            dplyr::filter(FDR<=0.05 & abs(logFC)>=2) %>% 
            pull(OTU), 
    MP = mpse2 %>% 
         mp_extract_feature %>% 
         dplyr::filter(pvalue <=0.05) %>% 
         pull(OTU)
)
library(ggVennDiagram)
ggVennDiagram(DE.method, edge_size = 3, set_size = 4) + 
    scale_color_manual(values=c("pink", "gold")) + 
    scale_fill_viridis_c(option="C")
```

We then extracted the differential OTUs shared by the two methods. Among them, the OTUs belonging to *Bifidobacterium*, *Faecalibacterium*, *Roseburia* and *Coprobacillus* were significantly decreased in the CD group compared with the Control group, whereas several OTUs belonging to *Escherichia*, *Klebsiella* and *Haemophilus*, all members of Gammaproteobacteria, were significantly enriched in the CD group.

```{r}
mpse2 %>% 
    mp_extract_feature(addtaxa=T) %>% 
    dplyr::filter(OTU %in% do.call(intersect, base::unname(DE.method)))
```
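
The `unname()` call in the chunk above matters because `do.call()` passes the names of a list on as argument names. The following base R sketch illustrates this with made-up OTU identifiers (the vectors are hypothetical and are not results of this study).

```r
# Hypothetical OTU identifiers, for illustration only.
de.sets <- list(
  EdgeR = c("OTU_1", "OTU_2", "OTU_3", "OTU_5"),
  MP    = c("OTU_2", "OTU_3", "OTU_4")
)

# do.call() turns list names into argument names, so the names are dropped
# first; base::intersect() only has the formals `x` and `y`.
shared <- do.call(base::intersect, unname(de.sets))
shared

# Reduce() gives the same result and also works for more than two sets.
shared.reduce <- Reduce(base::intersect, de.sets)
identical(shared, shared.reduce)
```

`Reduce()` may be the more convenient form when the results of three or more methods are compared.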

```{r, echo = FALSE}
saveRDS(mpse2, "./data/IBD_data/mpse2.RDS")
sample.tree$layers[[3]]$aes_params$size <- 6
sample.tree$layers[[4]]$aes_params$size <- 3.5
p$layers[[4]]$aes_params$size <- 2.8
p$layers[[4]]$aes_params$segment.size <- .4
pp[[1]] <- pp[[1]] + 
      theme(panel.border=element_rect(linewidth=.2), 
            axis.ticks.y=element_line(linewidth=.2), 
            )
ggsave("./Figures/OTU_heatmap.svg", pp, device = "svg", width = 6, height = 6.2)
ggsave("./Figures/sample.tree.svg", sample.tree, device = "svg", width = 6, height = 5.6)
ggsave("./Figures/OTU.tree.svg", p, device = "svg", width = 7, height = 7)
```

\newpage

## Interface to integrate external data

In addition, because *MPSE* uses the `treedata` class to store the taxonomy, phylogeny and related information, the results of other tools can also be integrated into it easily. We also developed a *left_join* method for this purpose. The updated *MPSE* object can then be further analyzed and visualized.

### Integrating the results of other distance methods

The *mp_cal_dist* function of *MicrobiotaProcess* provides many distance methods, such as "bray", "aitchison", "jaccard", "gower" and "altGower". If users want to use a method that is not provided in *MicrobiotaProcess*, they can use *left_join* to add the result to the *MPSE* object.

(ref:LeftJoinDist) **Integrating an external distance matrix with left_join**

```{r, warning = FALSE, message = FALSE, fig.width = 6.8, fig.height=4.5, fig.align='center', fig.cap='(ref:LeftJoinDist)', LeftJoinDist}
otu.da <- mpse2 %>% mp_extract_assays(.abundance=Abundance)
Aitchison.dist <- robCompositions::aDist(t(otu.da+1))
p1 <- mpse2 %>%
       left_join(y=list(Aitchison=Aitchison.dist)) %>%
       mp_plot_dist(.distmethod=Aitchison, .group=Group, group.test=T) +
       scale_fill_manual(values=c("orange", "#00A08A", "deepskyblue")) +
       scale_color_manual(values=c("orange", "#00A08A", "deepskyblue"))
p2 <- mpse2 %>%
      left_join(y=list(Aitchison=Aitchison.dist)) %>%
      mp_cal_pcoa(distmethod="Aitchison") %>%
      mp_plot_ord(.group=Group) +
      scale_fill_manual(values=cols)
aplot::plot_list(p1, p2, widths = c(0.5, 2))
```
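
To make clear what kind of object is joined above, the following base R sketch computes an Aitchison distance from its textbook definition, the Euclidean distance between centred log-ratio (CLR) vectors. The count matrix is made up, and we have not checked here that the numbers agree exactly with `robCompositions::aDist`; it is only an illustration of the quantity.

```r
# Toy count matrix: rows are samples, columns are features (made-up values).
counts <- matrix(
  c(10, 0,  5, 85,
    20, 3,  7, 70,
     1, 9, 40, 50),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("S", 1:3), paste0("OTU_", 1:4))
)

# Pseudocount of 1 to avoid log(0), as in the chunk above.
x <- counts + 1

# CLR transform: log of each part minus the per-sample mean of the logs.
clr <- log(x) - rowMeans(log(x))

# The Aitchison distance is the Euclidean distance between CLR vectors.
aitchison <- dist(clr, method = "euclidean")
aitchison

# Scale invariance: multiplying one sample by a constant changes nothing.
x.scaled <- x
x.scaled["S1", ] <- x.scaled["S1", ] * 10
clr.scaled <- log(x.scaled) - rowMeans(log(x.scaled))
all.equal(as.vector(aitchison), as.vector(dist(clr.scaled)))
```

Any `dist` object computed in this way, with sample names as labels, has the same form as the `Aitchison.dist` object used in the chunk above.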

### Integrating the results of other DAA tools

If you want to integrate the results of both differential OTUs and differential taxa, you can provide a data frame that contains a column of OTU and taxa names, a column with the enriched group, and other statistics such as the p-value or FDR, and then use *left_join* to integrate them into the *taxatree* slot of the *MPSE* object.

(ref:ZiCoSeqRun1) **Integrating the differential analysis results (differential OTUs and taxa) of ZicoSeq and visualizing them using mp_plot_diff_cladogram**

(ref:ZiCoSeqRun2) **Integrating the differential analysis results (differential OTUs and taxa) of ZicoSeq and visualizing them using mp_plot_diff_manhattan**

```{r, warning = FALSE, message = FALSE, fig.width = 10, fig.height = 10, fig.align = "center", fig.cap='(ref:ZiCoSeqRun1)', ZiCoSeqRun1}
library(GUniFrac)
library(matrixStats)
# obtain the abundance on all taxonomy levels
all.abun <- mpse2 %>% mp_extract_abundance() %>%
  dplyr::select(label, RareAbundanceBySample)

# build the matrix input (longer format to wider format) of ZicoSeq
All.features <- all.abun %>%
  tidyr::unnest(RareAbundanceBySample) %>%
  dplyr::select(label, Sample, RelRareAbundanceBySample) %>%
  tidyr::pivot_wider(
    id_cols = label,
    names_from = 'Sample',
    values_from = RelRareAbundanceBySample
  ) %>%
  tibble::column_to_rownames(var='label') %>% as.matrix()

All.features <- All.features[!rowSds(All.features)==0, ]

sample.da <- mpse2 %>%
             mp_extract_sample() %>%
             dplyr::select(Sample, Group) %>%
             tibble::column_to_rownames(var='Sample')
set.seed(123)
zicoseq.res <- ZicoSeq(meta.dat=sample.da, feature.dat=All.features/100, grp.name='Group',
                      prev.filter=.1, perm.no=999, feature.dat.type='proportion', verbose=F)

res.df <- data.frame(zicoseq.res$p.adj.fdr) 
colnames(res.df) <- 'FDR.zicoseq'

# build the enrich group information of the significant features.
res.sign <- all.abun %>% dplyr::filter(label %in% rownames(res.df[res.df$FDR.zicoseq <=0.05,,drop=FALSE])) %>%
    tidyr::unnest(RareAbundanceBySample) %>%
    dplyr::group_by(label, Group) %>%
    dplyr::summarize(MeanAbu=mean(RareAbundance)) %>%
    dplyr::slice_max(MeanAbu) %>%
    dplyr::ungroup() %>%
    dplyr::rename(Sign_Group=Group) %>%
    dplyr::select(label, Sign_Group)

res.df %<>% as_tibble(rownames='label') %>% dplyr::left_join(res.sign)

# remove the results of other DAA methods.
taxa.tree <- mpse2 %>% mp_extract_taxatree() %>%
  dplyr::select(-c('LDAupper', 'LDAmean', 'LDAlower', 'pvalue', 'fdr', 'Sign_Group'), keep.td=T)

# add the results of ZicoSeq to taxatree slot in MPSE
taxa.tree %<>% dplyr::left_join(res.df, by='label')
mpse4 <- mpse2
taxatree(mpse4) <- taxa.tree
zicoseq.p1 <- mpse4 %>% mp_plot_diff_cladogram(
                .group = Sign_Group,
                .size = FDR.zicoseq,
                removeUnknown = T,
                as.tiplab = F
             ) +
             scale_fill_diff_cladogram(values=cols)
zicoseq.p1
```

```{r, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 4, fig.align = "center", fig.cap='(ref:ZiCoSeqRun2)', ZiCoSeqRun2}
zicoseq.p2 <- mpse4 %>% mp_plot_diff_manhattan(
                 .group = Sign_Group,
                 .y=-log10(FDR.zicoseq), 
                 taxa.class = OTU, 
                 anno.taxa.class=Phylum) +
              scale_shape_manual(
                 values = c(17, 25, 19)
              )
zicoseq.p2
```
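
The step that assigns an enriched group to each significant feature (the group with the largest mean abundance) can be sketched in base R as follows. The table and its values are made up, and the column names only mimic those used above.

```r
# Hypothetical long-format abundance table (values are made up).
abun <- data.frame(
  label     = rep(c("g__A", "g__B"), each = 4),
  Group     = rep(c("CD", "CD", "Control", "Control"), times = 2),
  Abundance = c(12, 8, 2, 4,   1, 3, 9, 7)
)

# Mean abundance of every feature within every group.
mean.abun <- aggregate(Abundance ~ label + Group, data = abun, FUN = mean)

# For each feature keep the group with the largest mean abundance.
by.label <- split(mean.abun, mean.abun$label)
sign.group <- do.call(rbind, lapply(by.label, function(d) {
  d[which.max(d$Abundance), c("label", "Group")]
}))
names(sign.group)[2] <- "Sign_Group"
rownames(sign.group) <- NULL
sign.group
```

Note that `which.max()` keeps only the first group when means are tied, whereas `dplyr::slice_max()` keeps all tied rows by default.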

If only the results of differential OTUs need to be integrated, you can use *left_join* to integrate them into the *MPSE* object directly.

(ref:LinDARun1) **Integrating the differential analysis results (differential OTUs only) of LinDA and visualizing them using mp_plot_diff_res**

(ref:LinDARun2) **Integrating the differential analysis results (differential OTUs only) of LinDA and visualizing them using mp_plot_diff_boxplot**

```{r, warning = FALSE, message = FALSE, fig.width = 10, fig.height = 10, fig.align = "center", fig.cap='(ref:LinDARun1)', LinDARun1}
ps2 <- mpse2 %>% as.phyloseq(.abundance=RareAbundance)
library(MicrobiomeStat)
library(dplyr)
res.linda <- linda(phyloseq.obj=ps2, formula='~Group', p.adj.method='fdr', prev.filter=.1)

tbl.res <- res.linda$output$GroupControl

tbl.res %<>% tibble::as_tibble(rownames='OTU') %>%
            dplyr::mutate(Sign_Group = case_when(
                            log2FoldChange < 0 & reject ~ "CD",
                            log2FoldChange > 0 & reject ~ 'Control',
                            TRUE ~ as.character(NA))
            )


mpse5 <- ps2 %>%
         as.mpse() %>%
         mp_rrarefy() %>%
         mp_cal_abundance(.abundance=RareAbundance)
mpse5 %<>% left_join(tbl.res, by='OTU')

# visualizing the results with mp_plot_diff_boxplot
# and mp_plot_diff_res

linda.p1 <- mpse5 %>%
      mp_plot_diff_res(
        .group = Sign_Group,
        point.size = padj, 
        barplot.x = lfcSE
      ) + 
      scale_fill_manual(
        aesthetics = "fill_new",
        values = cols
      ) +
      scale_fill_manual(
        values = cols
      )
linda.p1
```

```{r, warning = FALSE, message = FALSE, fig.width = 5, fig.height = 6, fig.align = "center", fig.cap='(ref:LinDARun2)', LinDARun2}
linda.p2 <- mpse5 %>%
      mp_plot_diff_boxplot(
          .group = Sign_Group,
          .size=-log10(padj),
          point.x = lfcSE
      ) %>%
      set_diff_boxplot_color(
          values = cols,
          guide = guide_legend(title=NULL)
      )
linda.p2
```
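
The labelling rule used for `Sign_Group` above can be written without `case_when` as in the following base R sketch. The table is hypothetical and only mimics the `log2FoldChange` and `reject` columns used in the chunk above.

```r
# Hypothetical result table mimicking the columns used above (made-up values).
tbl <- data.frame(
  OTU            = paste0("OTU_", 1:4),
  log2FoldChange = c(-2.1, 1.4, 0.3, -0.2),
  reject         = c(TRUE, TRUE, FALSE, FALSE)
)

# The coefficient is 'GroupControl', i.e. Control relative to the reference
# level CD, so a negative log2 fold change means higher abundance in CD.
tbl$Sign_Group <- NA_character_
tbl$Sign_Group[tbl$reject & tbl$log2FoldChange < 0] <- "CD"
tbl$Sign_Group[tbl$reject & tbl$log2FoldChange > 0] <- "Control"
tbl
```

Features that are not rejected keep `NA`, so they are not assigned to either group.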


\newpage

# The analysis of other published pediatric CD stool samples {#chapter3}

In the previous section, we described how to use *MicrobiotaProcess* to analyze 16S rDNA data. It can also be applied to species community data and functional data from metagenomes or metatranscriptomes. In this section, we used the datasets of another published study of the stool microbiome in pediatric CD [@Douglas2018CD] to show how to perform the related analyses with *MicrobiotaProcess*. The datasets were obtained from GitHub^[<https://github.com/LangilleLab/CD_RF_microbiome>]. To avoid duplication, we only show how to import the 16S dataset and focus on the analysis of the metagenomics and KEGG gene datasets.

## The parsing of the 16S data and construction of the MPSE class {#chapter3.1}

This section is similar to section \ref{chapter2}, which can be consulted for the details of the operations.

```{r}
cols <- c('#fcc751ff', '#00c7bfff')
cols2 <- c("deepskyblue", "yellow", "#FF9933")
sample.da <- read.table("./data/CD_RF_microbiome/biscuit_metadata.txt", header=TRUE, check.names=FALSE, sep="\t")
sample.da %<>% dplyr::select(1:5)
biom <- biomformat::read_biom("./data/CD_RF_microbiome/otu_table_w_tax_BISCUIT.biom")
mpse16s <- biom %>% as.MPSE 
mpse16s
mpse16s %<>% dplyr::left_join(sample.da, by=c("Sample"="sample_id"))
mpse16s
```

## Functional characterization using the KEGG dataset {#chapter3.2}

The KEGG gene abundances were annotated based on the MGS data. They can also be imported as an *MPSE* object and further analyzed using *MicrobiotaProcess*. Here, we only show how to identify the differential genes using *mp_diff_analysis* of *MicrobiotaProcess* (refer to section \ref{chapter2.6}). Other operations are similar to the analysis of the 16S rDNA data (refer to section \ref{chapter2}).

```{r}
KO.da <- read.table("./data/CD_RF_microbiome/biscuit_mgs_KOs.tsv", 
            header=TRUE, sep = "\t", row.names=1, check.names=F)
# Building the MPSE object.
mpseKO <- MPSE(assays=list(Abundance = KO.da))
# merge the sample metadata information.
mpseKO %<>% left_join(sample.da, by=c("Sample"="sample_id"))
```

### Differential analysis of KEGG genes abundance {#chapter3.2.1}

The KEGG gene abundances are relative abundances. Here we used *mp_diff_analysis* with `force = TRUE` and `relative = FALSE` to identify the differential KEGG genes, meaning that the relative abundance is used directly.

```{r, message = FALSE, warning = FALSE}
mpseKO %<>% mp_diff_analysis(
     .abundance = Abundance,
     force = TRUE,
     relative = FALSE,
     .group = disease,
     filter.p = "pvalue"
   )
```
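
As a rough illustration of testing non-count abundances directly, the sketch below applies a Kruskal-Wallis test to each feature of a made-up relative abundance matrix and adjusts the p-values. It assumes that the first step of *mp_diff_analysis* is a per-feature Kruskal-Wallis test (we believe this is the default, but it should be checked against the package documentation). The later steps of the function are not reproduced, and the feature names and values are placeholders.

```r
set.seed(1)
# Made-up relative abundances: 3 features x 10 samples.
grp <- factor(rep(c("CD", "CN"), each = 5))
abun <- rbind(
  K00001 = c(runif(5, 5, 10), runif(5, 0, 2)),  # clearly higher in CD
  K00002 = runif(10, 0, 5),                     # no group structure
  K00003 = c(runif(5, 0, 1), runif(5, 3, 6))    # clearly higher in CN
)

# Kruskal-Wallis test per feature, applied to the values as given
# (no rarefaction, no conversion to relative abundance).
pvalue <- apply(abun, 1, function(v) kruskal.test(v, grp)$p.value)
fdr <- p.adjust(pvalue, method = "fdr")
data.frame(pvalue, fdr)
```

Because the test is rank based, it does not require integer counts, which is why relative abundances can be passed through unchanged.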

Then we can perform the KEGG pathway enrichment analysis using clusterProfiler [@clusterProfiler] and MicrobiomeProfiler [@MicrobiomeProfiler] developed by our team. 

```{r, message = FALSE, warning = FALSE, fig.show='hide'}
# perform KEGG pathway analysis with clusterProfiler and MicrobiomeProfiler
com.xx <- mpseKO %>%
    mp_extract_feature() %>% # Extracting the feature metadata information
    dplyr::filter(!is.na(Sign_disease)) %>% # Extracting the differential features
    compareCluster(OTU~Sign_disease, data=., fun=enrichKO)
# visualizing the enriched pathway with dotplot
p.dot <- dotplot(com.xx) + 
         scale_color_gradientn(
           colours = c("#b3eebe", "#46bac2", "#371ea3"),
           guide = guide_colorbar(reverse=TRUE, order=1)
         ) +
         labs(x = NULL) +
         guides(size = guide_legend(override.aes=list(shape=1))) +
         theme(
           panel.grid.major.y = element_line(linetype='dotted', color='#808080'),
           panel.grid.major.x = element_blank()
         )
# with network plot
set.seed(1024)
p.net <- cnetplot(
           com.xx,
           layout = "fr",
           cex_label_category = 1.8
         ) +
         scale_fill_manual(
           values = cols
         )
p <- aplot::plot_list(p.net, p.dot, widths = c(3, 1), tag_levels="A")
p
```
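
Enrichment functions of this kind are, as far as we understand, over-representation tests based on the hypergeometric distribution. The following base R sketch shows the calculation for a single pathway. All numbers are made up, and the sketch is not the implementation used by `enrichKO`.

```r
# Made-up numbers, for illustration only.
N <- 5000  # annotated background KOs
M <- 120   # background KOs assigned to the pathway of interest
n <- 200   # differential KOs enriched in one group
k <- 15    # differential KOs that fall in the pathway

# P(X >= k) when drawing n KOs without replacement from the background.
p.hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2 x 2 table.
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p.fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p.hyper, fisher = p.fisher)
```

In a real analysis this p-value is calculated for every pathway and then adjusted for multiple testing.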

(ref:enrichKEGG) **The result of KEGG pathway enrichment analysis**

```{r, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 12, fig.height=6, fig.align = "center", fig.cap = '(ref:enrichKEGG)', enrichKEGG}
p
ggsave("./Figures/KO.svg", p, width=14, height=7, device = "svg")
saveRDS(mpseKO, "./data/CD_RF_microbiome/mpse_KO.rds")
```

\newpage

The KEGG enrichment results showed that the CD stool group was significantly enriched in the pathways Biosynthesis of amino acids, Glycine, serine and threonine metabolism, and Pyruvate metabolism (Fig. \ref{fig:enrichKEGG}). This result was not reported in the original paper [@Douglas2018CD], but it is consistent with several other recent studies, which found that Crohn’s disease microbiomes have an increased potential for amino acid synthesis and pyruvate metabolism [@Heinken2021IBD; @Bjerrum2017; @Polunin2013]. In addition, we used several other differential abundance methods to identify the differential KEGG genes, but the two pathways were not found simultaneously in the CD enrichment results based on the differential genes identified by those methods (refer to the second section of supplemental file B). We think this is because *mp_diff_analysis* of *MicrobiotaProcess* better controls the false positive rate (refer to the third section of supplemental file B).

\newpage

## The species characterization of the metagenomics data {#chapter3.3}

The taxa abundance data from a metagenomics study can also be analyzed by *MicrobiotaProcess*. Here we used the output of *MetaPhlAn* [@MetaPhlAn] as example data to show how to perform the related analysis. The taxa abundance output of other tools can also be imported, converted to an *MPSE* object and further analyzed by *MicrobiotaProcess*; refer to section \ref{chapter3.2} and section \ref{chapter4}.

```{r}
# This is the output of MetaPhlAn2. The 'linenum' argument might need to be specified,
# depending on whether the first several rows contain metadata information.
mpseMGS <- mp_import_metaphlan("./data/CD_RF_microbiome/metaphlan2_out_merged_species.tsv", linenum=1)
# rename the column names of MPSE.
colnames(mpseMGS) <- mpseMGS %>% mp_extract_sample %>% pull(2)
mpseMGS %<>% left_join(sample.da, by=c("Sample"="sample_id"))
mpseMGS
```

### Alpha diversity analysis in MGS (metagenomics sequencing) level {#chapter3.3.1}

The abundance in metagenomics data is usually relative abundance, but some functions of `MicrobiotaProcess` require count data by default. To process relative abundance (non-integer values), we can specify `force = TRUE`, which means that the corresponding function is calculated directly on the given abundance without rarefaction.

(ref:MGSalpha) **The alpha diversity boxplot based on MGS data**

```{r, fig.width = 5, fig.height = 4, fig.align = 'center', fig.cap="(ref:MGSalpha)", message = FALSE, MGSalpha}
mpseMGS %<>% mp_cal_alpha(
      .abundance = Abundance,
      force = TRUE
    )
p <- mpseMGS %>% mp_plot_alpha(
       .group = disease,
       .alpha = c(Observe, Shannon, Pielou)
     ) +
     scale_color_manual(values = cols) +
     scale_fill_manual(values = cols) +
     theme(legend.position = "none")
p
```
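
The three indices plotted above depend only on which species are present and on their proportions, so they can be calculated from relative abundance. The following base R sketch uses the standard definitions with the natural logarithm (an assumption about the log base) and made-up values. Estimators that rely on singleton counts, such as Chao1, cannot be obtained in this way.

```r
# Made-up relative abundances (%) of five species in two samples.
rel <- rbind(
  S1 = c(25, 25, 25, 25, 0),
  S2 = c(70, 20, 5, 4, 1)
)

alpha.index <- function(x) {
  p <- x[x > 0] / sum(x)            # proportions of the observed species
  observe <- length(p)              # observed richness
  shannon <- -sum(p * log(p))       # Shannon index (natural log)
  pielou <- shannon / log(observe)  # Pielou evenness
  c(Observe = observe, Shannon = shannon, Pielou = pielou)
}

alpha.res <- t(apply(rel, 1, alpha.index))
alpha.res
```

Sample `S1` has four equally abundant species, so its Shannon index equals `log(4)` and its Pielou evenness equals 1.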

\newpage

### Beta diversity analysis in MGS level {#chapter3.3.2}

We used *mp_cal_dist* to calculate the distance between the samples and *mp_plot_dist* to display the distances as a heatmap (Fig.\ref{fig:MGSBeta}.A) and a boxplot (Fig.\ref{fig:MGSBeta}.B). The distance was then used to perform the PCoA (Fig.\ref{fig:MGSBeta}.C).

(ref:MGSBeta) **The distance heatmap and boxplot and the PCoA plot based on the MGS data**

```{r, echo=FALSE, fig.width = 8, fig.height = 10, fig.align = 'center', message=FALSE, warning=FALSE, fig.cap="(ref:MGSBeta)", MGSBeta}
mpseMGS %<>% mp_decostand(
      .abundance = Abundance,
      method = "hellinger"
    )
mpseMGS %<>% mp_cal_dist(
      .abundance = hellinger,
      distmethod = "bray"
    )
mpseMGS %<>% mp_cal_pcoa(
      .abundance = hellinger,
      distmethod = "bray"
    )
p1 <- mpseMGS %>% mp_plot_dist(
        .distmethod = bray,
        .group=c(disease, response)
      ) %>%
      set_scale_theme(
        x = scale_fill_manual(
              values = cols,
              guide = guide_legend(
                          keywidth = 0.5,
                          keyheight = 0.5,
                          label.theme=element_text(size=6)
                 )              
            ),
        aes_var = disease
      ) %>%
      set_scale_theme(
        x = scale_fill_manual(
              values=cols2,
              guide = guide_legend(
                          keywidth = 0.5,
                          keyheight = 0.5,
                          label.theme=element_text(size=6)
                 )              
            ),
        aes_var = response
      ) %>%
      set_scale_theme(
        x = scale_size_continuous(
              range = c(1, 3),
              guide = guide_legend(
                          keywidth = 0.5,
                          keyheight = 0.5,
                          label.theme=element_text(size=6)
                 )              
            ),
        aes_var = bray
      )  
p2 <- mpseMGS %>% mp_plot_dist(
        .distmethod = bray,
        .group = disease,
        group.test = TRUE
      ) +
      scale_color_manual(
        values = c("orange", "#00A08A", "deepskyblue")
      ) +
      scale_fill_manual(
        values = c("orange", "#00A08A", "deepskyblue")
      )
p3 <- mpseMGS %>% mp_plot_ord(
        .ord = pcoa,
        .group = disease,
        .size = Observe,
        .starshape = response,
        show.side = FALSE
      ) +
      scale_starshape_manual(values = c(1, 13, 15)) +
      scale_fill_manual(
        values=cols,
        guide=guide_legend(
          keywidth = 0.3,
          keyheight = 0.3,
          label.theme = element_text(size = 6),
          override.aes = list(size = 2, starshape = 15)
        )
      ) +
      scale_size_continuous(
        range = c(1, 3),
        guide = guide_legend(
          keywidth = 0.3,
          keyheight = 0.3,
          label.theme = element_text(size = 6),
          override.aes = list(starshape = 15)
        )
      )
design <- "\n111\n111\n111\n233\n233\n"
pp <- aplot::plot_list(p1, p2, p3, design = design, tag_levels = "A")
pp
```
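
The three steps used above (Hellinger transformation, Bray-Curtis dissimilarity and PCoA) can be sketched in base R with the standard definitions. The abundance values are made up, and we have not verified that the package functions return exactly the same numbers as this sketch.

```r
# Made-up relative abundances: 4 samples x 3 taxa.
x <- rbind(
  S1 = c(60, 30, 10),
  S2 = c(55, 35, 10),
  S3 = c(10, 20, 70),
  S4 = c(5, 25, 70)
)

# Hellinger transformation: square root of the per-sample proportions.
hel <- sqrt(x / rowSums(x))

# Bray-Curtis dissimilarity: sum|a - b| / sum(a + b) for every sample pair.
bray <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / sum(m[i, ] + m[j, ])
    }
  }
  as.dist(d)
}
bc <- bray(hel)

# PCoA is classical multidimensional scaling of the distance matrix.
pcoa <- cmdscale(bc, k = 2, eig = TRUE)
round(pcoa$points, 3)
```

In this toy example the first axis separates samples `S1` and `S2` from `S3` and `S4`, which is the pattern a PCoA plot is meant to reveal.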

Then we used *mp_adonis* to perform the Permutational Multivariate Analysis of Variance based on the distance. 

```{r}
mpseMGS %<>% mp_adonis(
       .abundance = Abundance,
       .formula = ~ disease + response,
       distmethod = "bray",
       permutation = 9999,
       action = "add"
     ) 
# the result can be extracted with mp_extract_internal_attr
mpseMGS %>% mp_extract_internal_attr(name = adonis) %>% mp_fortify()
```
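
The idea behind the permutational test can be sketched in base R for the simplest case of a single grouping factor. This is not the implementation used by *mp_adonis* (the model above has two terms), and the data are simulated. It only shows how a pseudo-F statistic is calculated from a distance matrix and compared with its permutation distribution.

```r
set.seed(42)
# Simulated data: 12 samples in two groups, 3 features.
grp <- factor(rep(c("CD", "CN"), each = 6))
x <- rbind(
  matrix(rnorm(18, mean = 0), nrow = 6),
  matrix(rnorm(18, mean = 3), nrow = 6)
)
d <- as.matrix(dist(x))

# Pseudo-F statistic computed from squared distances (one-factor case).
pseudo.f <- function(d, g) {
  n <- nrow(d)
  a <- nlevels(g)
  ss.total <- sum(d[upper.tri(d)]^2) / n
  ss.within <- sum(vapply(levels(g), function(l) {
    idx <- which(g == l)
    sub <- d[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]^2) / length(idx)
  }, numeric(1)))
  ss.among <- ss.total - ss.within
  (ss.among / (a - 1)) / (ss.within / (n - a))
}

f.obs <- pseudo.f(d, grp)
# Null distribution: shuffle the group labels and recompute the statistic.
f.perm <- replicate(999, pseudo.f(d, sample(grp)))
p.value <- (sum(f.perm >= f.obs) + 1) / (999 + 1)
c(F = f.obs, p = p.value)
```

The p-value is the proportion of permutations, counting the observed one, that give a pseudo-F at least as large as the observed value.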

### Differential analysis at the MGS level {#chapter3.3.3}

Here, we also used *mp_diff_analysis* to detect the differential taxa. We again specified `force = TRUE` and `relative = FALSE`, meaning that the abundance given by `.abundance` is used directly, without rarefaction or conversion to relative abundance (Fig.\ref{fig:DiffMGS}).

(ref:DiffMGS) **The result of differential analysis based on the MGS data**

```{r, message = FALSE, warning = FALSE, fig.width = 10, fig.height = 10, fig.align = 'center', fig.cap = "(ref:DiffMGS)", DiffMGS}
mpseMGS %<>%
    mp_diff_analysis(
       .abundance = Abundance,
       force = TRUE,
       relative = FALSE,
       .group = disease,
       filter.p = "pvalue"
    )
library(forcats)
trda <- mpseMGS %>% mp_extract_tree()
p <- ggtree(trda, layout = 'radial') +
     geom_tiplab(size = 1.8, offset = 11) +
     geom_hilight(
         data = td_filter(nodeClass == 'Phylum'),
         mapping = aes(
           node = node,
           fill = label
         )
     )
p2 <- p +
      ggnewscale::new_scale_fill() +
      geom_fruit(
         data = td_unnest(AbundanceBySample, names_repair=tidyr::tidyr_legacy),
         geom = geom_star,
         mapping = aes(
            x = fct_reorder(Sample, disease, .fun=min),
            size = Abundance,
            fill = disease,
            subset = Abundance > 0
         ),
         starshape = 13,
         offset = 0.02,
         pwidth = 1,
         grid.params = list(linetype=2)
      ) +
      scale_size_continuous(name="Relative Abundance (%)",range = c(1, 3)) +
      scale_fill_manual(values = cols)
p3 <- p2 +
      ggnewscale::new_scale("fill") +
      geom_fruit(
         geom = geom_col,
         mapping = aes(
                       x = LDAmean,
                       fill = Sign_disease,
                       subset = !is.na(LDAmean)
                       ),
         orientation = "y",
         offset = .05,
         pwidth = 0.5,
         width = 0.5, # the parameter of geom_col
         axis.params = list(axis = "x",
                            title = "Log10(LDA)",
                            title.height = 0.001,
                            title.size = 2,
                            text.size = 1.8,
                            vjust = 1),
         grid.params = list(linetype = 1)
      ) + 
      ggnewscale::new_scale("size") +
      geom_point(
         data=td_filter(!is.na(Sign_disease)),
         mapping = aes(size = -log10(pvalue),
                       fill = Sign_disease
                   ),
         shape = 21
      ) +
      scale_size_continuous(range=c(0.5, 3)) +
      scale_fill_manual(values=cols) +
      theme(
           legend.key.height = unit(0.3, "cm"),
           legend.key.width = unit(0.3, "cm"),
           legend.spacing.y = unit(0.02, "cm"),
           legend.text = element_text(size = 7),
           legend.title = element_text(size = 9),
      ) 
p3
```

\newpage

Next, we extracted the abundance of the differential species and visualized it using ggplot2 [@ggplot2] (Fig.\ref{fig:DiffBoxMGS}).

(ref:DiffBoxMGS) **The abundance boxplot of the differential species between the CD and control group**

```{r, fig.width=7, fig.height=5, fig.align = "center", fig.cap = "(ref:DiffBoxMGS)", DiffBoxMGS}
deT <- mpseMGS %>% mp_extract_tree() %>% dplyr::filter(!is.na(Sign_disease) & isTip, keep.td=F) %>% dplyr::pull(label)
mpseMGS %>% 
    mp_extract_abundance(taxa.class="OTU") %>% 
    dplyr::filter(label %in% deT) %>% 
    tidyr::unnest(AbundanceBySample) %>% 
    ggplot(mapping=aes(x=disease, y=Abundance, fill=disease)) + 
    geom_boxplot() + 
    facet_wrap(facets = vars(label), nrow = 1, scales = "free", strip.position = "right") + 
    ggsignif::geom_signif(comparisons=list(c("CD", "CN"))) +
    scale_fill_manual(values=cols, guide="none") +
    labs(x=NULL, y="relative abundance (%)")
```

```{r echo = FALSE}
saveRDS(mpseMGS, "./TmpRef/mpse_MGS.rds")
```

\newpage

# The analysis of the mosquito ecology data using MicrobiotaProcess {#chapter4}

Besides microbial community data, *MicrobiotaProcess* can also be used to analyze other ecological data. Here, we used the data of a mosquito ecology study [@mosquito] as an example to show how to analyze this kind of data with *MicrobiotaProcess*. The data was obtained from GitHub^[<https://github.com/rgriff23/Mosquito_ecology>].

## Loading data and Construction of MPSE object {#chapter4.1}

The first column of the file is read as the row names (the sample names). The next 13 columns are the sample metadata, including the study site and habitat, and the remaining columns are the abundance of each mosquito species in each sample.

```{r}
data <- read.csv("./data/Mosquito_ecology/data.csv", row.names=1)
abun.d <- data[, 14:36]
sample.d <- data[, 1:13]
# The `MPSE` function builds the `MPSE` object; it requires a matrix-like abundance table.
mpse <- MPSE(assays=list(Abundance=t(abun.d)), colData=sample.d)
mpse
```
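
The orientation of the two inputs is the important point of this step: the abundance table has features in rows and samples in columns, and the sample metadata has one row per sample. The following base R sketch shows this with a made-up table (all names and values are hypothetical).

```r
# Made-up table: one row per sample, metadata first, species counts after.
dat <- data.frame(
  Site     = c("A", "B", "C"),
  Habitat  = c("Field", "Edge", "Forest"),
  Species1 = c(3, 0, 7),
  Species2 = c(0, 2, 5),
  row.names = c("T1", "T2", "T3")
)

sample.meta <- dat[, 1:2]
abundance <- t(as.matrix(dat[, 3:4]))  # features in rows, samples in columns

dim(abundance)
# Sample names must line up between the two pieces.
identical(colnames(abundance), rownames(sample.meta))
```

This is why the abundance table is transposed with `t()` before it is passed to `MPSE` in the chunk above.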

## Alpha diversity analysis of the Mosquito ecology study {#chapter4.2}

`MicrobiotaProcess` provides methods for some `dplyr` verbs, which allow users to explore the `MPSE` object effectively and to develop reproducible and human-readable pipelines.

(ref:MosquitoaAlpha) **The raincloud plot of the alpha diversity of the Mosquito ecology community.** The alpha diversity analysis of the Mosquito ecology study showed that the Mosquito species richness gradually increases from field to forest (field \-\-> near field \-\-> edge \-\-> near forest \-\-> forest).

```{r, message = FALSE}
cols = c("lightgoldenrod1", "orange","chartreuse2", "chartreuse4", "darkgreen")
# Adjusting the order of Habitat
mpse %<>% 
   dplyr::mutate(
     Habitat = factor(
       Habitat, 
       levels = c("Field", "NearField", "Edge", "NearForest", "Forest")
    )
   )
mpse
# force=TRUE meaning the Abundance will be used to calculate the alpha index without rarefaction
mpse %<>% mp_cal_alpha(.abundance=Abundance, force=TRUE)
# test the relationship between the Observe Species and Habitat or Shannon and Habitat.
tb1 <- mpse %>% mp_extract_sample() %>% lm(formula=Observe ~ Habitat, data=.) %>% anova() %>% broom::tidy()
tb2 <- mpse %>% mp_extract_sample() %>% lm(formula=Shannon ~ Habitat, data=.) %>% anova() %>% broom::tidy()
```

The ANOVA result revealed that the richness of the mosquito species was significantly associated with the **habitat**. The result was then visualized by *mp_plot_alpha* (Fig.\ref{fig:MosquitoaAlpha}).

```{r, message = FALSE, warning = FALSE, fig.width = 5.2, fig.height = 4.6, fig.cap = '(ref:MosquitoaAlpha)', MosquitoaAlpha}
p.alpha <- mpse %>%
     mp_plot_alpha(.group = Habitat, .alpha = c(Observe, Shannon), test = NULL) +
     scale_fill_manual(values = cols) +
     scale_color_manual(values = cols) +
     theme(legend.position = "none")
library(ggpp)
# building the table layer
tb1 %<>% dplyr::slice(1) %>% select(statistic, p.value) %>% round(3)
tb2 %<>% dplyr::slice(1) %>% select(statistic, p.value) %>% round(3)
df <- tibble(npcx=c(0.9, 0.9), npcy=c(0.05, 0.05), tb=list(tb1, tb2), Measure=c("Observe", "Shannon"))

p.alpha <- p.alpha + 
           geom_table_npc(
             data = df, 
             mapping = aes(
               npcx = npcx,
               npcy = npcy,
               label = tb
             ),
             table.theme = ttheme_gtminimal
           )
p.alpha
```
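
The ANOVA step used above can be sketched in base R without `broom`. The richness values below are simulated and do not come from the study; the sketch only shows where the F statistic and the p-value shown in the inset table come from.

```r
set.seed(7)
# Simulated richness values for three habitats (10 samples each).
sim <- data.frame(
  Habitat = factor(rep(c("Field", "Edge", "Forest"), each = 10),
                   levels = c("Field", "Edge", "Forest")),
  Observe = c(rpois(10, 4), rpois(10, 8), rpois(10, 12))
)

fit <- lm(Observe ~ Habitat, data = sim)
tab <- anova(fit)

# The first row holds the F statistic and p-value of the Habitat term.
round(c(statistic = tab[1, "F value"], p.value = tab[1, "Pr(>F)"]), 3)
```

The first row of the ANOVA table corresponds to the row kept by `dplyr::slice(1)` in the chunk above.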

## Beta Diversity Analysis of the Mosquito ecology study {#chapter4.3}

Here, we used CCA (constrained correspondence analysis) to test which environmental factors are related to the Mosquito species in the habitats (Fig.\ref{fig:MosquitoCCA}).

```{r, message = FALSE, warning = FALSE}
mpse %<>%
    mutate(NormAbun=sqrt(Abundance)/TrapNights) %>%
    mp_cal_cca(
       .abundance  = NormAbun,
       .formula = ~DeciduousForest+
           EvergreenForest+
           Grassland+
           MixedForest+
           ShrubScrub+
           Condition(
             BarrenLand+
             Building+
             Pavement+
             CultivatedCrops
           )
    )
mpse
```

The raw result of the pCCA was added to *internal_attr* and can be extracted by *mp_extract_internal_attr* with *name=cca*. A significance test can then be performed on it using the functions of *vegan* [@vegan], such as *anova.cca* and *permutest*.

```{r, message = FALSE, warning = FALSE}
# Extract the raw result of cca analysis
# And significance test with anova

mpse %>% 
    mp_extract_internal_attr(name=cca) %>%
    anova()
```

Further, we used *mp_envfit* to identify the environmental variables that were significantly associated with the mosquito communities.

```{r, message = FALSE, warning = FALSE}
# fits environmental vectors onto cca
mpse %<>% 
    mp_envfit(
       .ord = cca, 
       .env = c(
          DeciduousForest, 
          EvergreenForest, 
          Grassland, 
          MixedForest, 
          ShrubScrub 
        ),
       action = "add", 
       permutation = 9999
    )

# Extract the raw result of envfit analysis
mpse %>% mp_extract_internal_attr(name=cca_envfit) %>% mp_fortify() 
```

Then we used *mp_plot_ord* to visualize the result of pCCA (Fig.\ref{fig:MosquitoCCA}).

(ref:MosquitoCCA) **The CCA plot of the Mosquito ecology study (A) without the result of *mp_envfit* (B) with the result of *mp_envfit*.** Each point represents one sample; the size of a point represents the number of observed species in the sample, the color represents its habitat and the shape represents its Region. The arrows represent the environmental factors, and those marked with stars are significantly related to the Mosquito communities in the study (\* 0.05, \*\* 0.01, \*\*\* 0.001).

```{r, message = FALSE, warning = FALSE, fig.width=12, fig.height=4.6, fig.align = 'center', fig.cap = '(ref:MosquitoCCA)', MosquitoCCA}
# visualization only pCCA
f <- mpse %>%
     mp_plot_ord(
       .ord = cca,
       .group = Habitat,
       .size = Observe,
       .starshape = Region,
       show.side = FALSE,
       show.envfit = FALSE,
       colour = 'black',
       bg.colour = 'white'
     ) +
     scale_starshape_manual(values=c(1, 13, 15)) +
     scale_fill_manual(
        values = cols,
        guide = guide_legend(
          override.aes = list(starshape=15)
        )
     ) +
     scale_size_continuous(
       range = c(1, 3),
       guide = guide_legend(override.aes = list(starshape=15))
     ) +
     theme(
        legend.key.height = unit(0.3, "cm"),
        legend.key.width = unit(0.3, "cm"),
        legend.spacing.y = unit(0.02, "cm"),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 9),
     )
# visualization with envfit result
p <- mpse %>% 
     mp_plot_ord(
       .ord = cca, 
       .group = Habitat, 
       .size = Observe,
       .starshape = Region,
       show.side = FALSE, 
       show.envfit = TRUE,
       colour = "black",
       bg.colour = "white"
     ) +
     scale_starshape_manual(values=c(1, 13, 15)) +
     scale_fill_manual(
        values = cols, 
        guide = guide_legend(
          override.aes = list(starshape=15)
        )
     ) +
     scale_size_continuous(
       range = c(1, 3),
       guide = guide_legend(override.aes = list(starshape=15))
     ) + 
     theme(
        legend.key.height = unit(0.3, "cm"),
        legend.key.width = unit(0.3, "cm"),
        legend.spacing.y = unit(0.02, "cm"),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 9),              
     )
ff <- aplot::plot_list(f, p, tag_levels="A")
ff
```

## The distribution of Mosquito species in the study {#chapter4.4}

We used *mp_cal_abundance* and *mp_plot_abundance* to calculate and visualize the abundance of the Mosquito species in the study (Fig.\ref{fig:mosquitoHeatmap}).

(ref:mosquitoHeatmap) **The heatmap of the abundance (A) and relative abundance (B) of the Mosquito species.**

```{r, message = FALSE, warning = FALSE , fig.width = 14, fig.height=6, fig.cap = "(ref:mosquitoHeatmap)", mosquitoHeatmap}
cols2 <- c("deepskyblue", "yellow", "#FF9933")
# The theme and scale of fill of heatmap
Abund.char <- list(
           scale_fill_viridis_c(option = "H"),
           theme(
             axis.text.x = element_text(size = 6),
             axis.text.y = element_text(size = 8),
             legend.title = element_text(size = 7),
             legend.text = element_text(size = 5),
             legend.key.width = unit(0.3, "cm"),
             legend.key.height = unit(0.3, "cm")
           )
      )
# The theme and legend of annotate bar of 'Habitat' variable
Habitat.char <- list(
           scale_fill_manual(values = cols),
           theme(
             legend.key.height = unit(0.3, "cm"),
             legend.key.width = unit(0.3, "cm"),
             legend.spacing.y = unit(0.02, "cm"),
             legend.text = element_text(size = 7),
             legend.title = element_text(size = 9)
           )
      )
# The theme and legend of annotate bar of 'Region' variable
Region.char <- list(
           scale_fill_manual(values = cols2),
           theme(
             legend.key.height = unit(0.3, "cm"),
             legend.key.width = unit(0.3, "cm"),
             legend.spacing.y = unit(0.02, "cm"),
             legend.text = element_text(size = 7),
             legend.title = element_text(size = 9)
           )
      )
# visualization of the count abundance.
p.count <- mpse %>%
    mp_cal_abundance(
      .abundance = Abundance,
      force = TRUE,
      relative = FALSE
    ) %>%
    mp_plot_abundance(
      .abundance = Abundance,
      force = TRUE,
      relative = FALSE,
      geom = "heatmap",
      topn = "all",
      .group = c(Habitat, Region)
    ) %>%
    set_scale_theme(
      x = Abund.char,
      aes_var = Abundance
    ) %>%
    set_scale_theme(
      x = Habitat.char,
      aes_var = Habitat
    ) %>%
    set_scale_theme(
      x = Region.char,
      aes_var = Region
    )
# visualization of the relative abundance
p.rel <- mpse %>%
    mp_cal_abundance(
      .abundance = Abundance,
      force = TRUE,
      relative = TRUE
    ) %>%
    mp_plot_abundance(
      .abundance = Abundance,
      force = TRUE,
      relative = TRUE,
      geom = "heatmap",
      topn = "all",
      .group = c(Habitat, Region)
    ) %>%
    set_scale_theme(
      x = Abund.char,
      aes_var = RelAbundance
    ) %>%
    set_scale_theme(
      x = Habitat.char,
      aes_var = Habitat
    ) %>%
    set_scale_theme(
      x = Region.char,
      aes_var = Region
    )
ff <- aplot::plot_list(p.count, p.rel, tag_levels="A")
ff
```

We then used *mp_diff_analysis* to identify the species that differed significantly between the **field** and the **forest**. Cx.sal (*Culex salinarius*) and Ps.col (*Psorophora columbiae*) were significantly enriched in the **field**, whereas Ae.albo (*Aedes albopictus*), Ae.cin (*Aedes cinereus*), Ps.fer (*Psorophora ferox*), Ae.tris (*Aedes triseriatus*), Ae.can (*Aedes canadensis*), Ae.hen (*Aedes hendersoni*), Ae.atl (*Aedes atlanticus*) and Ae.dup (*Aedes dupreei*) were significantly enriched in the **forest**.

```{r}
mpse %>% 
    dplyr::filter(Habitat %in% c("Field", "Forest")) %>% 
    dplyr::mutate(Habitat = as.vector(Habitat)) %>%
    mp_diff_analysis(.abundance=Abundance, force=TRUE, relative=TRUE, .group=Habitat) %>%
    mp_extract_feature() %>% 
    dplyr::filter(fdr<=0.05 & !is.na(Sign_Habitat)) %>% 
    print(width=200)
```

```{r, echo = FALSE}
saveRDS(mpse, "./TmpRef/mosquito_mpse.rds")
p.alpha$layers[[2]]$aes_params$size <- 1
p.alpha$layers[[2]]$aes_params$stroke <- .1
pp <- aplot::plot_list(p.alpha, p, widths = c(2.5, 3), nrow = 1)
ggsave(filename="./Figures/Fig8.svg", pp, device="svg", width = 10, height = 5)
```

(ref:CompareWithOthers) **The comparison of features among the common tools developed for microbiome study**

```{r, echo=FALSE, fig.width=6.5, fig.height=4, fig.align='center', fig.cap='(ref:CompareWithOthers)', CompareWithOthers}
knitr::include_graphics('./supplemental_fileB_codes/compare_plot.pdf')
```

# METHODS {#chapter5}

## The MPSE class {#chapter5.1}

To better store the input data (abundance data, sequence data, and phylogenetic tree data) and the results of downstream analysis, the *MPSE* class was implemented in the *MicrobiotaProcess* package. This class inherits from the *SummarizedExperiment* [@SE] class. The *assays* slot was designed to store the rectangular abundance matrices of features (microbiota profiling or function profiling) from microbiome experiments. The *colData* slot was designed to store the metadata of the samples and the sample-level results generated in the downstream analysis. Compared to the *SummarizedExperiment* [@SE] class, *MPSE* introduces the following additional slots: 1) the *otutree* slot is a *treedata* object designed to store the phylogenetic tree and the associated data, including the feature-level results of the downstream analysis and the evolutionary statistics inferred by the software that built the tree; 2) the *taxatree* slot is also a *treedata* [@tidytree; @treeio] object, designed to store the hierarchical taxonomy relationships and the associated data, such as the relative abundances and the results of different analyses; 3) the *refseq* slot is an *XStringSet* [@Biostrings] object designed to store the reference sequences, with names corresponding to the rows of the *assays* slot. In addition, an internal attribute, *internal_attr* (a *list* object), was introduced to store the raw results of PCA (Principal Components Analysis), PCoA (Principal Coordinate Analysis), and hierarchical cluster analysis.

## Overview of the design of the MicrobiotaProcess package {#chapter5.2}

The overall design of the *MicrobiotaProcess* package is illustrated in **Figure 2**. It provides multiple parser functions to read the outputs of upstream analysis tools, such as qiime or qiime2 [@Qiime2019], dada2 [@DADA2016], and MetaPhlAn [@MetaPhlAn2]. After parsing, the abundance of microbiota (or other features), the metadata of the samples (optional), and the phylogenetic tree information (optional) are extracted from the outputs and stored as an *MPSE* object. Other objects designed to store microbiome data, such as phyloseq [@phyloseq], TreeSummarizedExperiment [@TSE], and SummarizedExperiment [@SE], can be converted to an *MPSE* object. This enables *MPSE* to serve as a standardized entry point for downstream analysis while remaining compatible with existing analysis software and pipelines. *MicrobiotaProcess* provides a wide variety of microbiome analysis procedures that work with *MPSE* objects. These procedures were designed to follow the tidy data principles and are therefore human-friendly, consistent, and composable for solving complicated problems. The results of an analysis can be returned in three modes via the *action* argument. If the action is ‘add’, the (intermediate) result is stored in the *MPSE* object. If the action is ‘only’, a tidy data frame is returned that contains the non-redundant sample or OTU (feature) information together with the result. If the action is ‘get’, only the analysis result is returned. *MicrobiotaProcess* also extends the *dplyr* package to offer dplyr-verbs for data operations.

## Parser functions and the MPSE constructor {#chapter5.3}

The *MicrobiotaProcess* package provides *mp_import_qiime2* to load the output of *qiime2* [@Qiime2019], which is a common tool for the analysis of amplicon data. The feature abundance table of the qiime2 output (i.e., a qza format file) is required, while the taxonomy information, phylogenetic tree, and representative sequences are optional. The *mp_import_dada2* function was designed to parse the output of dada2 [@DADA2016]. The output of *removeBimeraDenovo* of *dada2* is required, while the taxonomy information, phylogenetic tree, and representative sequences are optional. The *mp_import_metaphlan* function was designed to parse the output of *MetaPhlAn* [@MetaPhlAn2], which is a common tool for profiling the composition of microbial communities. The microbiota abundance output of *MetaPhlAn* [@MetaPhlAn2] is required, while the phylogenetic tree and metadata are optional. In addition, an *MPSE* object can be constructed from scratch by calling the *MPSE* function with the following key parameters:
1) *assays* (required): A list or *SimpleList* of matrix-like objects or a matrix-like object (rows represent the features and columns represent the samples) providing abundance data for all samples.
2) *colData* (optional): A *DataFrame* object storing the characteristics of the samples.
3) *otutree* (optional): A *treedata* object storing a phylogenetic tree with or without associated data. Any tree file format, as well as the commonly used software outputs that can be parsed by treeio [@treeio], is supported.
4) *taxatree* (optional): A *treedata* object storing the taxonomy information (or other hierarchical data). *MicrobiotaProcess* provides the *convert_to_treedata* function to convert taxonomy data (a data.frame object) to a treedata object.
5) *refseq* (optional): An *XStringSet* object storing the representative sequences. Both nucleic acid sequences and amino acid sequences are supported via the *readDNAStringSet* or *readAAStringSet* functions provided by the *Biostrings* package.

## Process MPSE object using dplyr-verbs {#chapter5.4}

To facilitate data manipulation and exploration of the microbiome data, *MicrobiotaProcess* defined a tidy-like formatted output for the *MPSE* object and extended a subset of the dplyr-verbs to support the *MPSE* object. The extended dplyr-verbs include:
1) *filter* function: subset an *MPSE* object, retaining the data that satisfy the provided conditions.
2) *select* function: select the data according to the provided variables.
3) *group_by* function: return a grouped tbl_df-like data frame according to the groups defined by the provided variables, so that subsequent data operations can be performed per group.
4) *mutate* function: create new columns according to the provided variables and conditions.
5) *left_join* function: add columns from ‘y’ (a data.frame object) to ‘x’ (an *MPSE* object) by matching all rows of ‘x’ based on the keys. The keys must be Sample, OTU, or both.
6) *rename* function: rename the columns of an *MPSE* object, except the OTU, Sample, and Abundance columns, which cannot be renamed.

## Data preprocessing {#chapter5.5}

The microbiome features (OTU or ASV) with very low abundance and rare occurrence (existing in few samples) are difficult to distinguish from sequencing errors or other experimental technical errors and are usually uninformative. Filtering these features can improve the statistical power of multiple testing in the downstream analysis. *MicrobiotaProcess* provides *mp_filter_taxa* to filter the features based on their abundance (the Abundance count by default) and sample prevalence. By default, this function screens out the features with zero counts at a sample prevalence of 0.05%, and users can reset the criteria via the *min.abun* (minimum abundance in a sample) and *min.prop* (minimum sample prevalence) parameters. To make the data more meaningful for the downstream analysis after filtering, *MicrobiotaProcess* provides the *mp_decostand* and *mp_rrarefy* functions for the standardization of community data by inheriting the *decostand* and *rrarefy* functions from vegan [@vegan]. The *mp_rrarefy* function allows users to estimate the expected diversity (e.g., taxonomic richness) for a reduced sampling size, while the *mp_decostand* function provides several standardization methods for community data, such as total (divide by the total abundance of each sample, i.e., relative abundance), max (divide by the maximum abundance of the feature across all samples), frequency (divide by the total abundance of each sample and multiply by the number of non-zero features), hellinger (square root of the result of total) [@Legendre2001], and log (logarithmic transformation, $\log_b(x) + 1$ for $x > 0$) [@PMID:16706913]. More importantly, we developed the *mp_balance_clade* method to convert the abundance of species to the balances of internal nodes using the geometric mean, mean, or median abundance of the offspring tips in the same clade of the phylogenetic tree.
This effectively converts the compositional microbiota data to an unconstrained coordinate system and may improve the identification of differential clades by accumulating the small consilient differences at a higher resolution on the phylogenetic tree [@gnesis2017; @Egozcue2003]. These functions were developed to follow the tidiness concept, and their results are added to the *assays* slot of the *MPSE* object automatically to enhance reproducibility and reuse in the follow-up analysis.
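
The filtering and standardization steps described above can be sketched in base R. This is a minimal illustration on an invented count matrix, not the implementation used by *mp_filter_taxa* or *mp_decostand*; the variable names (`min_abun`, `min_prop`, `b`) and the threshold values are assumptions chosen only for the example.

```r
# Toy count matrix: rows are features (OTUs), columns are samples.
# All values are invented for illustration.
counts <- matrix(
  c(10,  0,  5,  0,
     0,  0,  0,  1,
    30, 20, 15, 40,
     5, 10,  0,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), paste0("S", 1:4))
)

# Prevalence filter: keep features with more than `min_abun` counts
# in at least a fraction `min_prop` of the samples (hypothetical thresholds).
min_abun <- 0
min_prop <- 0.5
prevalence <- rowMeans(counts > min_abun)
kept <- counts[prevalence >= min_prop, , drop = FALSE]

# 'total' standardization: divide by the total abundance of each sample.
total <- sweep(kept, 2, colSums(kept), "/")

# 'hellinger' standardization: square root of the 'total' result.
hellinger <- sqrt(total)

# 'log' standardization with base b: log_b(x) + 1 for x > 0; zeros stay zero.
b <- 2
log_std <- kept
log_std[kept > 0] <- log(kept[kept > 0], base = b) + 1
```

After the 'total' standardization every sample sums to one, and after the 'hellinger' standardization the squared values of every sample sum to one, which is a quick sanity check for both transformations.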

## Alpha diversity {#chapter5.6}

Alpha diversity measures the species richness and evenness within a community. *MicrobiotaProcess* provides the *mp_cal_alpha* function to calculate the community diversity. Six commonly used alpha diversity indices are supported: Observe (the total number of species per sample), Chao1 and ACE (which estimate species richness by taking the low-abundance species into account), Pielou (which measures species evenness), and Shannon and Simpson (which take both the richness and the evenness of species into account). By default, *mp_cal_alpha* rarefies the abundance before calculating the alpha diversity. Users can set the *force* argument to *TRUE* to calculate the diversity directly without performing rarefaction. This is useful for taxonomic profiling data, since they are usually expressed as relative abundance and cannot be rarefied. By default, the results are added to the *colData* slot, which stores the sample metadata, and an updated *MPSE* object is returned. *MicrobiotaProcess* also provides the *mp_cal_pd_metric* function to calculate several phylogenetic community structure metrics, such as PD (Faith's Phylogenetic Diversity), NRI (Nearest Relative Index), NTI (Nearest Taxon Index), IAC (Relative deviation from the null expectation of phylogenetically balanced abundances), PAE (Phylogenetic evenness of the abundance distribution scaled by branch lengths), HAED (Entropic measure of the diversity of evolutionary distinctiveness among individuals), and EAED (Equitability of HAED). These metrics measure phylogenetic diversity while incorporating the species abundance of the community, which can help users to understand the impact of phylogenetic history on the corresponding microbiota ecological interactions [@PhylogeneticMetric2; @PhylogeneticMetric].
PAE, HAED, EAED, and IAC can be used to evaluate the phylogenetic structure of assemblage communities by incorporating the species abundance of the community, whereas NRI and NTI can be used to examine whether an observed assembly of communities is a phylogenetically biased subset of the species that could coexist in that assemblage [@PhylogeneticMetric2]. The *mp_plot_alpha* function is implemented to visualize the alpha diversity, and it allows the comparison of different communities specified via the *.group* parameter.
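
To make the richness and evenness indices concrete, the following base R sketch computes Observe, Shannon, Simpson, and Pielou for a single invented sample. It uses the textbook definitions (natural-log Shannon, Simpson as one minus the sum of squared proportions, Pielou as Shannon divided by the log of the observed richness), which is an assumption about the conventions; it does not call *mp_cal_alpha* and performs no rarefaction.

```r
# Counts of five species in one sample (invented values).
x <- c(sp1 = 40, sp2 = 30, sp3 = 20, sp4 = 10, sp5 = 0)

# Proportions of the species that were actually observed.
p <- x[x > 0] / sum(x)

observe <- sum(x > 0)               # observed richness
shannon <- -sum(p * log(p))         # Shannon index (natural log)
simpson <- 1 - sum(p^2)             # Simpson index (one minus sum of p^2)
pielou  <- shannon / log(observe)   # Pielou's evenness

alpha <- c(Observe = observe, Shannon = shannon,
           Simpson = simpson, Pielou = pielou)
print(round(alpha, 3))
```

Chao1 and ACE are omitted here because they additionally depend on the numbers of singletons, doubletons, and other rare species.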

## Taxonomy composition {#chapter5.7}

To compare the OTU (feature) composition of different communities, *MicrobiotaProcess* provides *mp_cal_upset* and *mp_cal_venn* to calculate the OTUs (features) that are shared by or specific to different groups (specified by the *.group* parameter). The results can be visualized by the *mp_plot_upset* and *mp_plot_venn* functions, respectively. The microbiome OTUs (features) are often annotated to different taxonomy levels in upstream analyses, and to survey the species profile of different samples, it is often necessary to calculate the abundances at different taxonomy levels. *MicrobiotaProcess* implements the *mp_cal_abundance* function to calculate the abundances at all taxonomy levels. Similar to the calculation of alpha diversity, *mp_cal_abundance* rarefies the raw abundance and then calculates the relative abundance by default. Users can set the *force* argument to *TRUE* to disable rarefaction. The *relative* argument controls whether the relative abundance is calculated (the total is 100% within the same taxonomy level). The results are added to the associated data of the *taxatree* (*treedata* object) slot by default. The abundance at a selected taxonomy level can be extracted by the *mp_extract_abundance* function with the *taxa.class* parameter specified. The results of *mp_cal_abundance* can be visualized by the *mp_plot_abundance* function.
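
The idea of summing OTU abundances to a taxonomy level and rescaling to 100% can be sketched in base R as follows. The count matrix and the OTU-to-genus mapping are invented, rarefaction is skipped, and the code illustrates the arithmetic only; it is not how *mp_cal_abundance* is implemented.

```r
# Toy count matrix: rows are OTUs, columns are samples (invented values).
counts <- matrix(
  c(10,  0,
    20,  5,
     5, 15,
    15, 30),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), c("S1", "S2"))
)

# Hypothetical taxonomy annotation: the genus of each OTU.
genus <- c(OTU1 = "GenusA", OTU2 = "GenusA", OTU3 = "GenusB", OTU4 = "GenusC")

# Sum the counts of all OTUs that belong to the same genus.
genus_counts <- rowsum(counts, group = genus[rownames(counts)])

# Relative abundance: each sample totals 100% at the genus level.
genus_rel <- sweep(genus_counts, 2, colSums(genus_counts), "/") * 100
print(genus_rel)
```

Repeating the aggregation with a different grouping vector (for example family or phylum) gives the abundance at that level, and each level again totals 100% per sample.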

## Beta diversity {#chapter5.8}

Beta diversity is used in a broad sense to measure variation or changes in community composition. It can assess how microbiota composition changes across spatial and temporal scales. Distance indices such as the Bray-Curtis index, the Jaccard index, and the UniFrac (weighted or unweighted) index are popular measures of the degree of community differentiation. These distances can be further subjected to ordination, which aims to capture the essential information in a lower-dimensional representation and is commonly used to visualize sample dissimilarities. *MicrobiotaProcess* implements the *mp_cal_dist* function to compute the common distances (dissimilarities) and provides the *mp_plot_dist* function to visualize the result. It also provides several commonly used ordination methods, such as Principal Components Analysis (PCA: *mp_cal_pca*), Principal Coordinate Analysis (PCoA: *mp_cal_pcoa*), Nonmetric Multidimensional Scaling (NMDS: *mp_cal_nmds*), Detrended Correspondence Analysis (DCA: *mp_cal_dca*), Redundancy Analysis (RDA: *mp_cal_rda*), and (Constrained) Correspondence Analysis (CCA: *mp_cal_cca*). To fit environmental vectors or factors onto an ordination, the package provides the *mp_envfit* function. All the ordination results can be visualized by the *mp_plot_ord* function. In addition, the package wraps several statistical analyses for distance matrices, such as permutational multivariate analysis of variance (*mp_adonis*), analysis of similarities (*mp_anosim*), the multi-response permutation procedure (*mp_mrpp*), and Mantel tests for dissimilarity matrices (*mp_mantel*). All these functions were developed based on a tidy-like framework and can be assembled into linear workflows with the pipe operator (%>% or |>).
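
As a small worked example of a distance followed by an ordination, the base R sketch below computes the Bray-Curtis dissimilarity by hand and passes it to `stats::cmdscale`, which performs classical multidimensional scaling (PCoA). The count matrix and the helper `bray_curtis` are invented for illustration; the code does not call *mp_cal_dist* or *mp_cal_pcoa*.

```r
# Toy count matrix: rows are samples here, columns are OTUs (invented values).
counts <- matrix(
  c(10,  0, 5,
    12,  1, 4,
     0, 20, 8,
     1, 18, 9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("OTU", 1:3))
)

# Bray-Curtis dissimilarity between two samples (hypothetical helper).
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Fill the pairwise dissimilarity matrix.
n <- nrow(counts)
d <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(counts[i, ], counts[j, ])
  }
}
d <- as.dist(d)

# PCoA: classical multidimensional scaling of the dissimilarities.
pcoa <- cmdscale(d, k = 2, eig = TRUE)
print(round(as.matrix(d), 3))
print(pcoa$points)
```

In this toy data S1 and S2 share a similar composition, as do S3 and S4, so the first PCoA axis separates the two pairs.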

## Differential analysis and biomarker discovery {#chapter5.9}

*MicrobiotaProcess* implements the *mp_diff_analysis* function for identifying differentially abundant genera as biomarkers based on the tidy-like framework. Similar to LEfSe [@Segata2011], the analysis is performed in three steps. First, all features are tested to determine whether their values (e.g., abundance) are differentially distributed among the groups of samples via the Kruskal-Wallis rank-sum test (default; the other options are oneway.test, glm, or glm.nb). Then, the features that reject the null hypothesis (filtered by FDR by default, unlike LEfSe) are tested in a second round using the Wilcoxon rank-sum test (default; the other options are t.test, glm, or glm.nb) to keep the features for which all pairwise comparisons between the sub-groups are significantly consistent with the group-level trend. Finally, a linear discriminant analysis (LDA) or random forest model is built to rank all the features based on the relative difference among the groups. Compared with LEfSe, *mp_diff_analysis* is more flexible. Users can set not only the test method but also the statistic that determines in which group a significant feature is more abundant (generalized fold change [@Wirbel2019] by default; the other options compare the median or mean value of the groups via *compare_median* or *compare_mean*). The result is integrated into the *taxatree* component (default) or the *rowData* component, depending on whether the taxonomy is provided. The result can be extracted via *mp_extract_tree* or *mp_extract_feature*, respectively. It can then be processed and displayed via treeio [@treeio], tidytree [@tidytree], ggtree [@ggtree2017], ggtreeExtra [@ggtreeExtra], and ggplot2 (Figure 4A and Figure 7).
To decrease the coding burden, we also developed *mp_plot_diff_boxplot*, *mp_plot_diff_manhattan*, *mp_plot_diff_res*, and *mp_plot_diff_cladogram* to visualize the results of the differential analysis (Figure 3F). The evaluation of *mp_diff_analysis* against other tools on simulated and real datasets is available in Supplemental File B.
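
The first two testing steps can be sketched with the `stats` package alone. The example below is a simplified illustration under stated assumptions: the abundance values are invented, there are only two groups (so the second round consists of a single pairwise comparison), the enriched group is decided by comparing medians, and the LDA or random forest ranking step is left out. It is not the code of *mp_diff_analysis*.

```r
# Abundances of three features in 12 samples (invented values).
group <- factor(rep(c("Field", "Forest"), each = 6))
abund <- rbind(
  feat1 = c(1, 2, 3, 4, 5, 6, 101, 102, 103, 104, 105, 106),
  feat2 = c(1, 3, 5, 7, 9, 11, 2, 4, 6, 8, 10, 12),
  feat3 = c(50, 60, 70, 80, 90, 100, 5, 6, 7, 8, 9, 10)
)

# Step 1: Kruskal-Wallis rank-sum test per feature, then FDR correction.
p_kw <- apply(abund, 1, function(x) kruskal.test(x, group)$p.value)
fdr <- p.adjust(p_kw, method = "fdr")
candidates <- names(fdr)[fdr <= 0.05]

# Step 2: Wilcoxon rank-sum test between the two groups for each candidate.
p_wilcox <- vapply(candidates, function(f) {
  wilcox.test(abund[f, group == "Field"], abund[f, group == "Forest"])$p.value
}, numeric(1))
significant <- candidates[p_wilcox <= 0.05]

# Simplified direction call: the group with the larger median abundance.
enriched <- vapply(significant, function(f) {
  med <- tapply(abund[f, ], group, median)
  names(med)[which.max(med)]
}, character(1))
print(enriched)
```

With more than two sub-groups, the second step would be repeated for every pair of sub-groups, and a feature would be kept only when all pairwise comparisons agree with the group-level trend.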

## Accessors to fetch internal data {#chapter5.10}

The *MPSE* object is composed of several objects to store different data including primary data and analysis results. To extract the components of the data, *MicrobiotaProcess* provides several accessors starting with *mp_extract_*, including: 
1) *mp_extract_sample* function: to return the sample characteristics in a tidy data table, similar to the *colData* function for the *SummarizedExperiment* class. 
2) *mp_extract_assays* function: to extract the assays from an *MPSE* object, similar to the *assay* function for the *SummarizedExperiment* class.
3) *mp_extract_feature* function: to extract the feature characteristics (optionally with the taxonomy information, by setting the *addtaxa* argument to *TRUE*) and return tidy data, similar to the *rowData* function for the *SummarizedExperiment* class.
4) *mp_extract_tree* function: to extract the *taxatree* (by default) or *otutree* (by specifying the type argument to *otutree*) from an *MPSE* object.
5) *mp_extract_taxonomy* function: to extract the taxonomy information from an *MPSE* object.
6) *mp_extract_refseq* function: to extract the representative sequences from an *MPSE* object.
7) *mp_extract_dist* function: to extract distances in a matrix (as a dist object, by default) or a tidy data frame with comparison among the groups (by specifying the *.group* argument).
8) *mp_extract_rarecurve* function: to extract the rarefaction result as a *rarecurve* object, which can be visualized by the *ggrarecurve* function.
9) *mp_extract_internal_attr* function: to extract the raw result of *mp_cal_pca*, *mp_cal_pcoa*, *mp_cal_rda*, *mp_cal_cca*, *mp_cal_clust*, *mp_envfit*, *mp_adonis*, *mp_anosim*, *mp_mrpp* and *mp_mantel*.

# Session information {#chapter6}

Here is the output of `sessioninfo::session_info()` on the system on which this document was compiled:

```{r, echo=FALSE}
options(width = 200)
sessioninfo::session_info()
```

# References {#chapter7}