Building a knowledge graph of the wastewater treatment microbiome and its biological context
- About the project
- Data
- Exploratory data analysis
- Microbial association networks
- MicW2Graph
- Case studies
- How to run the web app locally?
- Credits and Contributors
- Contact
Wastewater treatment (WWT) is the process of removing contaminants from used water before it is discharged back into the environment, which helps address water scarcity and protect aquatic ecosystems. Recent advances in high-throughput omics technologies have facilitated the study of microbiomes from complex environmental samples, such as those from WWT systems. A comprehensive study of an environmental microbiome requires integrating data from various studies and meta-omics technologies, as well as biological knowledge to interpret these data.
In this project, we investigated the microbiome of the WWT process to build MicW2Graph, an open-source knowledge graph that integrates metagenomic and metatranscriptomic information with their biological context, including biological processes, environmental and phenotypic features, chemical compounds, and additional metadata. We developed a workflow to collect meta-omics datasets from MGnify and infer potential interactions among microorganisms through microbial association networks. MicW2Graph enables the investigation of research questions related to WWT, focusing on aspects such as microbial connections, community memberships, and potential ecological functions.
The following figure shows the general workflow of the MicW2Graph project:
WWT meta-omics studies were queried from the MGnify API using experiment type and biome parameters. Further filters were applied based on experimental and taxonomic criteria. The abundance tables from the filtered studies were then grouped by biome and experiment type to infer microbial association networks. The workflow for retrieving and filtering WWT meta-omics studies from MGnify is summarized in the diagram below:
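The filtering and grouping steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: the record fields, the study identifiers, and the minimum-sample filter (`MIN_SAMPLES`) are hypothetical and do not reflect the MGnify schema or the real filtering criteria.

```python
from collections import defaultdict

# Hypothetical study records; field names and values are illustrative only.
studies = [
    {"study_id": "S1", "biome": "Activated sludge", "experiment_type": "metagenomic", "n_samples": 40},
    {"study_id": "S2", "biome": "Activated sludge", "experiment_type": "metatranscriptomic", "n_samples": 25},
    {"study_id": "S3", "biome": "Activated sludge", "experiment_type": "metagenomic", "n_samples": 12},
    {"study_id": "S4", "biome": "Water and sludge", "experiment_type": "metagenomic", "n_samples": 3},
]

MIN_SAMPLES = 10  # assumed experimental filter, not taken from the project


def filter_studies(records, min_samples):
    """Keep only the studies with at least `min_samples` samples."""
    return [r for r in records if r["n_samples"] >= min_samples]


def group_by_biome_and_experiment(records):
    """Map each (biome, experiment type) pair to the list of its study IDs."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["biome"], r["experiment_type"])].append(r["study_id"])
    return dict(groups)


kept = filter_studies(studies, MIN_SAMPLES)
groups = group_by_biome_and_experiment(kept)
for key, study_ids in sorted(groups.items()):
    print(key, study_ids)
```

In this sketch, each group would correspond to the set of abundance tables used to infer one microbial association network.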
The code to retrieve the data from MGnify is available in this GitHub repository.
A general overview of the filtered studies is provided through various plots describing the number of studies and samples, experiment types, sampling countries, sub-biomes, and other relevant metadata. The exploratory data analysis is encapsulated in a module of the MicW2Graph web application, which contains a general overview of all studies, views of studies by sub-biome and of individual studies, and a section for conducting pairwise comparisons between studies.
Microbial association networks (MANs) are weighted and undirected networks, defined as G = (V, E), where V is a set of nodes and E is a set of edges. Nodes in these networks are operational taxonomic units (OTUs) at a specific taxonomic level, while edges indicate substantial co-presence (positive interaction) or mutual exclusion (negative interaction) trends in microorganism abundances across samples. Weights in MANs correspond to the association values among species defined by the inference method, and there is an edge between two nodes if this value is greater than or equal to a given cutoff t.
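As a minimal sketch of this definition, the snippet below turns a symmetric association matrix into a weighted, undirected edge list. It assumes that the cutoff t is applied to the absolute association value, so that negative associations can also form edges; this is an assumption made for illustration, and the inference method may handle it differently. The function name `build_man` and the association values are hypothetical.

```python
from itertools import combinations


def build_man(taxa, assoc, t):
    """Return the edges {(u, v): weight} whose |weight| is >= cutoff t.

    `assoc` is a symmetric matrix (list of lists) of association values.
    Each unordered pair of taxa is visited once, so the graph is undirected.
    """
    edges = {}
    for i, j in combinations(range(len(taxa)), 2):
        weight = assoc[i][j]
        if abs(weight) >= t:  # assumption: threshold on the absolute value
            edges[(taxa[i], taxa[j])] = weight
    return edges


# Hypothetical association values, for illustration only.
taxa = ["Nitrospira", "Nitrosomonas", "Zoogloea"]
assoc = [
    [1.00, 0.62, -0.45],
    [0.62, 1.00, 0.10],
    [-0.45, 0.10, 1.00],
]

edges = build_man(taxa, assoc, t=0.4)
for (u, v), weight in edges.items():
    kind = "co-presence" if weight > 0 else "mutual exclusion"
    print(f"{u} -- {v}: {weight} ({kind})")
```

With t = 0.4, the pair with association 0.10 is dropped, and the sign of each remaining weight distinguishes positive from negative interactions.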
In this project, we selected the Correlation inference for Compositional data through Lasso (CCLasso) method. Network inference was conducted using the NetCoMi R package. The MANs for this study are available for download and visualization in the MicW2Graph web application.
The code for the network inference and analysis of MANs is available in this GitHub repository.
MicW2Graph incorporates the MANs with the optimal association threshold for each WWT sub-biome and experiment type, the biological context of the species within the MANs, and ontologies that standardize and expand the information in this resource. This knowledge graph (KG) comprises 1247 nodes and 9749 relationships, categorized into 12 node labels and 8 relationship labels. The relationships in MicW2Graph are classified as taxonomic, functional, and data-driven, reflecting the different layers of knowledge available in the KG.
The MicW2Graph metagraph and a snapshot of the graph database with nodes and edges for all sub-biomes and experiment types are shown below:
The KG and sub-biome subgraphs are available for download and visualization in the MicW2Graph web application.
The use cases demonstrate the potential of MicW2Graph to discover new species associated with WWT biological processes, showing how the available information of well-known species can help to predict potential functions and traits for less studied species. These species and communities can be further investigated as potential candidates to optimize the bioremediation process. The subgraphs for the case studies can be visualized and downloaded in the MicW2Graph web application.
Pyenv and Poetry were used to create a Python virtual environment and to manage the Python libraries and their dependencies. Each Poetry project has a pyproject.toml file with the names and version constraints of the required libraries, and a poetry.lock file, a TOML file that pins the exact versions of the libraries and their dependencies.
To create a Python virtual environment with the libraries and dependencies required for this project, install Pyenv and Poetry, install Python 3.11 with Pyenv, clone this GitHub repository, open a terminal in the folder containing the repository, and run the following commands:
# Activate 3.11.0 version of Python
$ pyenv local 3.11.0
# Create the Python virtual environment with Poetry
$ poetry install
# Activate the Python virtual environment
$ poetry shell
You can find a detailed guide on how to use Poetry here.
Alternatively, you can create a conda virtual environment with the required libraries using the requirements.txt file.
After installing the libraries, you can run the Streamlit app locally with the command below:
$ streamlit run MicW2Graph_Home.py
- Developed by Sebastián Ayala Ruano under the supervision of Dr. Alberto Santos, head of the Multiomics Network Analytics Group (MoNA) at the Novo Nordisk Foundation Center for Biosustainability (DTU Biosustain).
- MicW2Graph was built as part of a thesis project for the MSc in Systems Biology, carried out at the MoNA group.
- The data for this project was obtained from MGnify, using the scripts available in this GitHub repository.
If you have comments or suggestions about this project, you can open an issue in this repository.