- Begin by reading the NFDI4Microbiota introduction, Standards and Policies information, and Goals and Milestones
- Next, read the information regarding technical metadata parameter standards
- Third, read the biological/environmental metadata standards section
The National Research Data Infrastructure Germany (NFDI) currently comprises 19 consortium members spanning diverse fields, including physical sciences, human health, biology, artificial intelligence, and cultural and economic science, among others [1]. In July 2021, NFDI4Microbiota was selected to become a consortium member, with the mission "to be the central hub in Germany for supporting the microbiology community with access to data, analysis services, data/metadata standards and training" [2]. By building analytical tools, ensuring that FAIR principles are followed, and standardizing metadata and data processing, NFDI4Microbiota will contribute to the interdisciplinary NFDI network from the microbiological perspective.
NFDI4Microbiota aims to address issues of microbial data accessibility and consistency. These issues have long hindered the efficient exchange of usable information between research groups, data generators (e.g. sequencing centers), and data repositories. Specifically, Measure 2.1 (M2.1) has the goal "to maximize the quality of data entering the NFDI4Microbiota system by enforcing compliance with existing standards, as well as to identify and promote additional tailored data standards and metadata requirements within the NFDI4Microbiota systems." Establishing standard parameters for metadata will help ensure that generated data are reproducible and comparable, both spatially and temporally.
Goals: To maximize the quality of data entering the NFDI4Microbiota system by enforcing compliance with existing standards, as well as to identify and promote additional tailored data standards and metadata requirements within the NFDI4Microbiota systems through the following two milestones:
- Definition of data standards for the different types of raw data established
- Definition of data standards for the technical metadata established
To address metadata quality standards in microbial science, two metadata categories are being considered:
- Technical
- Biological/Environmental
Figure 1 outlines the aspects of both technical and biological/environmental (Bio/Env) metadata that were taken into account when determining metadata parameters applicable across various datasets and microbiomes.
Figure 1. Flow chart of Technical and Biological/Environmental metadata standard development. Technical parameter categories were structured based on data types, and bio/env parameter categories were based on biome type. More specific considerations were taken into account for file type, host, etc.
The following data types were considered when establishing minimal technical metadata standards for M2.1:
- Genomes
- Amplicon
- Metagenomes
- Metagenome assembled genomes
- Transcriptomes
- Metatranscriptomes
- Proteomes
- Metaproteomes
- Metabolomes
Standard parameter considerations for FASTQ and FASTA formats are displayed in Figures 2 and 3, respectively. In each figure, parameter applicability to different data types is shown on the left, and the time of metadata generation (i.e. before sequencing or during data processing) is shown on the right.
Additionally, standards are being considered for data integrity and data transfer to ensure quality is maintained throughout various processes of data file exchange.
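As an illustration of the kind of integrity check that can accompany a file transfer, the sketch below compares a checksum computed before transfer with one computed on the received copy. The choice of SHA-256 and the function names `sha256_of_file` and `transfer_is_intact` are assumptions made for this example; they are not checks prescribed by M2.1.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for large sequence files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, received_path):
    """Return True if both files have identical SHA-256 digests."""
    return sha256_of_file(source_path) == sha256_of_file(received_path)
```

In practice the sender would typically publish the digest alongside the file, so that the recipient only needs to hash the received copy and compare the two strings.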
Technical metadata section 2. Overview of minimal technical FASTQ and FASTA metadata considerations.
Figure 2. Overview of minimal technical metadata considered for FASTQ files. Parameter applicability to data types ((meta)genome, (meta)transcriptome, etc.) is listed on the left, and time of metadata generation is listed on the right.
Figure 3. Overview of minimal technical metadata considered for FASTA files. Parameter applicability to data types ((meta)genome, (meta)transcriptome, etc.) is listed on the left, and time of metadata generation is listed on the right.
- 2.1 Genome Sequencing
- Genomic FASTQ
- Genomic FASTA
- 2.2 Amplicon Sequencing
- Amplicon FASTQ
- 2.3 Metagenome Sequencing
- Metagenome FASTQ
- Metagenome FASTA
- Metagenome assembled genome FASTA
- 2.4 Transcriptome Sequencing
- Transcriptome FASTQ
- Transcriptome FASTA
- 2.5 Metatranscriptome Sequencing
- Metatranscriptome FASTQ
- Metatranscriptome FASTA
- 2.6 Proteome sequencing
- Proteome
- Proteome - experimental protocol edition
- 2.7 Metaproteome sequencing
- 2.8 Metabolome sequencing
- Metabolome
- Metabolome - experimental protocol edition
- 2.9 BIOM or tabular files
Because file type varies by the instrument used in metabolomic and proteomic analyses, establishing a file-specific metadata standard list presents challenges. Therefore, the metadata standards for these can be found within each technology link.
The work of the Data transfer and data integrity section focuses on:
- Examples of existing data transfer & data integrity checks
- Data integrity considerations by file type
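To make "data integrity considerations by file type" concrete, here is a minimal sketch of a structural check for FASTQ text. It assumes the common four-line record layout (header starting with `@`, sequence, separator starting with `+`, quality string of the same length as the sequence) with no line wrapping. The function name `check_fastq_records` is hypothetical, and the checks actually adopted by NFDI4Microbiota may differ.

```python
def check_fastq_records(lines):
    """Return a list of problems found in four-line FASTQ records (empty if none)."""
    problems = []
    stripped = [line.rstrip("\r\n") for line in lines]
    # Ignore blank lines at the end of the file.
    while stripped and stripped[-1] == "":
        stripped.pop()

    remainder = len(stripped) % 4
    if remainder != 0:
        problems.append("line count is not a multiple of 4 (file may be truncated)")

    # Inspect only complete records.
    for start in range(0, len(stripped) - remainder, 4):
        header, sequence, separator, quality = stripped[start:start + 4]
        record = start // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {record}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record}: separator does not start with '+'")
        if len(sequence) != len(quality):
            problems.append(f"record {record}: sequence and quality lengths differ")
    return problems
```

A check like this complements a checksum: the checksum detects any change during transfer, while the structural check catches files that were already malformed or truncated before they were hashed.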
To compile a minimal set of biological and environmental metadata standards, six microbiomes were considered. Minimum environmental and biological parameters were identified for each biome and/or host, as applicable.
The Minimal Biological and Environmental microbiome metadata standards within M2.1 were established to be applicable to the following biomes:
- Marine
- Terrestrial
- Terrestrial (constructed)
- Plant-associated
- [Animal-associated](./Biological_Environmental/AnimalAssoc_BioEnv_Metadata.md)
- Human-associated
- Microbe-associated
Tentative standard minimal biological and environmental parameter considerations are displayed in Figure 5. Parameter applicability to different biomes is shown on the left axis.
Figure 5. Tentative minimal biological and environmental metadata, divided into two categories: site metadata, covering specifications and environmental parameters relating to the geographic sampling location and sample material, and host metadata, covering information specific to host-associated systems. Applicability to different microbiomes is shown on the left. Conditional metadata standards include pertinent minimal cultivation information.
The references in the figure are from the following sources:
- MIMS/MIxS: Human Associated package [3, 4], Water [5]
- IMG: Joint Genome Institute Integrated Microbial Genomes & Microbiomes [6]
- ENA MMC: ENA Marine Microalgae Checklist [7]
- TMDB: Terrestrial Metagenome Database [8]
- MMgP: Marine Metagenomics Portal [9]
- PlanetMicrobe [10]
- TO: Tara Oceans [11]
- MSI: Metabolomics Standards Initiative [12]
- HMgDB: Human Metagenome Database [13]
- HMP: Human Microbiome Project [14]
- PAMDB: Plant Associated and Environmental Microbes Database [15]
To determine which metadata standards may be applicable to each dataset, the categorization framework in Figure 6 is being considered. This structure can bridge information about samples that come from marine, terrestrial, or engineered systems. It can also connect samples that were cultivated, either cultured from a commercially available source or isolated from an environmental sample by the user. To support searchability for downstream analyses, multiple environment categories can be selected where applicable (e.g. "marine" and "terrestrial" for a tidal flat site, "engineered" and "terrestrial" for a greenhouse agricultural site, or "engineered" and "marine" for a commercially available culture initially isolated from the ocean).
Figure 6. Tentative categorization framework for establishing biological/environmental metadata requirements. This structure allows for connecting host-associated systems to marine, terrestrial, or engineered environments. It also allows tracking of data which are affiliated with cultivated samples.
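The multi-category selection described above can be sketched as a small data structure in which each sample carries a set of environment labels. The class `SampleCategory`, its field names, and the sample identifiers are hypothetical; only the three environment labels and the three example combinations come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Environment categories named in the framework description.
ENVIRONMENTS = frozenset({"marine", "terrestrial", "engineered"})


@dataclass(frozen=True)
class SampleCategory:
    sample_id: str
    environments: FrozenSet[str]       # one or more environment labels
    host: Optional[str] = None         # e.g. "human"; None if not host-associated
    cultivation: Optional[str] = None  # e.g. "commercial" or "user_isolated"

    def __post_init__(self):
        unknown = set(self.environments) - ENVIRONMENTS
        if not self.environments or unknown:
            raise ValueError(f"invalid environments: {sorted(self.environments)}")


def find_by_environment(samples, environment):
    """Return the IDs of samples tagged with the given environment label."""
    return [s.sample_id for s in samples if environment in s.environments]


samples = [
    SampleCategory("tidal_flat", frozenset({"marine", "terrestrial"})),
    SampleCategory("greenhouse", frozenset({"engineered", "terrestrial"})),
    SampleCategory("ocean_culture", frozenset({"engineered", "marine"}),
                   cultivation="commercial"),
]
```

With this layout, `find_by_environment(samples, "marine")` returns both the tidal flat sample and the ocean-derived culture, which is the searchability benefit that overlapping categories are meant to provide.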
Figures 7-9 show examples of minimal biological/environmental metadata applicability to different sample categorizations.
Figure 7. Example of categorizing a human gut-associated and cultivated sample, and the applicable minimal metadata.
Figure 8. Example of categorizing a tidal flat uncultivated sample, and the applicable minimal metadata. The proposed framework allows for overlapping environments (i.e. terrestrial and marine for intertidal regions) to enhance downstream searchability.
Figure 9. Example of categorizing a known lab cultured sample, and the applicable minimal metadata. Bidirectionality of the categorization framework allows linking known, commercially available cultures and their original sample environments.
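One way to read the three examples is as a rule that maps a sample's categorization to the metadata groups it must provide. The sketch below assumes that site metadata always applies, host metadata applies to host-associated samples, and cultivation metadata applies to cultivated samples. This is an interpretation of the site, host, and conditional groups described for Figure 5, and the function name `applicable_metadata_groups` is hypothetical. The figures remain the authoritative source for the actual parameters.

```python
def applicable_metadata_groups(host_associated, cultivated):
    """Return the metadata groups assumed to apply to a sample categorization."""
    groups = ["site"]  # assumed to apply to every sample
    if host_associated:
        groups.append("host")
    if cultivated:
        groups.append("cultivation")  # conditional cultivation information
    return groups


# Categorizations loosely modeled on the three figure examples.
human_gut_isolate = applicable_metadata_groups(host_associated=True, cultivated=True)
tidal_flat_sample = applicable_metadata_groups(host_associated=False, cultivated=False)
lab_culture = applicable_metadata_groups(host_associated=False, cultivated=True)
```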