NFDI4Microbiota - Metadata Standards

Reading this GitHub

NFDI4Microbiota introduction

The National Research Data Infrastructure Germany (NFDI) currently comprises 19 consortium members spanning diverse fields, including physical sciences, human health, biology, artificial intelligence, and cultural and economic science, among others [1]. In July 2021, NFDI4Microbiota was selected to become a consortium member. Its mission is "to be the central hub in Germany for supporting the microbiology community with access to data, analysis services, data/metadata standards and training." [2] By building analytical tools, ensuring that FAIR principles are followed, and standardizing metadata and data processing, NFDI4Microbiota will contribute to the interdisciplinary NFDI network from the microbiological perspective.

NFDI4Microbiota - Standards and Policies

NFDI4Microbiota aims to address issues of microbial data accessibility and consistency. These issues have long presented challenges for the efficient exchange of useable information between research groups, data generators (e.g. sequencing centers), and data repositories. Specifically, Measure 2.1 (M2.1) has the goal "to maximize the quality of data entering the NFDI4Microbiota system by enforcing compliance with existing standards, as well as to identify and promote additional tailored data standards and metadata requirements within the NFDI4Microbiota systems." Establishing standard parameters for metadata will ensure that generated data is reproducible and comparable, both spatially and temporally.

Goals and Milestones

Goals: To maximize the quality of data entering the NFDI4Microbiota system by enforcing compliance with existing standards, as well as to identify and promote additional tailored data standards and metadata requirements within the NFDI4Microbiota systems through the following two milestones:

  • Definition of data standards for the different types of raw data established
  • Definition of data standards for the technical metadata established

To address metadata quality standards in microbial science, two metadata categories are being considered:

  • Technical
  • Biological/Environmental

Figure 1 outlines the aspects of both technical and biological/environmental (Bio/Env) metadata that were taken into account when determining metadata parameters that would be applicable across various datasets and microbiomes.

Figure 1. Flow chart of Technical and Biological/Environmental metadata standard development. Technical parameter categories were structured based on data types, and bio/env parameter categories were based on biome type. More specific considerations were taken into account for file type, host, etc.

Technical Metadata Standards

Technical metadata section 1. Data types

The following data types were considered when establishing minimal technical metadata standards for M2.1:

  • Genomes
  • Amplicon
  • Metagenomes
  • Metagenome assembled genomes
  • Transcriptomes
  • Metatranscriptomes
  • Proteomes
  • Metaproteomes
  • Metabolomes

Standard parameter considerations for FASTQ and FASTA formats are displayed in Figures 2 and 3, respectively. In each figure, parameter applicability to different data types is shown on the left, and the time of metadata generation (i.e. before sequencing or during data processing) is shown on the right.

Additionally, standards are being considered for data integrity and data transfer to ensure quality is maintained throughout various processes of data file exchange.

Technical metadata section 2. Overview of minimal technical FASTQ and FASTA metadata considerations.

Figure 2. Overview of minimal technical metadata considered for FASTQ files. Parameter applicability to data types ((meta)genome, (meta)transcriptome, etc.) is listed on the left, and time of metadata generation is listed on the right.

Figure 3. Overview of minimal technical metadata considered for FASTA files. Parameter applicability to data types ((meta)genome, (meta)transcriptome, etc.) is listed on the left, and time of metadata generation is listed on the right.
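The two axes described for Figures 2 and 3 (which data types a parameter applies to, and when it is generated) can be sketched as a small lookup structure. This is only an illustration: the parameter names `example_instrument_field` and `example_assembly_field`, the `Stage` enum, and the `parameters_for` helper are hypothetical and are not taken from the figures or from any NFDI4Microbiota tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """When a piece of technical metadata is generated."""
    BEFORE_SEQUENCING = "before sequencing"
    DATA_PROCESSING = "during data processing"


@dataclass(frozen=True)
class TechnicalParameter:
    name: str              # hypothetical parameter name
    applies_to: frozenset  # data types the parameter applies to
    stage: Stage           # time of metadata generation


# Hypothetical entries for illustration only; not the contents of Figures 2 or 3.
PARAMETERS = [
    TechnicalParameter(
        "example_instrument_field",
        frozenset({"genome", "metagenome", "transcriptome"}),
        Stage.BEFORE_SEQUENCING,
    ),
    TechnicalParameter(
        "example_assembly_field",
        frozenset({"genome", "metagenome assembled genome"}),
        Stage.DATA_PROCESSING,
    ),
]


def parameters_for(data_type, stage=None):
    """Return names of parameters applicable to a data type, optionally filtered by stage."""
    return [
        p.name
        for p in PARAMETERS
        if data_type in p.applies_to and (stage is None or p.stage is stage)
    ]
```

For example, `parameters_for("genome", Stage.DATA_PROCESSING)` would return only the parameters that apply to genomes and are generated during data processing.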

Technical metadata section 3. Minimal technical metadata by technology and file type

Because file type varies by the instrument used in metabolomic and proteomic analyses, establishing a file-specific metadata standard list presents challenges. Therefore, the metadata standards for these can be found within each technology link.

Technical metadata section 4. Data transfer and data integrity

The work of the Data transfer and data integrity section focuses on:

  • Examples of existing data transfer & data integrity checks
  • Data integrity considerations by file type
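One widely used integrity check for file transfer is to compare a checksum computed before transfer with one computed afterwards. The sketch below assumes SHA-256 and uses the hypothetical helper names `sha256_of` and `transfer_is_intact`; it is not necessarily the check adopted within NFDI4Microbiota.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequence files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(path, expected_hex):
    """Return True if the file's digest matches the digest recorded by the sender."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The sender would record the digest alongside the file, and the recipient would call `transfer_is_intact` on the received copy with that recorded value.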

Biological and Environmental Metadata Standards

Bio/Env metadata section 1. Biomes considered

To compile a minimal set of biological and environmental metadata standards, six microbiomes were considered. Minimal environmental and biological parameters were identified according to their applicability to individual biomes and/or hosts.

The Minimal Biological and Environmental microbiome metadata standards within M2.1 were established to be applicable to the following biomes:

Tentative standard minimal biological and environmental parameter considerations are displayed in Figure 5. Parameter applicability to different biomes is shown on the left axis.

Figure 5. Tentative minimal biological and environmental metadata, divided into two categories: site metadata, covering specifications and environmental parameters relating to the geographic sampling location and sample material, and host metadata, covering information specific to host-associated systems. Applicability to different microbiomes is shown on the left. Conditional metadata standards include pertinent minimal cultivation information.

The references in the figure are from the following sources:

  • MIMS/MIxS: Human Associated package [3, 4], Water [5]
  • IMG: Joint Genome Institute Integrated Microbial Genomes & Microbiomes [6]
  • ENA MMC: ENA Marine Microalgae Checklist [7]
  • TMDB: Terrestrial Metagenome Database [8]
  • MMgP: Marine Metagenomics Portal [9]
  • PlanetMicrobe [10]
  • TO: Tara Oceans [11]
  • MSI: Metabolomics Standards Initiative [12]
  • HMgDB: Human Metagenome Database [13]
  • HMP: Human Microbiome Project [14]
  • PAMDB: Plant Associated and Environmental Microbes Database [15]

Bio/Env metadata section 2. Data/metadata categorization

To determine which metadata standards may be applicable to each dataset, the categorization framework in Figure 6 is being considered. This structure can bridge information about samples that come from marine, terrestrial, or engineered systems. It can also connect samples that were cultivated, either cultured from a commercially available source or isolated from an environmental sample by the user. To support searchability for downstream analyses, multiple environment categories can be selected where applicable (e.g. "marine" and "terrestrial" for a tidal flat site, "engineered" and "terrestrial" for a greenhouse agricultural site, or "engineered" and "marine" for a commercially available culture initially isolated from the ocean).

CategoryFlowchart

Figure 6. Tentative categorization framework for establishing biological/environmental metadata requirements. This structure allows for connecting host-associated systems to marine, terrestrial, or engineered environments. It also allows tracking of data which are affiliated with cultivated samples.
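The multi-category selection described above can be sketched as a record that holds a set of environments rather than a single value, so that a sample is found under every category it belongs to. The class `SampleCategory`, its field names, and the helper `find_by_environment` are hypothetical and only illustrate the idea; they are not part of the framework in Figure 6.

```python
from dataclasses import dataclass
from typing import Optional

# Environment categories named in the text; the set-based model is an assumption.
ENVIRONMENTS = frozenset({"marine", "terrestrial", "engineered"})


@dataclass(frozen=True)
class SampleCategory:
    name: str
    environments: frozenset      # one or more of ENVIRONMENTS
    host: Optional[str] = None   # e.g. "human"; None if not host-associated
    cultivated: bool = False     # True for cultured or isolated samples

    def __post_init__(self):
        unknown = set(self.environments) - ENVIRONMENTS
        if unknown:
            raise ValueError(f"unknown environment(s): {sorted(unknown)}")


def find_by_environment(samples, environment):
    """Return the names of all samples tagged with the given environment."""
    return [s.name for s in samples if environment in s.environments]


# The three examples given in the text.
samples = [
    SampleCategory("tidal flat site", frozenset({"marine", "terrestrial"})),
    SampleCategory("greenhouse site", frozenset({"engineered", "terrestrial"})),
    SampleCategory("commercial culture", frozenset({"engineered", "marine"}),
                   cultivated=True),
]
```

With this model, a search for "marine" returns both the tidal flat site and the commercial culture, which is the overlap behaviour the framework aims for.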

Figures 7-9 show examples of minimal biological/environmental metadata applicability to different sample categorizations.

HumanGutExample

Figure 7. Example of categorizing a human gut-associated and cultivated sample, and the applicable minimal metadata.

TidalFlatExample

Figure 8. Example of categorizing a tidal flat uncultivated sample, and the applicable minimal metadata. The proposed framework allows for overlapping environments (i.e. terrestrial and marine for intertidal regions) to enhance downstream searchability.

LabCultureExample

Figure 9. Example of categorizing a known lab-cultured sample, and the applicable minimal metadata. Bidirectionality of the categorization framework allows linking known, commercially available cultures and their original sample environments.

References

  1. https://www.nfdi.de/

  2. https://nfdi4microbiota.de/

  3. https://www.ncbi.nlm.nih.gov/biosample/docs/packages/MIMS.me.human-associated.5.0/

  4. https://www.ebi.ac.uk/ena/browser/view/ERC000014

  5. https://www.ebi.ac.uk/ena/browser/view/ERC000024

  6. https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=FindGenomes&page=genomeSearch

  7. https://www.ebi.ac.uk/ena/browser/view/ERC000043

  8. https://webapp.ufz.de/tmdb/

  9. https://mmp2.sfb.uit.no/marref/

  10. https://www.planetmicrobe.org/#/search

  11. https://www.ebi.ac.uk/ena/browser/view/ERC000030

  12. https://github.com/MSI-Metabolomics-Standards-Initiative/CIMR

  13. https://webapp.ufz.de/hmgdb/

  14. https://hmpdacc.org/

  15. pamdb.org