Skip to content

Latest commit

 

History

History
585 lines (493 loc) · 27.1 KB

README.md

File metadata and controls

585 lines (493 loc) · 27.1 KB

NFDI4Microbiota - Metadata Standards

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

The primary objective of this GitHub page is to serve as a centralized repository for existing (meta)data standards. The purpose is to provide the international microbiological community with a comprehensive and easily accessible compilation of established standards, facilitating efficient navigation and utilization for researchers involved in collecting and submitting (meta)data to public repositories.

In line with the NFDI4Microbiota (“German National Research Data Infrastructure” ) project’s objectives, this page aims to address the challenges of microbial (meta)data accessibility and consistency. The efficient exchange of usable information between research groups, sequencing centers, and data repositories has been a long-standing issue. Measure 2.1 (M2.1) specifically focuses on maximizing data quality within the NFDI4Microbiota system by enforcing compliance with existing standards and identifying additional tailored data standards and metadata requirements.

Goals: By centralizing standard parameters for metadata, the project ensures that generated data is reproducible and comparable both spatially and temporally. To achieve this, two milestones have been set:

  • defining data standards for different types of raw data, and ensuring their quality and reliability
  • defining data standards for technical metadata, further enhancing the consistency and usability of the collected metadata.

In the context of metadata quality standards in microbial science, two main categories are being considered:

These categories aim to encompass the necessary information that researchers collecting and submitting metadata to public repositories need to provide. By adhering to these standardized metadata categories, researchers can ensure the integrity and interoperability of their data, enabling effective collaboration and comparative analysis within the international microbiological community.

Reading this Github

The key aspects considered for determining minimal metadata standards that can be universally applicable across various datasets and microbiomes are outlined in Figure 1. These aspects encompass both technical and biological/environmental (Bio/Env) considerations. The figure illustrates the comprehensive approach used to establish minimal metadata standards for diverse research settings by combining already established standards for differing data types and biomes.

Overview Figure 1. Flow Chart of Technical and Biological/Environmental Metadata Standard Development

This flow chart illustrates the process of developing metadata standards for both Technical and Biological/Environmental aspects. Technical parameters are categorized based on data types, while Bio/Env parameters are organized according to biome types. Additionally, specific considerations, such as file type and host, are taken into account to enhance the comprehensiveness of the standards.

Technical Metadata Standards

Technical metadata section

1. Data types

The following data types were considered when establishing minimal technical metadata standards for M2.1:

  • Genomes
  • Amplicon
  • Metagenomes
  • Metagenome assembled genomes (MAGs)
  • Transcriptomes
  • Metatranscriptomes
  • Proteomes
  • Metaproteomes
  • Metabolomes

Standard parameter considerations for FASTQ and FASTA formats are displayed in Figure 2. and Figure 3., respectively. Parameter applicability to different data types and the time of data generation (i.e., before sequencing or during data processing) are shown on the left and right, respectively.

Additionally, standards are being considered for data transfer and data integrity to ensure quality is maintained throughout various processes of data file exchange.

2. Overview of minimal technical FASTQ and FASTA metadata considerations.

FASTQMetadataTablesOverview Figure 2. Overview of Minimal Technical Metadata for FASTQ Files

This figure provides an overview of the minimal technical metadata relevant to FASTQ files. The left side lists the applicability of parameters to different data types, such as (meta)genome, (meta)transcriptome, etc. On the right side, the time of metadata generation is indicated.

FASTAMetadataTablesOverview Figure 3. Overview of Minimal Technical Metadata for FASTA Files

This figure presents an overview of the minimal technical metadata relevant to FASTA files. On the left side, the applicability of parameters to different data types, including (meta)genome, (meta)transcriptome, etc., is listed. The right side provides information about the time of metadata generation.

3. Minimal technical metadata by technology and file type

Establishing a file-specific metadata standard list poses a significant challenge due to variations in file types across instruments used in metabolomic and proteomic analyses. Thus, researchers can find the metadata standards for each specific technology within corresponding links. This approach recognizes the complexities of defining comprehensive and universally applicable metadata standards that differ based on technology.

4. Data transfer and data integrity

The work of the Data transfer and data integrity section focuses on:

  • Examples of existing data transfer & data integrity checks
  • Data integrity considerations by file type

Biological and Environmental Metadata Standards

Bio/Env metadata section

1. Biomes considered

Six microbiomes were considered to compile a minimal set of biological and environmental metadata standards. Environmental and biological parameters were identified as minimums applicable to individual biomes and/or hosts.

The Minimal Biological and Environmental microbiome metadata standards within M2.1 were collected to apply to the following biomes:

Tentative standard minimal biological and environmental parameter considerations are displayed in Figure 4. Parameter applicability to different biomes are shown on the left axis.

BioEnvMetadata23June2022 Figure 4. Tentative Minimal Biological and Environmental Metadata.

This figure presents the division of minimal biological and environmental metadata into distinct categories. Site metadata includes specifications and environmental parameters related to the geographic sampling location, while sample material and host metadata provide information specific to host-associated systems. The applicability of these standards to different microbiomes is shown on the left. Additionally, conditional metadata standards encompass pertinent minimal cultivation information.

The references in the figure are from the following sources:

2. Data/metadata categorization

The categorization framework in Figure 5 should be considered when determining the applicable metadata standards for each dataset. This framework can serve as a valuable tool for connecting information about samples from marine, terrestrial, or engineered systems. Additionally, it facilitates the inclusion of cultivated samples, whether they were cultured from a commercially-available source or isolated from an environmental sample by the user.

To enhance searchability in downstream analyses, users can select multiple environment categories if relevant. For instance, they may choose both “marine” and “terrestrial” for a tidal flat site, “engineered” and “terrestrial” for a greenhouse agricultural site, or “engineered” and “marine” for a commercially-available culture initially isolated from the ocean.

Category Flow Chart

Figure 5. Tentative Categorization Framework for Biological/Environmental Metadata Requirements

This figure showcases a preliminary categorization framework to establish minimal biological/environmental metadata requirements. The framework connects host-associated systems to marine, terrestrial, or engineered environments while enabling effective tracking of data affiliated with cultivated samples. The structure should provide valuable insights for organizing and comprehensively accessing diverse datasets.

Figures 6 - 8 show examples of minimal biological/environmental metadata applicability to different sample categorizations.

Human Gut Example

Figure 6. Example of Categorizing a Human Gut-Associated and Cultivated Sample with Applicable Minimal Metadata

This figure provides an illustrative example of the categorization process for a human gut-associated and cultivated sample. It showcases the minimal metadata that are applicable and relevant for this specific sample type.

Tidal Flat Example

Figure 7. Example of Categorizing a Tidal Flat and Cultivated Sample with Applicable Minimal Metadata

This figure presents a practical example of categorizing a tidal flat cultivated sample, along with the relevant minimal metadata. The illustration demonstrates how the proposed framework accommodates overlapping environments, such as terrestrial and marine, specifically for intertidal regions.

Lab Culture Example

Figure 8. Example of Categorizing a Known Lab Cultured Sample with Applicable Minimal Metadata

This figure presents an example of categorizing a known lab-cultured sample, along with the corresponding minimal metadata. The bidirectionality of the categorization framework is highlighted, as it enables the linkage between known, commercially available cultures and their original sample environments.

References

Bowers, R., N. Kyrpides, R. Stepanauskas, et al. 2017. “Minimum Information about a Single Amplified Genome (MISAG) and a Metagenome-Assembled Genome (MIMAG) of Bacteria and Archaea.” Nat Biotechnol 35: 725–31. https://doi.org/10.1038/nbt.3893.

Buttigieg, P. L., N. Morrison, B. Smith, C. J. Mungall, S. E. Lewis, and ENVO Consortium. 2013. “The Environment Ontology: Contextualising Biological and Biomedical Entities.” Journal of Biomedical Semantics 4 (1): 43. https://doi.org/10.1186/2041-1480-4-43.

“ENA Marine Microalgae Checklist; Checklist: ERC000043.” https://www.ebi.ac.uk/ena/browser/view/ERC000043.

“ENA Tara Oceans; Checklist: ERC000030.” https://www.ebi.ac.uk/ena/browser/view/ERC000030.

Field, D., G. Garrity, T. Gray, N. Morrison, J. Selengut, P. Sterk, T. Tatusova, et al. 2008. “The Minimum Information about a Genome Sequence (MIGS) Specification.” Nature Biotechnology. 2008. https://doi.org/10.1038/nbt1360.

“German National Research Data Infrastructure.” https://www.nfdi.de/?lang=en.

“GSC MIxS Human Associated; ENA Checklist: ERC000014.” https://www.ebi.ac.uk/ena/browser/view/ERC000014.

“GSC MIxS Water; ENA Checklist: ERC000024.” https://www.ebi.ac.uk/ena/browser/view/ERC000024.

“GSC MIxS: AgricultureMIMS.” https://genomicsstandardsconsortium.github.io/mixs/AgricultureMIMS/.

“GSC MIxS: BuiltEnvironmentMIMS.” https://genomicsstandardsconsortium.github.io/mixs/BuiltEnvironmentMIMS/.

“GSC MIxS: Host-associatedMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Host-associatedMIMS/.

“GSC MIxS: Human-associatedMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Human-associatedMIMS/.

“GSC MIxS: Human-gutMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Human-gutMIMS/.

“GSC MIxS: Human-oralMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Human-oralMIMS/.

“GSC MIxS: Human-skinMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Human-skinMIMS/.

“GSC MIxS: Human-vaginalMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Human-vaginalMIMS/.

“GSC MIXS: MicrobialMatBiofilmMIMS.” https://genomicsstandardsconsortium.github.io/mixs/MicrobialMatBiofilmMIMS/.

“GSC MIxS: MiscellaneousNaturalOrArtificialEnvironmentMIMS.” https://genomicsstandardsconsortium.github.io/mixs/MiscellaneousNaturalOrArtificialEnvironmentMIMS/.

“GSC MIxS: Plant-associatedMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Plant-associatedMIMS/.

“GSC MIxS: SedimentMIMS.” https://genomicsstandardsconsortium.github.io/mixs/SedimentMIMS/.

“GSC MIxS: SoilMIMS.” https://genomicsstandardsconsortium.github.io/mixs/SoilMIMS/.

“GSC MIxS: Symbiont-associatedMIMS.” https://genomicsstandardsconsortium.github.io/mixs/Symbiont-associatedMIMS/.

“GSC MIxS: WastewaterSludgeMIMS.” https://genomicsstandardsconsortium.github.io/mixs/WastewaterSludgeMIMS/.

“GSC MIxS: WaterMIMS.” https://genomicsstandardsconsortium.github.io/mixs/WaterMIMS/.

Kottmann, R., T. Gray, S. Murphy, L. Kagan, S. Kravitz, T. Lombardot, D. Field, F. O. Glöckner, and Genomic Standards Consortium. 2008. “A Standard MIGS/MIMS Compliant XML Schema: Toward the Development of the Genomic Contextual Data Markup Language (GCDML).” OMICS: A Journal of Integrative Biology. 2008. https://doi.org/10.1089/omi.2008.0A10.

“MIMS: Metagenome/Environmental, Human-Associated; Version 6.0 Package.” https://www.ncbi.nlm.nih.gov/biosample/docs/packages/MIMS.me.human-associated.5.0/.

Morrison, Norman, Daniel Bearden, Jacob G. Bundy, Timothy Collette, Fraser Currie, Matthew Davey, Migdalia Dominguez, et al. 2007. “Standard Reporting Requirements for Biological Samples in Metabolomics Experiments: Environmental Context.” Metabolomics 3 (2): 203–10. https://doi.org/10.1007/s11306-007-0067-1.

Murray, A. E., J. Freudenstein, S. Gribaldo, et al. 2020. “Roadmap for Naming Uncultivated Archaea and Bacteria.” Nat Microbiol 5: 987–94. https://doi.org/10.1038/s41564-020-0733-x.

“NFDI4Microbiota.” https://nfdi4microbiota.de/.

“U.s. Office of Management and Budget (OMB): About the Topic of Race.” https://www.census.gov/topics/population/race/about.html.

Yilmaz, Pelin et al. 2011. “Minimum Information about a Marker Gene Sequence (MIMARKS) and Minimum Information about Any (x) Sequence (MIxS) Specifications.” Nature Biotechnology 29 (5): 415–20. https://doi.org/10.1038/nbt.1823.