2.3 Metagenome sequencing

A searchable and exportable tab-separated table of the following metadata is now available.

Minimal technical metadata for `Metagenomic FASTQ` data

🔹 italics = potential considerations

metadata	definition	reference of definition[<url_to_definition>]	expected unit of measurement	examples	source
sample_name	A local identifier or name for the material sample used for extracting nucleic acids and for subsequent sequencing. It can refer either to the original material collected or to any derived sub-samples. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. INSDC requires every sample name from a single Submitter to be unique.	MIXS:0001107	free text with identifier	e.g. ISDsoil1	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”)
seq_meth	Sequencing machine used. Where possible the term should be taken from the OBI list of DNA sequencers (http://purl.obolibrary.org/obo/OBI_0400103)	MIXS:0000050	<name_of_seq_machine>[ontology]	e.g. 454 Genome Sequencer FLX [OBI:0000702]	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”), ENA Metadata Validation: Instrument (“ENA Metadata Validation: Instrument”)
lib_source	The lib_source specifies the type of source material that is being sequenced	Link to permitted values	Free text from selected list of values	e.g. METAGENOMIC.	ENA Metadata Validation: Source (“ENA Metadata Validation: Source”)
lib_strategy	Sequencing technique intended for this library	Link to permitted values	Free text from selected list of values	e.g. WGS, WGA, etc.	ENA Metadata Validation: Strategy (“ENA Metadata Validation: Strategy”)
lib_selection	Whether any method was used to select and/or enrich the material being sequenced	Link to permitted values	Free text from selected list of values	e.g. RANDOM, cDNA_oligo_dT etc.	ENA Metadata Validation: Selection (“ENA Metadata Validation: Selection”)
nucl_acid_ext	Literature reference or SOP describing nucleic acid extraction	MIXS:0000037	Free text to the reference	e.g. CTAB extraction, Phenol-Chloroform Extraction	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”)
nucl_acid_amp	A link to a literature reference, electronic resource or a standard operating procedure (SOP), that describes the enzymatic amplification (PCR, TMA, NASBA) of specific nucleic acids	MIXS:0000038	PMID, DOI, URL	e.g. https://phylogenomics.me/protocols/16s-pcr-protocol/	GSC MISAG [gsc_migs_bacteria]
sequence_count	Number of reads in the library (sequencing depth), assigned at submission	Link to submission of genomes	integer value + unit of measurement	e.g. 32,283,453 OR 32.3M	Adapted from NCBI-SRA (Leinonen et al. 2011)
basepairs_count	Number of base pairs (nucleotides) in the library, assigned at submission	Link to submission of genomes	integer value + unit of measurement	e.g. 6,400,000 or 6.4M	Adapted from NCBI-SRA (Leinonen et al. 2011)
average_length (optional)	basepairs_count divided by sequence_count	As defined here	Integer	e.g. 198	Calculated as basepairs_count/sequence_count
sequence_q30 (optional)	Percentage of reads in the library (sequencing depth) with quality above 30	Link to resource to calculate	Integer from 0-100	e.g. 85	SRA-Tinder (NCBI Hackathons)
basepairs_q30 (optional)	Percentage of base pairs (nucleotides) in the library with quality above 30	Link to resource to calculate	Integer from 0-100	e.g. 80	SRA-Tinder (NCBI Hackathons)
checksum	Hash value for data integrity	Link to ENA MD5 Checksum	string with checksum	e.g. MD5: cbc41d0e49636872a765b950cb7f410a	Data transfer and data integrity
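
The derived fields in the table above (sequence_count, basepairs_count, average_length, sequence_q30, basepairs_q30 and checksum) can be computed directly from a FASTQ file. The following Python sketch is illustrative only and is not a reference implementation. It assumes an uncompressed FASTQ with exactly four lines per record and Phred+33 quality encoding. It also assumes that "quality above 30" means a mean read quality of at least 30 for sequence_q30 and a per-base quality of at least 30 for basepairs_q30. The function name `fastq_metadata` is hypothetical.

```python
import hashlib


def fastq_metadata(raw: bytes, phred_offset: int = 33, q_threshold: int = 30) -> dict:
    """Compute derived FASTQ metadata from the raw bytes of an uncompressed FASTQ file.

    Assumptions (not taken from the specification): four lines per record,
    Phred+33 qualities, and "quality above 30" interpreted as Q >= 30.
    """
    lines = raw.decode("ascii").splitlines()
    # Group the file into complete 4-line records: header, sequence, '+', quality.
    records = [lines[i:i + 4] for i in range(0, len(lines) - 3, 4)]

    sequence_count = 0
    basepairs_count = 0
    reads_q30 = 0
    bases_q30 = 0
    for _header, sequence, _separator, quality in records:
        scores = [ord(ch) - phred_offset for ch in quality.strip()]
        sequence_count += 1
        basepairs_count += len(sequence.strip())
        bases_q30 += sum(1 for q in scores if q >= q_threshold)
        # A read counts toward sequence_q30 when its mean quality reaches the threshold.
        if scores and sum(scores) / len(scores) >= q_threshold:
            reads_q30 += 1

    return {
        "sequence_count": sequence_count,
        "basepairs_count": basepairs_count,
        # average_length is basepairs_count divided by sequence_count, as an integer.
        "average_length": round(basepairs_count / sequence_count) if sequence_count else 0,
        "sequence_q30": round(100 * reads_q30 / sequence_count) if sequence_count else 0,
        "basepairs_q30": round(100 * bases_q30 / basepairs_count) if basepairs_count else 0,
        # The MD5 digest is taken over the exact bytes that would be transferred.
        "checksum": "MD5: " + hashlib.md5(raw).hexdigest(),
    }
```

The function takes the file content as bytes so that the checksum covers exactly the bytes being transferred. For large files a streaming reader would be more appropriate than loading everything into memory.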

Minimal technical metadata for `Metagenome Assembled Genome (MAG) FASTA` file

metadata	definition	reference of definition[<url_to_definition>]	expected unit of measurement	example	source
run_ref	Accessions/identifiers linking to the raw data (FASTQ)	Link to reference	run_accession in the format SRR, ERR or DRR	e.g. RUN_REF accession = “ERR178314”	Adapted from ENA (“ENA How to Submit Other Analyses: Submitting Read Alignments”)
tax_ident	The phylogenetic marker(s) used to assign an organism name to the genome	MIXS:0000053	free text	e.g. 16S rRNA gene, multi-marker approach, other	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
assembly_qual	The assembly quality category is based on sets of criteria outlined for each assembly quality category. For MISAG/MIMAG; Finished: Single, validated, contiguous sequence per replicon without gaps or ambiguities with a consensus error rate equivalent to Q50 or better. High Quality Draft: Multiple fragments where gaps span repetitive regions. Presence of the large subunit (LSU) RNA, small subunit (SSU) and the presence of 5.8S rRNA or 5S rRNA depending on whether it is a eukaryotic or prokaryotic genome, respectively. Medium Quality Draft: Many fragments with little to no review of assembly other than reporting of standard assembly statistics. Low Quality Draft: Many fragments with little to no review of assembly other than reporting of standard assembly statistics. Assembly statistics include, but are not limited to, total assembly size, number of contigs, contig N50/L50, and maximum contig length. Genome fragment(s): One or multiple fragments, totalling < 90% of the expected genome or replicon sequence, or for which no genome size could be estimated	MIXS:0000056	free text from predetermined strings	e.g. Medium Quality Draft	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”)
assembly_software	Tool(s) used for assembly, including version number and parameters	MIXS:0000058	free text	e.g. metaSPAdes (3.11.0);kmer set 21,33,55,77,99,121, default parameters otherwise	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”)
compl_score	Completeness score is typically based on either the fraction of markers found as compared to a database or the percent of a genome found as compared to a closely related reference genome. High Quality Draft: >90%, Medium Quality Draft: >50%, and Low Quality Draft: < 50% should have the indicated completeness scores	MIXS:0000069	integer value (%)	e.g. med; 60%	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
compl_software	Tools used for completion estimate	MIXS:0000070	free text string	e.g. checkm (v1.1.6)	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
contam_score	The contamination score is based on the fraction of single-copy genes that are observed more than once in a query genome. The following scores are acceptable for; High Quality Draft: < 5%, Medium Quality Draft: < 10%, Low Quality Draft: < 10%. Contamination must be below 5% for a SAG or MAG to be deposited into any of the public databases.	MIXS:0000072	integer value (%)	e.g. 0.01	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
mag_cov_software	Tool(s) used to determine the genome coverage if coverage is used as a binning parameter in the extraction of genomes from metagenomic datasets	MIXS:0000080	Free text	e.g. bbmap	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
bin_software	Tool(s) used for the extraction of genomes from metagenomic datasets, where possible include a product ID (PID) of the tool(s) used	MIXS:0000078	Free text	e.g. MaxBin 2.0 (https://doi.org/10.1093/bioinformatics/btv638)	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
bin_param	The parameters that have been applied during the extraction of genomes from metagenomic datasets	MIXS:0000077	Free text	e.g. kmer, coverage, etc	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
coverage	The estimated depth of sequencing coverage	Link to usecase	integer value	e.g. 30.5	ENA Submitting Metagenome Assemblies (“ENA Submitting Metagenome Assemblies”)
number_contig	Total number of contigs in the cleaned/submitted assembly that makes up a given genome, SAG, MAG, or UViG	MIXS:0000060	integer value	e.g. 40	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”), Roadmap for naming uncultivated Archaea and Bacteria (Murray et al. 2020)
N50	The length of the shortest contig in the set of longest contigs that together make up at least half of the total assembly length	Link to reference 1 Link to reference 2	integer value + unit	e.g. N50=4kb	Roadmap for naming uncultivated Archaea and Bacteria (Murray et al. 2020)
x16s_recover (optional)	Can a 16S gene be recovered from the submitted sequence	MIXS:0000065	free text string	e.g. yes	Adapted from GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
x16s_recover_software (optional)	Tools used for 16S rRNA gene extraction	MIXS:0000066	free text string	e.g. rambl (v2); default parameters	Adapted from GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
trnas (optional)	Total number of tRNAs identified from the genome	MIXS:0000067	integer value	e.g. 18	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”), Roadmap for naming uncultivated Archaea and Bacteria (Murray et al. 2020)
trna_ext_software (optional)	Tools used for tRNA identification	MIXS:0000068	free text string	e.g. infernal (v2); default parameters	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
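
Two of the fields in the table above follow from simple calculations: N50 from the contig lengths, and a draft category from compl_score and contam_score. The Python sketch below illustrates both under stated assumptions and should not be read as the MIMAG procedure itself. It uses only the completeness and contamination thresholds quoted in the table and ignores the rRNA and tRNA criteria that also apply to a High Quality Draft. Treating a completeness of exactly 50% as Medium Quality Draft, and returning `None` when contamination is 10% or higher, are assumptions made here because the table does not cover those cases. The function names `n50` and `draft_category` are hypothetical.

```python
from typing import List, Optional


def n50(contig_lengths: List[int]) -> int:
    """Return the length of the shortest contig among the longest contigs
    that together cover at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-negative contig lengths")


def draft_category(compl_score: float, contam_score: float) -> Optional[str]:
    """Map completeness and contamination percentages to a draft category.

    Only the thresholds quoted in the table are applied; the rRNA and tRNA
    requirements for a High Quality Draft are deliberately not checked.
    """
    if contam_score >= 10:
        return None  # assumption: outside every category listed in the table
    if compl_score > 90 and contam_score < 5:
        return "High Quality Draft"
    if compl_score >= 50:  # assumption: exactly 50% counts as Medium
        return "Medium Quality Draft"
    return "Low Quality Draft"
```

For example, contigs of lengths 2, 2, 2, 3, 3, 4, 8 and 8 total 32, and the two longest contigs already cover 16, so the N50 is 8.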

References

Bowers, R., N. Kyrpides, R. Stepanauskas, et al. 2017. “Minimum Information about a Single Amplified Genome (MISAG) and a Metagenome-Assembled Genome (MIMAG) of Bacteria and Archaea.” Nat Biotechnol 35: 725–31. https://doi.org/10.1038/nbt.3893.

“ENA How to Submit Other Analyses: Submitting Read Alignments.” https://ena-docs.readthedocs.io/en/latest/submit/analyses/read-alignments.html.

“ENA Metadata Validation: Instrument.” https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#instrument.

“ENA Metadata Validation: Selection.” https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#selection.

“ENA Metadata Validation: Source.” https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#source.

“ENA Metadata Validation: Strategy.” https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#strategy.

“ENA Submitting Metagenome Assemblies.” https://ena-docs.readthedocs.io/en/latest/submit/assembly/metagenome.html.

Field, D., G. Garrity, T. Gray, N. Morrison, J. Selengut, P. Sterk, T. Tatusova, et al. 2008. “The Minimum Information about a Genome Sequence (MIGS) Specification.” Nature Biotechnology. https://doi.org/10.1038/nbt1360.

“GSC MIXS: MIGSBacteria.” https://genomicsstandardsconsortium.github.io/mixs/MIGSBacteria/.

“GSC MIXS: MIMAG.” https://genomicsstandardsconsortium.github.io/mixs/MIMAG/.

Leinonen, R., H. Sugawara, M. Shumway, and International Nucleotide Sequence Database Collaboration. 2011. “The Sequence Read Archive.” Nucleic Acids Research 39 (Database issue): D19–21. https://doi.org/10.1093/nar/gkq1019.

Murray, A. E., J. Freudenstein, S. Gribaldo, et al. 2020. “Roadmap for Naming Uncultivated Archaea and Bacteria.” Nat Microbiol 5: 987–94. https://doi.org/10.1038/s41564-020-0733-x.

NCBI Hackathons. “SRA-Tinder: A Tool to Discover Related Sequence Read Archive (SRA) Experiments.” https://github.com/NCBI-Hackathons/SRA_Tinder.

Rocca-Serra, Philippe, Alasdair J G Gray, Alejandra Delfin Rossaro, Andrea Splendiani, Andrea Zaliani, Andreas Pippow, Anne Cambon-Thomsen, et al. 2022. “The FAIR Cookbook.” https://github.com/FAIRplus/the-fair-cookbook/.

Yilmaz, Pelin et al. 2011. “Minimum Information about a Marker Gene Sequence (MIMARKS) and Minimum Information about Any (x) Sequence (MIxS) Specifications.” Nature Biotechnology 29 (5): 415–20. https://doi.org/10.1038/nbt.1823.

##################################

Minimal technical metadata for `Metagenomic FASTA` file

metadata	definition	examples	source
run_ref	Accessions/identifiers linking to the raw data (FASTQ)	e.g. accession = “ERR178314”	Adapted from ENA (“ENA How to Submit Other Analyses: Submitting Read Alignments”)
tax_ident	The phylogenetic marker(s) used to assign an organism name to the SAG or MAG	e.g. 16S rRNA gene, multi-marker approach, other	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
assembly_qual	Assembly quality category	e.g. Medium Quality Draft	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”)
assembly_software	Tool(s) used, version and parameters	e.g. metaSPAdes (3.11.0);kmer set 21,33,55,77,99,121, default parameters otherwise	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”)
coverage	The estimated depth of sequencing coverage (in x)		ENA Submitting Metagenome Assemblies (“ENA Submitting Metagenome Assemblies”)
number_contig	Total number of contigs	e.g. 40	GSC MIxS/MIGS Bacteria (“GSC MIXS: MIGSBacteria”), Roadmap for naming uncultivated Archaea and Bacteria (Murray et al. 2020)
LSU_recover	Detection of the 23S rRNA (BA) or 5.8S/28S rRNA (E)	e.g. yes	Adapted from GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
LSU_recover_software	Tools for LSU extraction		Adapted from GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
SSU_recover	Detection of the 16S rRNA (BA) or 18S rRNA (E)	e.g. yes	Adapted from GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
SSU_recover_software	Tools for SSU extraction	e.g. rambl (v2); default parameters	Adapted from GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
trnas	Total number of tRNAs identified from the MAG	e.g. 18	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”), Roadmap for naming uncultivated Archaea and Bacteria (Murray et al. 2020)
trna_ext_software	Tools used for tRNA identification	e.g. infernal (v2); default parameters	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
compl_score	Completeness score	e.g. med; 60%	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
compl_software	Tools used for completion estimate	e.g. checkm (v1.1.6)	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
contam_score	Contamination score	e.g. 0.01	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
contam_software	Tool(s) used in contamination screening	e.g. checkm (v1.1.6)	GSC MIXS: MIMAG (“GSC MIXS: MIMAG”)
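
Since these metadata are meant to be shared as a tab-separated table, a record can be checked and serialized with a few lines of Python. The sketch below is illustrative only. The column subset in `FIELDS`, the function name `to_tsv` and the regular expression are assumptions: the pattern simply encodes the SRR, ERR or DRR run accession prefixes followed by at least six digits.

```python
import csv
import io
import re

# Assumed pattern: run accessions start with SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d{6,}$")

# Hypothetical subset of the columns listed in the table above.
FIELDS = ["run_ref", "tax_ident", "assembly_qual", "number_contig", "trnas", "compl_software"]


def to_tsv(records: list) -> str:
    """Validate run_ref and serialize records (dicts keyed by FIELDS) as tab-separated text."""
    for record in records:
        if not RUN_ACCESSION.match(str(record.get("run_ref", ""))):
            raise ValueError(f"run_ref is not a run accession: {record.get('run_ref')!r}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


# Example record built from the example values in the table.
example = {
    "run_ref": "ERR178314",
    "tax_ident": "multi-marker approach",
    "assembly_qual": "Medium Quality Draft",
    "number_contig": 40,
    "trnas": 18,
    "compl_software": "checkm (v1.1.6)",
}
print(to_tsv([example]))
```

Validating before writing means a malformed accession stops the export instead of ending up in a shared table.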

Comments/questions:
Is coverage factored into completeness? If not, it seems we should consider separating genome coverage and sequence depth -NME 27APR22
We need a reference for consensus definitions of these terms to avoid confusion -CP 17JUL22
Took the definitions from ENA Submitting Metagenome Assemblies, replaced coverage and depth with one definition -MB 11AUG23

OBS: THE MIM FOR FASTA METAGENOME HAVE TO BE FURTHER DISCUSSED IF NEEDED OR IF IT SHOULD BE MERGED IN A DIFFERENT CATEGORY.
