Note for existing users: as of the version published on 15/09/25, metathresholds requires a different database structure, so you'll need to produce a fresh metathresholds_db.csv file the first time you run metathresholds after updating to this version.
This R package is designed to apply filters to the raw outputs of metagenomics classifiers like Kraken2. This is because such classifiers sometimes return very large lists of species, some of which are not truly present in your sample. This package allows you to quickly compare to the negative control and calculate which species are likely to be present based on the thresholds you input.
This R package is not yet available on CRAN so it can be installed as follows using RStudio.
- Install R devtools if required.
install.packages(“devtools”)
- Install the latest version of the package from this repo.
devtools::install_github("sarah-buddle/metathresholds")
library(metathresholds)
If you're running the package for the first time, you need a filled in samplesheet and thresholds file. You should also provide a filepath to wherever you want to create the taxonomizr and metathresholds db file (so the folder should exist, but not the file).
Samplesheet: Path to csv samplesheet file formatted according to the provided template. Leave the column names exactly as they are in the template, and just leave cells blank where you don't have outputs for that classifier.
-
sample_id: You should list the sample IDs of all your samples, including negative controls, in this column.
-
corresponding_control: Contains the sample_id of the corresponding negative control for each sample. For the negative controls, write control. You can provide up to 3 corresponding controls per sample. If you have less than this, just leave the remaining corresponding control columns blank, or put NA. We recommend that you run a negative control with every sample. However, if you don't have any, you can leave these columns blank and any thresholds you provide for reads_ratio and rpm_ratio will be disregarded. Don't mix samples with and without negative controls in the same metathresholds samplesheet - run them separately. You can however run samples with different numbers of corresponding controls (1-3) together.
-
dnarna: Either DNA, RNA or DNARNA, as appropriate for your sample. This is only used if you use the combineDNARNA function.
-
dnarna_pair: If you want to combine separate DNA and RNA seq results for the same sample, give each DNA-RNA pair a unique name in this column. If you don't want to combine anything, just use the sample_id here.
-
raw_read_1: Number of raw reads (R1 only for paired-end reads). This is used for calculating reads per million values.
-
raw_read_2: Number of raw reads (R2 only for paired-end reads, leave blank for single-end reads). This is used for calculating the reads per million values.
-
xx_filepath: Full global filepath to the output file of the relevant taxonomic classifier. For Kraken2-based classifiers and metaMix, this is the report, not the read-level file. For CZID, the csv report should be downloaded individually for each sample. For OneCodex, it expects that the report will contain results for more than one sample, and looks for the sample_id given in the Sample Name field of the OneCodex output. If your classifier is not in the list, you can use custom_filepath. This should be a two-column csv file with colnames "taxid" and "reads", containing the number of reads assigned to NCBI taxids by your classifier (most should provide this).
Thresholds file: Path to csv file thresholds file formatted according to the provided template. These are the thresholds that will be applied to your data. You can set different thresholds for bacteria, viruses, fungi and other eukaryotes, and for pathogens and non-pathogens. Taxa have to pass all thresholds to be called as positive. You should still include all the columns and rows shown in the template, even if you plan to set some thresholds to 0 (as we would recommend!).
Species-specific thresholds file (optional): Allows you to override your main thresholds for specific taxa. You should still include all the columns and rows shown in the template, even if you plan to set some thresholds to 0. The name column provided in the file is disregarded by metathresholds - it uses the taxids only.
-
reads: raw reads. A reads threshold of 3 for bacteria means reject any bacteria with fewer than 3 reads assigned.
-
reads_ratio reads ratio = reads in sample / reads in negative control. Can be used to compare with negative control.
-
rpm reads per million. Helps normalise for differences in sequencing depth between samples.
-
rpm_ratio: reads per million ratio = reads per million in sample / reads per million in negative control. Can be used to compare with negative control.
-
internal_control_ratio: internal control ratio = reads / reads assigned to internal control. If you don't set an internal control taxid, the internal control ratio for each species will be set to the number of reads.
-
internal_control_taxid: taxid to use as internal control for calculation of internal control ratio. Leave as 0 if you don't want to use the internal control ratio threshold.
-
proportion_raw: proportion of total raw reads in the sample. Range: 0-1. A proportion_raw of 0.01 means reject any taxid that makes up less than 1% of the total raw reads.
-
proportion_nonhuman_classified: proportion of the reads that are classified to something other than human (i.e. the microbial reads). Range: 0-1.
-
proportion_type proportion of the reads of that microbe type. Range: 0-1. A proportion_type of 0.01 for viruses means reject any taxid that makes up less than 1% of the reads classified as viral.
-
proportion_family proportion of the reads of that microbial family. Range: 0-1. A proportion_family of 0.01 means reject any taxid that makes up less than 1% of the reads classified within that taxonomic family.
-
proportion_genus proportion of the reads of that genus. Range: 0-1. A proportion_family of 0.01 means reject any taxid that makes up less than 1% of the reads of that genus.
-
low_level: If TRUE, only apply reads per million ratio threshold where there is at least one read corresponding to the taxon in the negative control. Where there are no reads corresponding to a particular taxon in the negative control, you can't calculate a reads per million ratio because you'd be dividing by 0. In this case, the reads per million ratio is calculated as if there was just one read in the control, which means the rpmr = number of raw reads in sample. This means that by default, if you have a virus_rpm of 5, you would automatically reject any viruses with fewer than 5 reads. To avoid this and still accept say a virus with 4 reads if there are no reads in the control (but reject if there are any reads in the control and this brings it under the rpm ratio you define), set this as TRUE.
-
max_positive Return a maximum number of species per type. For example, if you are only interested in the top 10 most abundant bacteria, set max_positive for bacteria to 10. If you don't have a maximum, set these values very high. This will reject lower abundance species even when species-specific thresholds have been set, so we would recommend that in general you don't use this option if you're providing species-specific thresholds.
Taxonomizr db file: Path to nameNode.sqlite file generated by taxonomizr::prepareDatabase (full global path, not just directory, even if the file doesn't exist yet). metathresholds depends on the taxonomizr package by Scott Sherrill-Mix (https://cran.rproject.org/web/packages/taxonomizr/index.html). If you haven't provided this database file, the package will generate one automatically, but this only has to be done once, and you can provide the filepath to speed up subsequent runs. This file is built using two taxonomy files called names.dmp and nodes.dmp. If these are present in the folder where you create the nameNode.sqlite file, it uses these rather than downloading new ones. Metathresholds works best if you use the same taxonomy files (nodes.dmp and names.dmp) as were used to build your metagenomics database. (NB if you're downloading recent prebuilt kraken2 databases from https://benlangmead.github.io/aws-indexes/k2, these are included in the kraken database folder.) If you don’t have them, if you know roughly when the database was built (or rather when the fastq files used to build it were downloaded), you can download the files closest in date to them at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/ and then put them in the folder where you want to build the nameNode.sqlite file. You can skip these steps and the package will automatically produce a nameNode.sqlite based on just the latest taxonomy. However, this may mean some taxon IDs cannot be identified. If you're not using the exact names.dmp and nodes.dmp files used to build your database, it is highly recommended that you specify an invalid taxonomy filepath (see below) and check its output. You can manually add entries to your metathresholds db file (see below) if needed.
Metathresholds db file: Path to csv file where metathresholds can store its database (full global path, not just directory, even if the file doesn't exist yet). metathresholds creates its own database of all of the taxons in your analysis and whether they are a bacteria, virus, fungus etc and if they are a subspecies, which species they belong to. Again, if you haven't provided this database file, the package will generate one automatically, but you can provide the filepath to save the database in a location of your choice and speed up subsequent runs.
Additional taxonomy file (optional): Path to csv file containing an additional database, formatted in the same way as the metathresholds db file. Allows you to provide a database you have produced manually.
Invalid taxonomy filepath (optional but recommended): Path to output csv file to which to write invalid taxids from input.
Pathogens filepath (optional): Path to csv file containing known pathogens. By default, the provided list will be used, but you can create your own list using cleanKnownPathogensContaminants if preferred (see below).
Contaminants filepath (optional): Path to csv file containing known contaminants. Produce this file using cleanKnownPathogensContaminants (see below).
Positive taxids list (optional): Path to csv containing the taxon ids you expect to find (one on each line). Your csv can contain other columns as well, but these won't be used. If you know what species you are expected to be in your samples, you can provide a list of the corresponding taxon IDs and this allows metathresholds to calculate the sensitivity/specificity etc.
Other options
-
include_history: This option allows you to keep a record of the mean and standard deviation of the read per milliom (RPM) value of each taxa across all previous runs with the same metathreshold_db.csv where this option is enabled. It is designed so if you are regularly running samples of a similar type, you can quickly identify any new taxa, which may represent clinically relevant organisms or new laboratory contaminants. If TRUE, uses the RPM values from the datasets to update the mean and standard deviation in the database file. When this option is enabled, a backup of the database CSV file will be generated in case you wish to revert the running totals to their previous state.
-
exclude_contaminants: If you wish to exclude any reads assiged to contaminants from your contaminants_filepath from the proportion calculations and not call them as positive, set to TRUE.
-
collapse_species: If TRUE, assigns any reads classified as subspecies or strains to their corresponding species.
-
keep_species_only: If TRUE, only keep taxa with the rank species in the results. If true, we recommend that collapse_species is also set to true.
-
virus_only: If TRUE, only keep viruses in the final output.
The main function is makeFullReport, which will import all your files from the filepaths on your samplesheet and call them as positive or negative according to your thresholds. A typical command might look like:
report <- makeFullReport(samplesheet_filepath = "samplesheet.csv",
taxonomizr_sql = "nameNode.sqlite",
db_filepath = "metathresholds_db.csv",
thresholds_filepath = "thresholds.csv",
invalid_taxids = "invalid_taxids.csv")
If you've run separate DNA and RNA Seq on the same sample, you might want to combine the results and keep whichever gives the higher result. To do this, run:
combined_report <- combineDNARNA(report)
If you've run makeFullReport and provided a positive species list, you can count the total number of true and false positive and negative species and calculate sensitivity and specificity.
report_stats <- countSpecies(combined_report)
By default, metathresholds will use the provided list of human pathogens, which was adapted from CZ ID's pathogen list. If you want to make your own list, create a csv file with the NCBI taxids of each species you wish to include. It can contain other columns if desired. Then run:
cleanKnownPathogensContaminants(input_taxids_filepath = "your_known_pathogens_input.csv",
knownPathogensContaminants_filepath = "known_pathogens_custom.csv",
taxonomizr_sql = "nameNode.sqlite")
Then provide the "known_pathogens_custom.csv" to makeFullReport.
You may wish to exclude known contaminants, which may be specific to your laboratory, from your analysis. First, create a csv file with the NCBI taxids of each species you wish to include. It can contain other columns if desired. Then run:
cleanKnownPathogensContaminants(input_taxids_filepath = "your_known_contaminants_input.csv",
knownPathogensContaminants_filepath = "contaminants_custom.csv",
taxonomizr_sql = "nameNode.sqlite")
Then provide the "contaminants_custom.csv" to makeFullReport.