PICRUSt2-v2.6.0 (#372)

Summary:

- Updating to PICRUSt2-v2.6.0 to work with new PICRUSt2-MPGA database. Please see [Wiki page](https://github.com/picrust/picrust2/wiki/PICRUSt2%E2%80%90MPGA-database) for full details.
- New PICRUSt2-MPGA database contains separate reference files and annotations for bacteria and archaea.
- New default workflow:
  - Previous place sequences step is split to placing in bacterial and archaeal reference trees by default.
  - Hidden state prediction is run separately for 16S for bacteria and archaea to determine NSTI for all sequences for both reference sets. The best-fitting domain for each sequence is then chosen (based on lowest NSTI) and the files are filtered to contain only those that fit the best.
  - Hidden state prediction is then run for KOs and EC numbers for both bacteria and archaea.
  - Combine bacterial and archaeal predictions for KOs and EC numbers.
- Closest reference genome is now given for all sequences.

All squashed commits:

* Add new reference files
  Add new reference files - separate files for bacteria and archaea
* Add check for duplicated IDs in the input FASTA
  Previously, there was a check that looked for overlap between all ASVs in the input FASTA and the feature table, but there were downstream errors that could be caused by having duplicated sequence IDs in the input fasta. This fix adds a check that no sequence IDs in the input FASTA appear more than once.
* Update files
  - Zipped default file that was previously unzipped
  - Added metacyc reaction mapping pathways (modified from HUMAnN3)
  - Added line to castor_hsp.R that ensures no issues running maximum parsimony method even if edge lengths of tree have zeroes
* Split domains and new default files
  Some new scripts have been added:
  - default_split.py: locations of new default files for when we're running bacteria/archaea separately
  - split_domains.py: functions for choosing the best domain for each sequence based on which has the lowest NSTI
  - pick_best_domain.py: wrapper for picking the best domain to use for each sequence when we're running bacteria/archaea separately. Note that this would be run between hsp.py with the 16S/marker gene file and running hsp.py with any other trait files
  - combine_domains.py: wrapper for combining functional predictions from hsp.py for when we're running multiple domains. This would be run before the metagenome_pipeline step
  - Functions in util.py have been added for steps like reading in and pruning the tree files
* Add scripts for splitting to bacteria and archaea
  Update to add scripts for running both bacterial and archaeal predictions
* Added requirement for ete3
  Added requirement for ete3 to yaml file
* Update pipeline_split.py
  Added check for when no sequences match with one of the domains
* pipeline_split.py fix
* Add information on bacterial and archaeal genomes
* Add closest genome match to NSTI file
* Fix genome ID
* Fix genome name in NSTI file
* Make add_descriptions_split.py and fix ko_name.txt.gz
* Fix issue where script broke if skip_norm was set
* Update version number
* Updated scripts for two domains running each step individually
* Changing names of PICRUSt2 scripts that run using the previous database
* Revert back to previous file names
* Update scripts to have oldIMG database option
* Add archaea raxml_info file
  Note that this file still needs testing
* Update arc_ref.raxml_info
  Archaea reference files now work with SEPP
* Fix test files (still for working with oldIMG database currently)
* Update to fix tests (still using oldIMG at this point)
* Still trying to fix tests, still not updated to PICRUSt2-v2.6.0
* Update config options before push
* Fixed test_workflow.py
  Added new -db flag to pathway_pipeline.py
* Added necessary raxml_info files for use with SEPP for each of bacteria and archaea
* Add raxml_info to default file
* Update arc_ref.raxml_info
* Update bac_ref.raxml_info
* Update arc_ref.raxml_info
* Remove DS_Store
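
The domain-selection step described in the summary (choose, per sequence, the domain with the lowest NSTI, then filter each domain's files to the sequences assigned to it) can be sketched roughly as below. This is an illustration only, not the code in split_domains.py or pick_best_domain.py: the function names, the "bac"/"arc" labels, the tie-break in favour of bacteria, and the handling of sequences placed in only one tree are all assumptions.

```python
from typing import Dict


def pick_best_domain(nsti_bac: Dict[str, float],
                     nsti_arc: Dict[str, float]) -> Dict[str, str]:
    """Assign each sequence ID to the domain in which its NSTI is lowest.

    Hypothetical helper. A sequence missing from one domain is treated as
    having an infinite NSTI there, and ties go to bacteria (both assumptions).
    """
    best = {}
    for seq_id in sorted(set(nsti_bac) | set(nsti_arc)):
        bac = nsti_bac.get(seq_id, float("inf"))
        arc = nsti_arc.get(seq_id, float("inf"))
        best[seq_id] = "bac" if bac <= arc else "arc"
    return best


def filter_to_domain(values: Dict[str, float],
                     best: Dict[str, str],
                     domain: str) -> Dict[str, float]:
    """Keep only the sequences whose best-fitting domain is `domain`."""
    return {seq_id: value for seq_id, value in values.items()
            if best.get(seq_id) == domain}


if __name__ == "__main__":
    # Made-up NSTI values, for illustration only.
    nsti_bac = {"ASV1": 0.02, "ASV2": 1.5, "ASV3": 0.3}
    nsti_arc = {"ASV1": 0.9, "ASV2": 0.05}
    best = pick_best_domain(nsti_bac, nsti_arc)
    print(best)
    print(filter_to_domain(nsti_bac, best, "bac"))
```

In the workflow described above, a selection like this would sit between the 16S hidden state prediction and the KO/EC predictions, so that each domain's trait predictions are only kept for the sequences assigned to that domain.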
picrust · Jan 27, 2025 · 9ed5110
1 parent c038d18
commit 9ed5110
Showing 66 changed files with 805,585 additions and 160 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ tests/__pycache__/
 picrust2/default_files/prokaryotic/untracked/
 picrust2/default_files/prokaryotic/pro_ref/pro_ref.fna.COPY
 tests/test_data/pathway_pipeline/per_seq_contrib/humann2_run/*
+.DS_Store
diff --git a/picrust2-env.yaml b/picrust2-env.yaml
@@ -27,3 +27,4 @@ dependencies:
 - r-castor >=1.7.2
 - scipy >=1.2.1
 - sepp=4.4.0
+- ete3=3.1.3
diff --git a/picrust2/Rscripts/castor_hsp.R b/picrust2/Rscripts/castor_hsp.R
@@ -7,6 +7,7 @@ Args <- commandArgs(TRUE)
 
 # Read in command-line arguments.
 full_tree <- read_tree(file=Args[1], check_label_uniqueness = TRUE)
+full_tree$edge.length[which(full_tree$edge.length == 0)] <- 0.00001
 trait_values <- read.delim(Args[2], check.names=FALSE, row.names=1)
 hsp_method <- Args[3]
 edge_exponent_set <- as.numeric(Args[4])

diff --git a/picrust2/Rscripts/castor_nsti.R b/picrust2/Rscripts/castor_nsti.R
@@ -26,9 +26,13 @@ known_tip_range <- which(! full_tree$tip.label %in% unknown_tips)
 nsti_values <- find_nearest_tips(full_tree,
                                  target_tips=known_tip_range,
                                  check_input=TRUE)$nearest_distance_per_tip[unknown_tips_index]
+nsti_genomes <- find_nearest_tips(full_tree,
+                                 target_tips=known_tip_range,
+                                 check_input=TRUE)$nearest_tip_per_tip[unknown_tips_index]
+nsti_genomes = full_tree$tip.label[nsti_genomes]
 
 # Make dataframe of study sequences (unknown tips) and nsti values as 2nd column.
-write.table(x = data.frame("sequence" = unknown_tips, "metadata_NSTI" = nsti_values),
+write.table(x = data.frame("sequence" = unknown_tips, "metadata_NSTI" = nsti_values, "closest_reference_genome" = nsti_genomes),
             file = output_path,
             sep="\t",
             quote = FALSE,
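
With this change the NSTI output gains a third column, `closest_reference_genome`, next to `sequence` and `metadata_NSTI`. A minimal sketch of reading such a table downstream is shown here. The helper name and the example rows are hypothetical, and it assumes the file is tab-separated with a header row and no row-name column (the rest of the `write.table` call is not visible in this hunk).

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def read_nsti_table(handle: TextIO) -> Dict[str, Tuple[float, str]]:
    """Map each sequence ID to (NSTI, closest reference genome).

    Hypothetical helper; column names are taken from the data.frame above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sequence"]: (float(row["metadata_NSTI"]),
                              row["closest_reference_genome"])
            for row in reader}


if __name__ == "__main__":
    # Made-up rows in the assumed three-column layout.
    example = ("sequence\tmetadata_NSTI\tclosest_reference_genome\n"
               "ASV1\t0.02\tgenomeA\n"
               "ASV2\t0.35\tgenomeB\n")
    for seq_id, (nsti, genome) in read_nsti_table(io.StringIO(example)).items():
        print(seq_id, nsti, genome)
```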

diff --git a/picrust2/default.py b/picrust2/default.py
@@ -4,60 +4,89 @@
 
 project_dir = path.dirname(path.abspath(__file__))
 
-default_ref_dir = path.join(project_dir, "default_files", "prokaryotic",
-                            "pro_ref")
+default_ref_dir_bac = path.join(project_dir, "default_files", "bacteria",
+                            "bac_ref")
 
-default_fasta = path.join(default_ref_dir, "pro_ref.fna")
+default_ref_dir_arc = path.join(project_dir, "default_files", "archaea",
+                            "arc_ref")
 
-default_tree = path.join(default_ref_dir, "pro_ref.tre")
+default_fasta_bac = path.join(default_ref_dir_bac, "bac_ref.fna")
 
-default_hmm = path.join(default_ref_dir, "pro_ref.hmm")
+default_fasta_arc = path.join(default_ref_dir_arc, "arc_ref.fna")
 
-default_model = path.join(default_ref_dir, "pro_ref.model")
+default_tree_bac = path.join(default_ref_dir_bac, "bac_ref.tre")
 
-default_raxml_info = path.join(default_ref_dir, "pro_ref.raxml_info")
+default_tree_arc = path.join(default_ref_dir_arc, "arc_ref.tre")
+
+default_hmm_bac = path.join(default_ref_dir_bac, "bac_ref.hmm")
+
+default_hmm_arc = path.join(default_ref_dir_arc, "arc_ref.hmm")
+
+default_model_bac = path.join(default_ref_dir_bac, "bac_ref.model")
+
+default_model_arc = path.join(default_ref_dir_arc, "arc_ref.model")
+
+default_raxml_info_bac = path.join(default_ref_dir_bac, "bac_ref.raxml_info")
+
+default_raxml_info_arc = path.join(default_ref_dir_arc, "arc_ref.raxml_info")
 
 default_regroup_map = path.join(project_dir, "default_files",
                                 "pathway_mapfiles",
-                                "ec_level4_to_metacyc_rxn.tsv")
+                                "ec_level4_to_metacyc_rxn_new.tsv")
 
 default_pathway_map = path.join(project_dir, "default_files",
                                 "pathway_mapfiles",
-                                "metacyc_path2rxn_struc_filt_pro.txt")
+                                "metacyc_pathways_structured_filtered_v24_subreactions.txt")
 
-fungi_pathway_map = path.join(project_dir, "default_files", "pathway_mapfiles",
-                              "metacyc_path2rxn_struc_filt_fungi.txt")
+#fungi_pathway_map = path.join(project_dir, "default_files", "pathway_mapfiles",
+#                              "metacyc_path2rxn_struc_filt_fungi.txt")
 
 # Inititalize default trait table files for hsp.py.
-prokaryotic_dir = path.join(project_dir, "default_files", "prokaryotic")
+bacteria_dir = path.join(project_dir, "default_files", "bacteria")
+
+default_tables_bac = {"16S": path.join(bacteria_dir, "16S.txt.gz"),
+
+                  "EC": path.join(bacteria_dir, "ec.txt.gz"),
+
+                  "KO": path.join(bacteria_dir, "ko.txt.gz"),
 
-default_tables = {"16S": path.join(prokaryotic_dir, "16S.txt.gz"),
+                  "GO": path.join(bacteria_dir, "go.txt.gz"),
 
-                  "COG": path.join(prokaryotic_dir, "cog.txt.gz"),
+                  "PFAM": path.join(bacteria_dir, "pfam.txt.gz"),
 
-                  "EC": path.join(prokaryotic_dir, "ec.txt.gz"),
+                  "BIGG": path.join(bacteria_dir, "bigg_reaction.txt.gz"),
 
-                  "KO": path.join(prokaryotic_dir, "ko.txt.gz"),
+                  "CAZY": path.join(bacteria_dir, "cazy.txt.gz"),
 
-                  "PFAM": path.join(prokaryotic_dir, "pfam.txt.gz"),
+                  "GENE_NAMES": path.join(bacteria_dir, "preferred_name.txt.gz")}
 
-                  "TIGRFAM": path.join(prokaryotic_dir, "tigrfam.txt.gz"),
+archaea_dir = path.join(project_dir, "default_files", "archaea")
 
-                  "PHENO": path.join(prokaryotic_dir, "pheno.txt.gz")}
+default_tables_arc = {"16S": path.join(archaea_dir, "16S.txt.gz"),
+
+                  "EC": path.join(archaea_dir, "ec.txt.gz"),
+
+                  "KO": path.join(archaea_dir, "ko.txt.gz"),
+
+                  "GO": path.join(archaea_dir, "go.txt.gz"),
+
+                  "PFAM": path.join(archaea_dir, "pfam.txt.gz"),
+
+                  "BIGG": path.join(archaea_dir, "bigg_reaction.txt.gz"),
+
+                  "CAZY": path.join(archaea_dir, "cazy.txt.gz"),
+
+                  "GENE_NAMES": path.join(archaea_dir, "preferred_name.txt.gz")}
 
 
 # Initialize default mapfiles to be used with add_descriptions.py
 map_dir = path.join(project_dir, "default_files", "description_mapfiles")
 
 default_map = {"METACYC": path.join(map_dir,
-                                    "metacyc_pathways_info.txt.gz"),
-
-               "COG": path.join(map_dir, "cog_info.tsv.gz"),
-
-               "EC": path.join(map_dir, "ec_level4_info.tsv.gz"),
+                                    "metacyc-pwy_name.txt.gz"),
 
-               "KO": path.join(map_dir, "ko_info.tsv.gz"),
+               "EC": path.join(map_dir, "ec_name.txt.gz"),
 
-               "PFAM": path.join(map_dir, "pfam_info.tsv.gz"),
+               "KO": path.join(map_dir, "ko_name.txt.gz"),
 
-               "TIGRFAM": path.join(map_dir, "tigrfam_info.tsv.gz")}
+               "GO": path.join(map_dir, "map_go_name.txt.gz")}
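
The per-domain defaults above follow a regular pattern: `default_files/<domain directory>/<prefix>/<prefix>.<extension>`. The sketch below restates that pattern as a lookup. It is not part of this commit: `ref_files_for`, `DOMAIN_LAYOUT` and `REF_EXTENSIONS` are hypothetical names, and only the directory names, prefixes and extensions are taken from the diff.

```python
from os import path

# Directory name and file prefix per domain, as they appear in default.py.
DOMAIN_LAYOUT = {"bac": ("bacteria", "bac_ref"),
                 "arc": ("archaea", "arc_ref")}

# Reference file kinds and their extensions, as they appear in default.py.
REF_EXTENSIONS = {"fasta": "fna",
                  "tree": "tre",
                  "hmm": "hmm",
                  "model": "model",
                  "raxml_info": "raxml_info"}


def ref_files_for(domain, project_dir):
    """Return the default reference file paths for one domain.

    Hypothetical helper; `domain` must be "bac" or "arc".
    """
    if domain not in DOMAIN_LAYOUT:
        raise ValueError("unknown domain: %r" % (domain,))
    subdir, prefix = DOMAIN_LAYOUT[domain]
    ref_dir = path.join(project_dir, "default_files", subdir, prefix)
    return {kind: path.join(ref_dir, prefix + "." + ext)
            for kind, ext in REF_EXTENSIONS.items()}


if __name__ == "__main__":
    for kind, file_path in ref_files_for("arc", "picrust2").items():
        print(kind, file_path)
```

The commit itself keeps explicit `_bac` and `_arc` module-level names; a lookup like this is only one alternative way to express the same layout.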
diff --git a/picrust2/default_files/archaea/16S.txt.gz b/picrust2/default_files/archaea/16S.txt.gz