v.0.1.19 updated GTDB-tk and db version, added --min_mapped_signature…

…_genes and --min_samples and other smaller changes
Russel88 · Jan 4, 2024 · a99b263 · a99b263
1 parent 11aec32
commit a99b263
Show file tree

Hide file tree

Showing 16 changed files with 73 additions and 45 deletions.
diff --git a/README.md b/README.md
@@ -23,10 +23,10 @@ conda activate maginator
 pip install maginator
 ```
 
-Furthermore, MAGinator also needs the GTDB-tk database version R207_v2 downloaded. If you don't already have it, you can run the following:
+Furthermore, MAGinator also needs the GTDB-tk database downloaded. Here we download release 214. If you don't already have it, you can run the following:
 ```sh
-wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
-tar xvzf gtdbtk_v2_data.tar.gz
+wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz
+tar xvzf *.tar.gz
 ```
 
 ## Usage
@@ -39,7 +39,7 @@ MAGinator needs 3 input files:
 
 Run MAGinator:
 ```sh
-maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release207_v2/"
+maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release214/"
 ```
 
 A testset can be found in the test_data directory. 
@@ -65,11 +65,18 @@ A test set can be found in the maginator/test_data directory.
 4. Unzip the contigs.fasta.gz 
 5. Run MAGinator
 
-MAGinator has been run on the test data on a slurm server with the following command:
+MAGinator can been run on the test data on a slurm server with the following command:
 ```sh
-maginator --vamb_clusters clusters.tsv --reads reads.csv --contigs contigs.fasta --gtdb_db data/release207_v2/ --output test_out --cluster slurm --cluster_info "-n {cores} --mem {mem_gb}gb -t {runtime}" --max_mem 180
+maginator --vamb_clusters clusters.tsv --reads reads.csv --contigs contigs.fasta --gtdb_db data/release214/ --output test_out --cluster slurm --cluster_info "-n {cores} --mem {mem_gb}gb -t {runtime}" --max_mem 180
 ```
-The expected output can be found as a zipped file on Zenodo: https://doi.org/10.5281/zenodo.8279036
+The expected output can be found as a zipped file on Zenodo: https://doi.org/10.5281/zenodo.8279036. MAGinator has been run on the test data (using GTDB-tk db release207_v2) on a slurm server.
+
+On the compute cluster each job have had access to 180gb RAM, with the following time consumption: 
+real	72m27.379s
+user	0m18.830s
+sys	1m0.454s
+
+If you run on a smaller server you can set the parameters --max_cores and --max_mem.
 
 ## Recommended workflow 
 
@@ -120,11 +127,13 @@ This is what MAGinator does with your input (if you want to see all parameters r
     * Use --clustering_coverage to toggle the clustering coverage
     * Use --clustering_type to toggle whether to cluster on amino acid or nucleotide level
 * Map reads to the non-redundant gene catalogue and create a matrix with gene counts for each sample
-* Pick non-redundant genes that are only found in one MGS each
-* Fit signature gene model and use the resulting signature genes to get the abundance of each MGS
-* Prepare for generation of phylogenies for each MGS by finding outgroups and marker genes which will be used for rooting the phylogenies
+* Pick non-redundant genes that are only found in one MAG cluster each
+* Fit signature gene model and use the resulting signature genes to get the abundance of each MAG cluster
+    * Use --min_mapped_signature_genes to change minimum number of signature genes to be detected in the sample to be included in the analysis
+    * Use --min_samples to alter the number of samples with the MAG cluster present in order to perform signature gene refinement
+* Prepare for generation of phylogenies for each MAG cluster by finding outgroups and marker genes which will be used for rooting the phylogenies
 * Use the read mappings to collect SNV information for each signature gene and marker gene for each sample
-* Align signature and marker genes, concatenate alignments and infer phylogenetic trees for each MGS
+* Align signature and marker genes, concatenate alignments and infer phylogenetic trees for each MAG cluster
     * Use --phylo to toggle whether use fasttree (fast, approximate) or iqtree (slow, precise) to infer phylogenies
 * Infer the taxonomic scope of each gene cluster. That is, at what taxonomic level are genes from a given gene cluster found in
     * Use --tax_scope_threshold to toggle the threshold for how to find the taxonomic scope consensus
@@ -144,6 +153,7 @@ This is what MAGinator does with your input (if you want to see all parameters r
     * all_genes_cluster.tsv - Gene clusters
     * matrix/
         * gene_count_matrix.tsv - Read count for each gene cluster for each sample
+        * small_gene_count_matrix.tsv - Read count matrix only containing the genes, that does not cluster across MAG cluster
     * synteny/ - Intermediate files for synteny clustering of gene clusters
 * gtdbtk/
     * <cluster>/ - GTDB-tk taxonomic annotation for each VAMB cluster
@@ -152,19 +162,19 @@ This is what MAGinator does with your input (if you want to see all parameters r
     * bams/ - Bam files for mapping reads to gene clusters
 * phylo/
     * alignments/ - Alignments for each signature gene
-    * cluster_alignments/ - Concatenated alignments for each MGS
-    * pileup/ - SNV information for each MGS and each sample
-    * trees/ - Phylogenetic trees for each MGS
+    * cluster_alignments/ - Concatenated alignments for each MAG cluster
+    * pileup/ - SNV information for each MAG cluster and each sample
+    * trees/ - Phylogenetic trees for each MAG cluster
     * stats.tab - Mapping information such as non-N fraction, number of signature genes and marker genes, read depth, and number of bases not reaching allele frequency cutoff 
     * stats_genes.tab - Same as above but the information is split per gene
 * signature_genes/ 
     * \- R data files with signature gene optimization
-    * read-count_detected-genes.pdf - Figure for each MGS/cluster displaying number of identified SG's in each sample along with the number of reads mapped.
+    * read-count_detected-genes.pdf - Figure for each MAG cluster displaying number of identified SG's in each sample along with the number of reads mapped.
 * tabs/
     * gene_cluster_bins.tab - Table listing which bins each gene cluster was found in
     * gene_cluster_tax_scope.tab - Table listing the taxonomic scope of each gene cluster
-    * metagenomicspecies.tab - Table listing which, if any, clusters where merged in MGS and the taxonomy of those
-    * signature_genes_cluster.tsv - Table with the signature genes for each MGS/cluster
+    * metagenomicspecies.tab - Table listing which, if any, clusters where merged in MAG cluster and the taxonomy of those
+    * signature_genes_cluster.tsv - Table with the signature genes for each MAG cluster
     * synteny_clusters.tab - Table listing the synteny cluster association for the gene clusters. Gene clusters from the same synteny cluster are genomically adjacent.
-    * tax_matrix.tsv - Table with taxonomy information for MGS
+    * tax_matrix.tsv - Table with taxonomy information for MAG cluster
 
diff --git a/maginator/main.py b/maginator/main.py
@@ -69,6 +69,10 @@ def cli():
     app.add_argument('--clustering_type', help='Sequence type for gene clustering with MMseqs2. Nucleotide- or protein-level [%(default)s]', default='protein', type=str, choices=['nucleotide', 'protein'])
     app.add_argument('--min_gtdb_markers', help='Minimum GTDBtk marker genes shared between MGS and outgroup for rooting trees [%(default)s]', default=10, type=int)
     app.add_argument('--marker_gene_cluster_prevalence', help='Minimum prevalence of marker genes to be selected for rooting of MGS trees [%(default)s]', default=0.5, type=float)
+
+    app.add_argument('--min_mapped_signature_genes', help='Minimum number of signature genes with reads mapped for the sample to be included in the refinement [%(default)s]', default=3, type=int)
+    app.add_argument('--min_samples', help='Minimum number of samples containing the MAG cluster (more than "min_mapped_signature_genes" present) for the MAG cluster to identidy SG [%(default)s]', default=3, type=int)
+
     app.add_argument('--min_af', help='Minimim allele frequency for calling a base when creating phylogenies [%(default)s]', default=0.8, type=float)
     app.add_argument('--min_depth', help='Minimim read depth for calling a base when creating phylogenies [%(default)s]', default=2, type=int)
     app.add_argument('--min_nonN', help='Minimum fraction of non-N characters of a sample to be included in a phylogeny [%(default)s]', default=0.5, type=float)

diff --git a/maginator/workflow.py b/maginator/workflow.py
@@ -31,7 +31,7 @@ def run(self, snakefile):
         # Define core snakemake command
         cmd = ['snakemake',
                '--use-conda',
-               '--latency-wait', '80',
+               '--latency-wait', '150',
                '-s', snakefile,
                '--config',
                'wd='+self.output,

diff --git a/maginator/workflow/envs/filter_geneclusters.yaml b/maginator/workflow/envs/filter_geneclusters.yaml
@@ -4,5 +4,5 @@ channels:
 dependencies:
  - bbmap=38.96
  - bwa-mem2=2.2.1
- - samtools=1.10
+ - samtools=1.18
 
diff --git a/maginator/workflow/envs/filter_gtdbtk.yaml b/maginator/workflow/envs/filter_gtdbtk.yaml
@@ -2,11 +2,11 @@ channels:
  - bioconda
  - conda-forge
 dependencies:
- - gtdbtk=2.1
+ - gtdbtk=2.3.2
  - biopython=1.79
  - pandas=1.4
  - numpy=1.23.1
  - mmseqs2=13.45111
  - bbmap=38.96
  - bwa-mem2=2.2.1
- - samtools=1.10
+ - samtools=1.18
diff --git a/maginator/workflow/envs/phylo.yaml b/maginator/workflow/envs/phylo.yaml
@@ -5,7 +5,7 @@ channels:
 dependencies:
  - biopython=1.79
  - pandas=1.4
- - samtools=1.10
+ - samtools=1.18
  - fasttree=2
  - perl=5
  - mafft=7

diff --git a/maginator/workflow/filter.Snakefile b/maginator/workflow/filter.Snakefile
@@ -22,7 +22,7 @@ mem = math.ceil(n_contigs/1000000)*30
 if mem > int(param_dict['max_mem']):
     mem = int(param_dict['max_mem'])
 ## time is 1 hour per million
-tim = str(math.ceil(n_contigs/1000000)*60*60) # runtime in seconds
+tim = str(max(1800, min(math.ceil(n_contigs/100000)*60*60, 72000))) # runtime in seconds, max 20h, min 30min
 
 rule all:
     input:

diff --git a/maginator/workflow/gtdbtk.Snakefile b/maginator/workflow/gtdbtk.Snakefile
@@ -26,14 +26,14 @@ wildcard_constraints:
 
 rule all:
     input:
-        expand(os.path.join(WD, 'gtdbtk', '{cluster}/classify/gtdbtk.bac120.summary.tsv'), cluster=CLUSTERS)
+        expand(os.path.join(WD, 'gtdbtk', '{cluster}/gtdbtk.done'), cluster=CLUSTERS)
 
 rule gtdbtk:
     input:
-        os.path.join(F_DIR, '{cluster}/')
+        ancient(os.path.join(F_DIR, '{cluster}/'))
     output:
         directory(os.path.join(WD, 'gtdbtk', '{cluster}')),
-        os.path.join(WD, 'gtdbtk', '{cluster}/classify/gtdbtk.bac120.summary.tsv')
+        os.path.join(WD, 'gtdbtk', '{cluster}/gtdbtk.done')
     conda:
         "envs/filter_gtdbtk.yaml"
     params:
@@ -45,5 +45,8 @@ rule gtdbtk:
     shell:
         '''
         export GTDBTK_DATA_PATH={params.gtdbtk};
-        gtdbtk classify_wf --genome_dir {input} --out_dir {output[0]} --cpus {resources.cores} --extension fa --keep_intermediates || true
+        gtdbtk classify_wf --genome_dir {input} --out_dir {output[0]} --cpus {resources.cores} --extension fa --keep_intermediates --skip_ani_screen || true
+        if [[ -f {output[0]}/classify/gtdbtk.bac120.summary.tsv || -f {output[0]}/classify/gtdbtk.ar53.summary.tsv ]]; then
+            touch {output[1]}
+        fi
         '''
diff --git a/maginator/workflow/prescreening_genes.Snakefile b/maginator/workflow/prescreening_genes.Snakefile
@@ -83,6 +83,8 @@ rule prescreening_genes:
         gene_lengths = os.path.join(WD, 'signature_genes', 'gene_lengths.RDS')
     output:
         clusters_sorted = os.path.join(WD, 'signature_genes', 'clusters_sorted.RDS'),
+    params:
+        min_mapped_signature_genes = param_dict['min_mapped_signature_genes']
     conda:
         "envs/signature_genes.yaml"
     resources:

diff --git a/maginator/workflow/scripts/SG_refinement.R b/maginator/workflow/scripts/SG_refinement.R
@@ -20,7 +20,8 @@ snakemake@source(snakemake@params[["functions"]])
 n.genes <- 100
 
 #minimum number of mapped genes required
-n.mapped.minimum <- 3
+n.mapped.minimum <- as.integer(snakemake@params[["min_mapped_signature_genes"]])
+n.minimum.samples <- as.integer(snakemake@params[["min_samples"]])
 
 ids <- names(Clusterlist) #the ids of the MGS
 id <- tail(strsplit(strsplit(snakemake@input[["clusters_dir"]], ".RDS")[[1]], "/")[[1]], n=1)
@@ -59,12 +60,12 @@ run_one_id <- function(id){
 
   #colsum is the amount mapped genes in any given sample
   colsum <- colSums(Clusterlist[[id]][1:n.genes, ])
-  # extracting only the samples which contains above 3 mapped reads -- if there are 3 reads we believe this is a true detection
+  # extracting only the samples which contains above n.mapped.minimum mapped reads -- if there are above n.mapped.minimum reads we believe this is a true detection
   mapped <- colsum[colsum >= n.mapped.minimum]
 
   n_samples <- setNames(c(n_samples, sum(colsum>=n.mapped.minimum)), c(names(n_samples), id))
-  #if less than 3 samples are mapped to the MGS, the MGS should be skipped
-  if (length(mapped) < 3){
+  #if less than n.minimum.samples are mapped to the MGS, the MGS should be skipped
+  if (length(mapped) < n.minimum.samples){
 #    print("Not enough samples are mapped to the cluster")
     return()}
   genes <- names(Clusterlist[[id]][1:n.genes, 1])

diff --git a/maginator/workflow/scripts/abundance_profiles.R b/maginator/workflow/scripts/abundance_profiles.R
@@ -18,7 +18,7 @@ colnames(taxonomy) <- c("Cluster","Taxonomy")
 #setting important variables
 gene_index <- seq(1,length(GeneLengths))
 gene_names <- names(GeneLengths)
-n.mapped.minimum <- 3 #The number of genes that needs reads that map to count the cluster as present
+n.mapped.minimum <- as.integer(snakemake@params[["min_mapped_signature_genes"]]) #The number of genes that needs reads that map to count the cluster as present
 n.genes <- 100 # number of signature genes
 
 # inserting NA for the Clusters that do not have a annotation

diff --git a/maginator/workflow/scripts/gene_refinement_plots.R b/maginator/workflow/scripts/gene_refinement_plots.R
@@ -9,7 +9,7 @@ taxmat <- read.csv(snakemake@input[["tax_matrix"]], sep="\t", header=FALSE)
 
 # Initializing relevant parameters
 n_signature_genes_expected <- '95'
-minimum_sampels <- 3
+minimum_sampels <- as.integer(snakemake@params[["min_samples"]])
 n.genes <- 100
 
 #loading colors

diff --git a/maginator/workflow/scripts/parse_gtdbtk.py b/maginator/workflow/scripts/parse_gtdbtk.py
@@ -37,14 +37,16 @@ def most_common(ll):
 
     # Try reading both bacterial and archaeal summaries.
     try:
-        tax_bac = pd.read_csv(os.path.join(snakemake.input[0], clust, 'gtdbtk.bac120.summary.tsv'), sep='\t', header=0)  
+        tax_bac = pd.read_csv(glob.glob(os.path.join(snakemake.input[0], clust, 'gtdbtk.bac*.summary.tsv'))[0], sep='\t', header=0)  
         if tax_bac.iloc[0,1]=='Unclassified':
             tax_bac=None
-    except FileNotFoundError:
+        elif tax_bac.iloc[0,1]=='Unclassified Bacteria':
+            tax_bac=None
+    except (IndexError, FileNotFoundError):
         tax_bac = None
     try: 
-        tax_ar = pd.read_csv(os.path.join(snakemake.input[0], clust, 'gtdbtk.ar122.summary.tsv'), sep='\t', header=0)  
-    except FileNotFoundError:
+        tax_ar = pd.read_csv(glob.glob(os.path.join(snakemake.input[0], clust, 'gtdbtk.ar*.summary.tsv'))[0], sep='\t', header=0)  
+    except (IndexError, FileNotFoundError):
         tax_ar = None
 
     # Combine
@@ -59,7 +61,7 @@ def most_common(ll):
 
         # Remove unclassified
         classification = [x for x in classification if len(x) == 7]
-
+        
         # Traverse from species annotation and up
         # Pick the annotation if the most common annotation is above prevalence set by parameter
         level = 6
@@ -145,7 +147,7 @@ def most_common(ll):
 # Collect all unique markers for phylogenetic analyses
 def unique_gtdb(domain, output):
     with open(output, 'w') as wfh:
-        for f in glob.glob(os.path.join(snakemake.input[0], '*', 'identify', 'gtdbtk.'+domain+'.markers_summary.tsv')):
+        for f in glob.glob(os.path.join(snakemake.input[0], '*', 'identify', 'gtdbtk.'+domain+'*.markers_summary.tsv')):
             with open(f, 'r') as rfh:
                 nl = 0
                 for ll in rfh:
@@ -154,5 +156,5 @@ def unique_gtdb(domain, output):
                         wfh.write(line[0]+'\t'+line[5]+'\n')
                     nl += 1
 
-unique_gtdb('bac120', snakemake.output[3])
-unique_gtdb('ar122', snakemake.output[4])
+unique_gtdb('bac', snakemake.output[3])
+unique_gtdb('ar', snakemake.output[4])
diff --git a/maginator/workflow/scripts/prescreening_genes.R b/maginator/workflow/scripts/prescreening_genes.R
@@ -10,7 +10,7 @@ GeneLengths <- readRDS(snakemake@input[["gene_lengths"]])
 n.genes <- 100
 
 #minimum number of mapped genes required
-n.mapped.minimum <- 3
+n.mapped.minimum <- as.integer(snakemake@params[["min_mapped_signature_genes"]])
 
 ids <- names(Clusterlist) #the ids of the MGS
 

diff --git a/maginator/workflow/signature_genes.Snakefile b/maginator/workflow/signature_genes.Snakefile
@@ -30,7 +30,9 @@ rule refinement:
     conda:
         "envs/signature_genes.yaml"
     params:
-        functions = "Functions_v4.R"
+        functions = "Functions_v4.R",
+        min_mapped_signature_genes=param_dict['min_mapped_signature_genes'],
+        min_samples = param_dict['min_samples']
     resources:
         cores = 1,
         mem_gb = 12,
@@ -69,6 +71,8 @@ rule abundance_profile:
         physeq_abundance = os.path.join(WD, 'abundance', 'abundance_phyloseq.RData'),
         tax_matrix = os.path.join(WD, 'tabs', 'tax_matrix.tsv'),
         sg_cluster = os.path.join(WD, 'tabs', 'signature_genes_cluster.tsv')
+    params:
+        min_mapped_signature_genes=param_dict['min_mapped_signature_genes']
     conda:
         "envs/signature_genes.yaml"
     resources:
@@ -89,6 +93,8 @@ rule gene_refinement_plots:
         tax_matrix = os.path.join(WD, 'tabs', 'tax_matrix.tsv')
     output:
         plot_pdf = os.path.join(WD, 'signature_genes', 'read-count_detected-genes.pdf')
+    params:
+        min_samples=param_dict['min_samples']
     conda:
         "envs/signature_genes.yaml"
     resources:

diff --git a/setup.py b/setup.py
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name="maginator", 
-    version="0.1.17",
+    version="0.1.19",
     author="Jakob Russel & Trine Zachariasen",
     author_email="russel2620@gmail.com,trine_zachariasen@hotmail.com",
     description="MAGinator: Abundance, strain, and functional profiling of MAGs",