Skip to content

Commit

Permalink
v.0.1.19 updated GTDB-tk and db version, added --min_mapped_signature…
Browse files Browse the repository at this point in the history
…_genes and --min_samples and other smaller changes
  • Loading branch information
trinezac committed Jan 4, 2024
1 parent 11aec32 commit a99b263
Show file tree
Hide file tree
Showing 16 changed files with 73 additions and 45 deletions.
46 changes: 28 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,10 +23,10 @@ conda activate maginator
pip install maginator
```

Furthermore, MAGinator also needs the GTDB-tk database version R207_v2 downloaded. If you don't already have it, you can run the following:
Furthermore, MAGinator also needs the GTDB-tk database downloaded. Here we download release 214. If you don't already have it, you can run the following:
```sh
wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
tar xvzf gtdbtk_v2_data.tar.gz
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz
tar xvzf *.tar.gz
```

## Usage
Expand All @@ -39,7 +39,7 @@ MAGinator needs 3 input files:

Run MAGinator:
```sh
maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release207_v2/"
maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release214/"
```

A testset can be found in the test_data directory.
Expand All @@ -65,11 +65,18 @@ A test set can be found in the maginator/test_data directory.
4. Unzip the contigs.fasta.gz
5. Run MAGinator

MAGinator has been run on the test data on a slurm server with the following command:
MAGinator can been run on the test data on a slurm server with the following command:
```sh
maginator --vamb_clusters clusters.tsv --reads reads.csv --contigs contigs.fasta --gtdb_db data/release207_v2/ --output test_out --cluster slurm --cluster_info "-n {cores} --mem {mem_gb}gb -t {runtime}" --max_mem 180
maginator --vamb_clusters clusters.tsv --reads reads.csv --contigs contigs.fasta --gtdb_db data/release214/ --output test_out --cluster slurm --cluster_info "-n {cores} --mem {mem_gb}gb -t {runtime}" --max_mem 180
```
The expected output can be found as a zipped file on Zenodo: https://doi.org/10.5281/zenodo.8279036
The expected output can be found as a zipped file on Zenodo: https://doi.org/10.5281/zenodo.8279036. MAGinator has been run on the test data (using GTDB-tk db release207_v2) on a slurm server.

On the compute cluster each job have had access to 180gb RAM, with the following time consumption:
real 72m27.379s
user 0m18.830s
sys 1m0.454s

If you run on a smaller server you can set the parameters --max_cores and --max_mem.

## Recommended workflow

Expand Down Expand Up @@ -120,11 +127,13 @@ This is what MAGinator does with your input (if you want to see all parameters r
* Use --clustering_coverage to toggle the clustering coverage
* Use --clustering_type to toggle whether to cluster on amino acid or nucleotide level
* Map reads to the non-redundant gene catalogue and create a matrix with gene counts for each sample
* Pick non-redundant genes that are only found in one MGS each
* Fit signature gene model and use the resulting signature genes to get the abundance of each MGS
* Prepare for generation of phylogenies for each MGS by finding outgroups and marker genes which will be used for rooting the phylogenies
* Pick non-redundant genes that are only found in one MAG cluster each
* Fit signature gene model and use the resulting signature genes to get the abundance of each MAG cluster
* Use --min_mapped_signature_genes to change minimum number of signature genes to be detected in the sample to be included in the analysis
* Use --min_samples to alter the number of samples with the MAG cluster present in order to perform signature gene refinement
* Prepare for generation of phylogenies for each MAG cluster by finding outgroups and marker genes which will be used for rooting the phylogenies
* Use the read mappings to collect SNV information for each signature gene and marker gene for each sample
* Align signature and marker genes, concatenate alignments and infer phylogenetic trees for each MGS
* Align signature and marker genes, concatenate alignments and infer phylogenetic trees for each MAG cluster
* Use --phylo to toggle whether use fasttree (fast, approximate) or iqtree (slow, precise) to infer phylogenies
* Infer the taxonomic scope of each gene cluster. That is, at what taxonomic level are genes from a given gene cluster found in
* Use --tax_scope_threshold to toggle the threshold for how to find the taxonomic scope consensus
Expand All @@ -144,6 +153,7 @@ This is what MAGinator does with your input (if you want to see all parameters r
* all_genes_cluster.tsv - Gene clusters
* matrix/
* gene_count_matrix.tsv - Read count for each gene cluster for each sample
* small_gene_count_matrix.tsv - Read count matrix only containing the genes, that does not cluster across MAG cluster
* synteny/ - Intermediate files for synteny clustering of gene clusters
* gtdbtk/
* <cluster>/ - GTDB-tk taxonomic annotation for each VAMB cluster
Expand All @@ -152,19 +162,19 @@ This is what MAGinator does with your input (if you want to see all parameters r
* bams/ - Bam files for mapping reads to gene clusters
* phylo/
* alignments/ - Alignments for each signature gene
* cluster_alignments/ - Concatenated alignments for each MGS
* pileup/ - SNV information for each MGS and each sample
* trees/ - Phylogenetic trees for each MGS
* cluster_alignments/ - Concatenated alignments for each MAG cluster
* pileup/ - SNV information for each MAG cluster and each sample
* trees/ - Phylogenetic trees for each MAG cluster
* stats.tab - Mapping information such as non-N fraction, number of signature genes and marker genes, read depth, and number of bases not reaching allele frequency cutoff
* stats_genes.tab - Same as above but the information is split per gene
* signature_genes/
* \- R data files with signature gene optimization
* read-count_detected-genes.pdf - Figure for each MGS/cluster displaying number of identified SG's in each sample along with the number of reads mapped.
* read-count_detected-genes.pdf - Figure for each MAG cluster displaying number of identified SG's in each sample along with the number of reads mapped.
* tabs/
* gene_cluster_bins.tab - Table listing which bins each gene cluster was found in
* gene_cluster_tax_scope.tab - Table listing the taxonomic scope of each gene cluster
* metagenomicspecies.tab - Table listing which, if any, clusters where merged in MGS and the taxonomy of those
* signature_genes_cluster.tsv - Table with the signature genes for each MGS/cluster
* metagenomicspecies.tab - Table listing which, if any, clusters where merged in MAG cluster and the taxonomy of those
* signature_genes_cluster.tsv - Table with the signature genes for each MAG cluster
* synteny_clusters.tab - Table listing the synteny cluster association for the gene clusters. Gene clusters from the same synteny cluster are genomically adjacent.
* tax_matrix.tsv - Table with taxonomy information for MGS
* tax_matrix.tsv - Table with taxonomy information for MAG cluster

4 changes: 4 additions & 0 deletions maginator/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,10 @@ def cli():
app.add_argument('--clustering_type', help='Sequence type for gene clustering with MMseqs2. Nucleotide- or protein-level [%(default)s]', default='protein', type=str, choices=['nucleotide', 'protein'])
app.add_argument('--min_gtdb_markers', help='Minimum GTDBtk marker genes shared between MGS and outgroup for rooting trees [%(default)s]', default=10, type=int)
app.add_argument('--marker_gene_cluster_prevalence', help='Minimum prevalence of marker genes to be selected for rooting of MGS trees [%(default)s]', default=0.5, type=float)

app.add_argument('--min_mapped_signature_genes', help='Minimum number of signature genes with reads mapped for the sample to be included in the refinement [%(default)s]', default=3, type=int)
app.add_argument('--min_samples', help='Minimum number of samples containing the MAG cluster (more than "min_mapped_signature_genes" present) for the MAG cluster to identidy SG [%(default)s]', default=3, type=int)

app.add_argument('--min_af', help='Minimim allele frequency for calling a base when creating phylogenies [%(default)s]', default=0.8, type=float)
app.add_argument('--min_depth', help='Minimim read depth for calling a base when creating phylogenies [%(default)s]', default=2, type=int)
app.add_argument('--min_nonN', help='Minimum fraction of non-N characters of a sample to be included in a phylogeny [%(default)s]', default=0.5, type=float)
Expand Down
2 changes: 1 addition & 1 deletion maginator/workflow.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ def run(self, snakefile):
# Define core snakemake command
cmd = ['snakemake',
'--use-conda',
'--latency-wait', '80',
'--latency-wait', '150',
'-s', snakefile,
'--config',
'wd='+self.output,
Expand Down
2 changes: 1 addition & 1 deletion maginator/workflow/envs/filter_geneclusters.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,5 +4,5 @@ channels:
dependencies:
- bbmap=38.96
- bwa-mem2=2.2.1
- samtools=1.10
- samtools=1.18

4 changes: 2 additions & 2 deletions maginator/workflow/envs/filter_gtdbtk.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@ channels:
- bioconda
- conda-forge
dependencies:
- gtdbtk=2.1
- gtdbtk=2.3.2
- biopython=1.79
- pandas=1.4
- numpy=1.23.1
- mmseqs2=13.45111
- bbmap=38.96
- bwa-mem2=2.2.1
- samtools=1.10
- samtools=1.18
2 changes: 1 addition & 1 deletion maginator/workflow/envs/phylo.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ channels:
dependencies:
- biopython=1.79
- pandas=1.4
- samtools=1.10
- samtools=1.18
- fasttree=2
- perl=5
- mafft=7
Expand Down
2 changes: 1 addition & 1 deletion maginator/workflow/filter.Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ mem = math.ceil(n_contigs/1000000)*30
if mem > int(param_dict['max_mem']):
mem = int(param_dict['max_mem'])
## time is 1 hour per million
tim = str(math.ceil(n_contigs/1000000)*60*60) # runtime in seconds
tim = str(max(1800, min(math.ceil(n_contigs/100000)*60*60, 72000))) # runtime in seconds, max 20h, min 30min

rule all:
input:
Expand Down
11 changes: 7 additions & 4 deletions maginator/workflow/gtdbtk.Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -26,14 +26,14 @@ wildcard_constraints:

rule all:
input:
expand(os.path.join(WD, 'gtdbtk', '{cluster}/classify/gtdbtk.bac120.summary.tsv'), cluster=CLUSTERS)
expand(os.path.join(WD, 'gtdbtk', '{cluster}/gtdbtk.done'), cluster=CLUSTERS)

rule gtdbtk:
input:
os.path.join(F_DIR, '{cluster}/')
ancient(os.path.join(F_DIR, '{cluster}/'))
output:
directory(os.path.join(WD, 'gtdbtk', '{cluster}')),
os.path.join(WD, 'gtdbtk', '{cluster}/classify/gtdbtk.bac120.summary.tsv')
os.path.join(WD, 'gtdbtk', '{cluster}/gtdbtk.done')
conda:
"envs/filter_gtdbtk.yaml"
params:
Expand All @@ -45,5 +45,8 @@ rule gtdbtk:
shell:
'''
export GTDBTK_DATA_PATH={params.gtdbtk};
gtdbtk classify_wf --genome_dir {input} --out_dir {output[0]} --cpus {resources.cores} --extension fa --keep_intermediates || true
gtdbtk classify_wf --genome_dir {input} --out_dir {output[0]} --cpus {resources.cores} --extension fa --keep_intermediates --skip_ani_screen || true
if [[ -f {output[0]}/classify/gtdbtk.bac120.summary.tsv || -f {output[0]}/classify/gtdbtk.ar53.summary.tsv ]]; then
touch {output[1]}
fi
'''
2 changes: 2 additions & 0 deletions maginator/workflow/prescreening_genes.Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,8 @@ rule prescreening_genes:
gene_lengths = os.path.join(WD, 'signature_genes', 'gene_lengths.RDS')
output:
clusters_sorted = os.path.join(WD, 'signature_genes', 'clusters_sorted.RDS'),
params:
min_mapped_signature_genes = param_dict['min_mapped_signature_genes']
conda:
"envs/signature_genes.yaml"
resources:
Expand Down
9 changes: 5 additions & 4 deletions maginator/workflow/scripts/SG_refinement.R
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,8 @@ snakemake@source(snakemake@params[["functions"]])
n.genes <- 100

#minimum number of mapped genes required
n.mapped.minimum <- 3
n.mapped.minimum <- as.integer(snakemake@params[["min_mapped_signature_genes"]])
n.minimum.samples <- as.integer(snakemake@params[["min_samples"]])

ids <- names(Clusterlist) #the ids of the MGS
id <- tail(strsplit(strsplit(snakemake@input[["clusters_dir"]], ".RDS")[[1]], "/")[[1]], n=1)
Expand Down Expand Up @@ -59,12 +60,12 @@ run_one_id <- function(id){

#colsum is the amount mapped genes in any given sample
colsum <- colSums(Clusterlist[[id]][1:n.genes, ])
# extracting only the samples which contains above 3 mapped reads -- if there are 3 reads we believe this is a true detection
# extracting only the samples which contains above n.mapped.minimum mapped reads -- if there are above n.mapped.minimum reads we believe this is a true detection
mapped <- colsum[colsum >= n.mapped.minimum]

n_samples <- setNames(c(n_samples, sum(colsum>=n.mapped.minimum)), c(names(n_samples), id))
#if less than 3 samples are mapped to the MGS, the MGS should be skipped
if (length(mapped) < 3){
#if less than n.minimum.samples are mapped to the MGS, the MGS should be skipped
if (length(mapped) < n.minimum.samples){
# print("Not enough samples are mapped to the cluster")
return()}
genes <- names(Clusterlist[[id]][1:n.genes, 1])
Expand Down
2 changes: 1 addition & 1 deletion maginator/workflow/scripts/abundance_profiles.R
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ colnames(taxonomy) <- c("Cluster","Taxonomy")
#setting important variables
gene_index <- seq(1,length(GeneLengths))
gene_names <- names(GeneLengths)
n.mapped.minimum <- 3 #The number of genes that needs reads that map to count the cluster as present
n.mapped.minimum <- as.integer(snakemake@params[["min_mapped_signature_genes"]]) #The number of genes that needs reads that map to count the cluster as present
n.genes <- 100 # number of signature genes

# inserting NA for the Clusters that do not have a annotation
Expand Down
2 changes: 1 addition & 1 deletion maginator/workflow/scripts/gene_refinement_plots.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ taxmat <- read.csv(snakemake@input[["tax_matrix"]], sep="\t", header=FALSE)

# Initializing relevant parameters
n_signature_genes_expected <- '95'
minimum_sampels <- 3
minimum_sampels <- as.integer(snakemake@params[["min_samples"]])
n.genes <- 100

#loading colors
Expand Down
18 changes: 10 additions & 8 deletions maginator/workflow/scripts/parse_gtdbtk.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,14 +37,16 @@ def most_common(ll):

# Try reading both bacterial and archaeal summaries.
try:
tax_bac = pd.read_csv(os.path.join(snakemake.input[0], clust, 'gtdbtk.bac120.summary.tsv'), sep='\t', header=0)
tax_bac = pd.read_csv(glob.glob(os.path.join(snakemake.input[0], clust, 'gtdbtk.bac*.summary.tsv'))[0], sep='\t', header=0)
if tax_bac.iloc[0,1]=='Unclassified':
tax_bac=None
except FileNotFoundError:
elif tax_bac.iloc[0,1]=='Unclassified Bacteria':
tax_bac=None
except (IndexError, FileNotFoundError):
tax_bac = None
try:
tax_ar = pd.read_csv(os.path.join(snakemake.input[0], clust, 'gtdbtk.ar122.summary.tsv'), sep='\t', header=0)
except FileNotFoundError:
tax_ar = pd.read_csv(glob.glob(os.path.join(snakemake.input[0], clust, 'gtdbtk.ar*.summary.tsv'))[0], sep='\t', header=0)
except (IndexError, FileNotFoundError):
tax_ar = None

# Combine
Expand All @@ -59,7 +61,7 @@ def most_common(ll):

# Remove unclassified
classification = [x for x in classification if len(x) == 7]

# Traverse from species annotation and up
# Pick the annotation if the most common annotation is above prevalence set by parameter
level = 6
Expand Down Expand Up @@ -145,7 +147,7 @@ def most_common(ll):
# Collect all unique markers for phylogenetic analyses
def unique_gtdb(domain, output):
with open(output, 'w') as wfh:
for f in glob.glob(os.path.join(snakemake.input[0], '*', 'identify', 'gtdbtk.'+domain+'.markers_summary.tsv')):
for f in glob.glob(os.path.join(snakemake.input[0], '*', 'identify', 'gtdbtk.'+domain+'*.markers_summary.tsv')):
with open(f, 'r') as rfh:
nl = 0
for ll in rfh:
Expand All @@ -154,5 +156,5 @@ def unique_gtdb(domain, output):
wfh.write(line[0]+'\t'+line[5]+'\n')
nl += 1

unique_gtdb('bac120', snakemake.output[3])
unique_gtdb('ar122', snakemake.output[4])
unique_gtdb('bac', snakemake.output[3])
unique_gtdb('ar', snakemake.output[4])
2 changes: 1 addition & 1 deletion maginator/workflow/scripts/prescreening_genes.R
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ GeneLengths <- readRDS(snakemake@input[["gene_lengths"]])
n.genes <- 100

#minimum number of mapped genes required
n.mapped.minimum <- 3
n.mapped.minimum <- as.integer(snakemake@params[["min_mapped_signature_genes"]])

ids <- names(Clusterlist) #the ids of the MGS

Expand Down
8 changes: 7 additions & 1 deletion maginator/workflow/signature_genes.Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,9 @@ rule refinement:
conda:
"envs/signature_genes.yaml"
params:
functions = "Functions_v4.R"
functions = "Functions_v4.R",
min_mapped_signature_genes=param_dict['min_mapped_signature_genes'],
min_samples = param_dict['min_samples']
resources:
cores = 1,
mem_gb = 12,
Expand Down Expand Up @@ -69,6 +71,8 @@ rule abundance_profile:
physeq_abundance = os.path.join(WD, 'abundance', 'abundance_phyloseq.RData'),
tax_matrix = os.path.join(WD, 'tabs', 'tax_matrix.tsv'),
sg_cluster = os.path.join(WD, 'tabs', 'signature_genes_cluster.tsv')
params:
min_mapped_signature_genes=param_dict['min_mapped_signature_genes']
conda:
"envs/signature_genes.yaml"
resources:
Expand All @@ -89,6 +93,8 @@ rule gene_refinement_plots:
tax_matrix = os.path.join(WD, 'tabs', 'tax_matrix.tsv')
output:
plot_pdf = os.path.join(WD, 'signature_genes', 'read-count_detected-genes.pdf')
params:
min_samples=param_dict['min_samples']
conda:
"envs/signature_genes.yaml"
resources:
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@

setuptools.setup(
name="maginator",
version="0.1.17",
version="0.1.19",
author="Jakob Russel & Trine Zachariasen",
author_email="russel2620@gmail.com,trine_zachariasen@hotmail.com",
description="MAGinator: Abundance, strain, and functional profiling of MAGs",
Expand Down

0 comments on commit a99b263

Please sign in to comment.