# 🦠 SporeFlow: 16S, ITS and TEF1 metataxonomics pipeline

🎉 Exciting news! 🎉 This workflow now supports the use of Translation Elongation Factor 1 alpha (TEF1) as a marker gene for the filamentous fungal genus Fusarium.
SporeFlow (Snakemake Pipeline For Metataxonomics Workflows) is a pipeline for metataxonomic analysis of fungal ITS, Fusarium TEF1 and bacterial 16S amplicons using QIIME 2 and Snakemake.

More information on the use of TEF1 for Fusarium can be found at https://github.com/SergioAlias/fusariumid-train.
🐍 This workflow uses Snakemake 7.32.4. Newer versions (8+) contain backwards-incompatible changes that may prevent this pipeline from working on a Slurm HPC queue system.
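If you prefer to manage the environment yourself rather than through `init_sporeflow.sh`, a minimal Conda environment file pinning the supported Snakemake version might look like the sketch below. The environment name and channel list are illustrative, not taken from the repository:

```yaml
# environment.yml (illustrative) - pins Snakemake to 7.32.4 to avoid the
# backwards-incompatible changes introduced in Snakemake 8+
name: sporeflow
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake=7.32.4
```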
What SporeFlow does:

- Run FastQC on the raw FASTQ files (rule `fastqc_before`)
- Run Cutadapt on the raw FASTQ files (rule `cutadapt`)
- Run FastQC on the trimmed FASTQ files (rule `fastqc_after`)
- Aggregate QC results (FastQC before trimming, Cutadapt, FastQC after trimming) with MultiQC (rule `multiqc`)
- Create the manifest file for QIIME 2 (rule `create_manifest`)
- Import FASTQ files into QIIME 2 (rule `import_fastq`)
- Trim ITS sequences in QIIME 2 with the ITSxpress plugin (rule `itsxpress`)
- Denoise, dereplicate, remove chimeras and merge sequences in QIIME 2 with the DADA2 plugin (rule `dada2`)
- Perform taxonomic classification in QIIME 2 with the feature-classifier plugin (rule `taxonomy`)
- Perform diversity analysis in QIIME 2 with the diversity plugin (rule `diversity`)
- Perform differential abundance analysis in QIIME 2 with the composition plugin (rule `abundance`)
There are some additional steps that adapt results between the main steps; we will not worry about those for now.
The only prerequisite is having Conda installed. We highly recommend installing Miniconda and then Mamba (used by default by Snakemake) for a lightweight and fast experience.
- Clone the repository.
- Create a Screen (see section Immediate submit and Screen).
- Run `source init_sporeflow.sh` to download (if needed) and activate the SporeFlow environment, and to set aliases for the main functions.
- Create links to your original FASTQ files (with `ln -s`) that match the format `[sample_name]_R1.fastq.gz` / `[sample_name]_R2.fastq.gz` (the workflow only accepts paired-end sequencing for now).
- Edit `metadata.tsv` with your sample metadata.
- For differential abundance, edit `abundance.tsv` with the comparisons you want to perform, based on fields and values included in `metadata.tsv`.
- Edit `config/config.yml` with your experiment details. Variables annotated with `#cluster#` must also be updated in `config/cluster_config.yml`.
- If needed, modify the `time`, `ncpus` and `memory` variables in `config/cluster_config.yml`.
- Classifier setup:
  - Fungi (ITS): download a UNITE classifier in QIIME 2 format from https://github.com/colinbrislawn/unite-train/releases. We recommend using one of the following (remember to change the name accordingly in `config/config.yml`):
    - `unite_ver10_dynamic_all_04.04.2024-Q2-2024.2.qza`
    - `unite_ver10_99_all_04.04.2024-Q2-2024.2.qza`
  - Fungi, Fusarium (TEF1): you can train your own classifier or download a pre-made one from https://github.com/SergioAlias/fusariumid-train.
  - Bacteria: download a SILVA classifier in QIIME 2 format from https://resources.qiime2.org/. We recommend using the SILVA 138 99% OTUs full-length sequences database (remember to change the name accordingly in `config/config.yml`).
- Run `sf_run` to start the workflow. You can also run it up to certain key steps (using `--until rule_name`) to check the results before continuing and to change parameters if necessary (recommended). For example, a possible workflow split could be (see Drawing DAGs and rule graphs for a visual workflow including all rule names):
```shell
sf_run --until multiqc   # quality control and possible primer trimming
sf_run --until dada2     # feature table construction
sf_run --until taxonomy  # taxonomic classification
sf_run                   # rest of the workflow

# Tip: add the flag -n to perform a dry run. You will see how many jobs
# would be executed without actually running the workflow.
# Example:
# sf_run --until multiqc -n
```
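To make the expected input layout concrete, here is a hedged sketch of the FASTQ-linking and metadata steps above: linking raw paired-end files into the `[sample_name]_R1.fastq.gz` / `[sample_name]_R2.fastq.gz` naming scheme and writing a minimal tab-separated `metadata.tsv`. All paths, sample names and metadata columns other than the sample identifier are hypothetical; use the fields your own comparisons need.

```shell
# Link original paired-end FASTQ files (hypothetical source paths) to names
# matching the [sample_name]_R1.fastq.gz / [sample_name]_R2.fastq.gz format
ln -s /data/run01/S1_L001_R1_001.fastq.gz sampleA_R1.fastq.gz
ln -s /data/run01/S1_L001_R2_001.fastq.gz sampleA_R2.fastq.gz

# Minimal tab-separated metadata.tsv; the "treatment" column is illustrative
printf 'sample-id\ttreatment\n' >  metadata.tsv
printf 'sampleA\tcontrol\n'     >> metadata.tsv
printf 'sampleB\ttreated\n'     >> metadata.tsv
```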
SporeFlow includes a command, `sf_immediate`, that automatically submits all jobs to Slurm, correctly queued according to their dependencies. This is desirable when, for example, the allowed runtime on the cluster login machine is very short, because the scheduler may kill Snakemake in the middle of the workflow. If your HPC queue system only allows a limited number of jobs to be submitted at once, change that number in `init_sporeflow.sh` and source it again (that also applies to `sf_run`).
Please note that if the number of simultaneous jobs accepted by the queue system is lower than the total number of jobs you need to submit, the workflow will fail. In such cases, we highly recommend not using `sf_immediate`. Instead, use `sf_run` inside a Screen session. Screen is a terminal multiplexer that lets you create multiple virtual terminal sessions; it is installed by default on most Linux HPC systems.
To create a screen, use `screen -S sporeflow`, then follow the usage section from there. You can detach the screen with `Ctrl+a` and then `d`, and reattach it with `screen -r sporeflow`. For more details about Screen usage, please check this Gist.
Since SporeFlow is built on top of Snakemake, you can generate DAGs, rule graphs and file graphs of the workflow. We provide three commands for this: `sf_draw_dag`, `sf_draw_rulegraph` and `sf_draw_filegraph`. They create `dag.pdf`, `rulegraph.pdf` and `filegraph.pdf`, respectively, in the code directory.