A Snakemake workflow for phylogeographic analysis, made by Wanlin Li and Tahiri Nadia from the University of Sherbrooke (Quebec, Canada).
aPhyloGeo is a user-friendly, scalable, reproducible, and comprehensive workflow that can explore the correlation between specific genes (or gene segments) and environmental factors.
The workflow includes the following Python packages:
The workflow includes the following bioinformatics tools:
The software dependencies can be found in the conda environment files: [1] and [2].
1. Clone this repo.
git clone https://github.com/tahiri-lab/aPhyloGeo-pipeline.git
cd aPhyloGeo-pipeline
2. Install dependencies.
2.1 If Conda is not installed, install it as follows; otherwise, skip to the next step (2.2).
# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond with 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
2.2 Create a conda environment named aPhyloGeo and install all the dependencies in that environment.
# create a new environment with dependencies
conda env create -n aPhyloGeo -f environment.yaml
2.3 Activate the environment
conda activate aPhyloGeo
3. Configure the workflow.
- config file: config.yaml
  - analysis-specific settings (e.g., bootstrap_threshold, rf_threshold, step_size, window_size, data_type, etc.)

  Note: Set the parameters and thresholds in the config.yaml file according to your research needs by modifying the corresponding values. Do not change the parameter names or file names.

  - Thresholds in config.yaml:
    - bootstrap_threshold: Only sliding windows with bootstrap values greater than the user-set bootstrap_threshold (value from 0 to 1) will be written to the output file.
    - rf_threshold: The tree distance between each combination of sliding windows and environmental features will be calculated. Only sliding windows with a Robinson–Foulds (RF) distance below the user-set rf_threshold (value from 0 to 100) will be written to the output file.
  - params in config.yaml:
    - data_type: aa for the amino acid dataset (case insensitive); any other value set by the user will be treated as a nucleotide dataset (default).
    - step_size: the size of the sliding window movement step (bp)
    - window_size: the size of the sliding window (bp)
    - strategy: For constructing the phylogenetic tree, two alternative algorithms are provided, RAxML-NG and FastTree. Use fasttree for the FastTree strategy (case insensitive); any other value set by the user will be treated as the RAxML-NG strategy (default).
    - geo_file: the path of the input file (the environmental data, .csv)
    - seq_file: the path of the input file (the Multiple Sequence Alignment data, .fasta)
      Note: Give the input file paths as relative paths, relative to the aPhyloGeo-pipeline directory (i.e., the default present working directory should be the workflow).
    - specimen_id: the name of the column containing the sample id in geo_file
    - feature_names: the names of the columns corresponding to the environmental factors that will be involved in the analysis (in geo_file)
      Note: Each column name is on a separate line; don't forget to keep the "-" in front of it.
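To make the step_size and window_size settings concrete, here is a minimal Python sketch of how such parameters could cut an alignment into sliding windows, together with the case-insensitive defaulting described for data_type and strategy. It is not the pipeline's actual code: the function names (sliding_windows, normalize_choice) are hypothetical, and dropping a final window shorter than window_size is an assumption.

```python
def sliding_windows(alignment_length, window_size, step_size):
    """Return (start, end) index pairs (0-based, end exclusive) for each window.

    Assumption: a final window shorter than window_size is dropped; the
    real pipeline may treat the tail of the alignment differently.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = alignment_length - window_size
    return [(start, start + window_size)
            for start in range(0, last_start + 1, step_size)]


def normalize_choice(value, special, default):
    """Return `special` on a case-insensitive match, otherwise `default`."""
    return special if str(value).strip().lower() == special else default


if __name__ == "__main__":
    # Windows of 10 positions, moved 5 positions at a time, over 20 positions.
    print(sliding_windows(alignment_length=20, window_size=10, step_size=5))
    # "AA" matches "aa" regardless of case; anything else falls back to the default.
    print(normalize_choice("AA", "aa", "nucleotide"))
    print(normalize_choice("FastTree", "fasttree", "raxml-ng"))
```

With these example values, consecutive windows overlap by window_size minus step_size positions, so a smaller step_size yields more, and more heavily overlapping, windows to analyze.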
- input files:
  - example data files for protein analysis:
    - align_p.fa: Multiple Sequence Alignment for protein sequences in FASTA format (5 samples).
    - geo_p.csv: Environmental data corresponding to sequencing samples (5 samples).
  - example data files for nucleotide analysis:
- output files:
  - (filtered) sliding windows with Robinson–Foulds (RF) distance values below the user-set threshold and bootstrap values greater than the user-set threshold, in .csv (comma-separated values) files. The .csv files and related metadata will be stored in the 'results' directory.
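The filtering rule for the output files can be sketched in Python as follows. This is an illustrative assumption about the logic, not the pipeline's implementation: the record keys (window, bootstrap, rf) and the function name filter_windows are hypothetical, and the real output columns may differ.

```python
def filter_windows(records, bootstrap_threshold, rf_threshold):
    """Keep records with bootstrap above and RF distance below the thresholds.

    Each record is a dict with hypothetical keys:
      "window"    - label of the sliding window
      "bootstrap" - bootstrap support, on a 0 to 1 scale
      "rf"        - Robinson-Foulds distance, on a 0 to 100 scale
    """
    return [
        r for r in records
        if r["bootstrap"] > bootstrap_threshold and r["rf"] < rf_threshold
    ]


if __name__ == "__main__":
    example = [
        {"window": "0_10", "bootstrap": 0.95, "rf": 20.0},
        {"window": "5_15", "bootstrap": 0.40, "rf": 10.0},    # bootstrap too low
        {"window": "10_20", "bootstrap": 0.90, "rf": 100.0},  # RF distance too high
    ]
    kept = filter_windows(example, bootstrap_threshold=0.8, rf_threshold=60)
    print([r["window"] for r in kept])
```

Both comparisons are strict in this sketch, so a window whose value equals a threshold is not kept; whether the pipeline treats the boundary the same way is not stated here.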
4. Execute the workflow.
Locally
run workflow
# In a conda environment where all dependencies are already installed
# Specify the maximum number of CPU cores to be used at the same time.
# To use N cores: --cores N or -cN.
snakemake --cores all
Even if the conda environment from steps 2.2 and 2.3 has not been created and activated, the workflow can still be run successfully with '--use-conda'. Snakemake will create a temporary conda environment.
# To specify the maximum number of CPU cores to be used at the same time.
# With N cores: --cores N or -cN.
# For all cores in the system: --cores all.
snakemake --use-conda --cores all
Other features available
# 'dry' run: only checks the I/O files
snakemake -n
# 'dry' run that also prints the shell commands
snakemake -np
# Force snakemake to run the job. By default, snakemake will not run if it thinks the pipeline doesn't need updating
snakemake -F
A manuscript for aPhyloGeo-pipeline is in preparation.
Please email us at Nadia.Tahiri@USherbrooke.ca with any questions or feedback.