Snakemake workflow: aPhyloGeo

A Snakemake workflow for phylogeographic analysis, developed by Wanlin Li and Tahiri Nadia at the University of Sherbrooke (Quebec, Canada).

aPhyloGeo is a user-friendly, scalable, reproducible, and comprehensive workflow that can explore the correlation between specific genes (or gene segments) and environmental factors.

Dependencies

The workflow relies on Python packages and bioinformatics tools. These software dependencies are listed in the conda environment files: [1] and [2].

Usage

1. Clone this repo.

git clone https://github.com/tahiri-lab/aPhyloGeo-pipeline.git
cd aPhyloGeo-pipeline

2. Install dependencies.

2.1 If Conda is not installed, install it as follows; otherwise, skip to step 2.2.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh

# install Conda (answer 'yes' when prompted)
bash miniconda.sh

# update Conda
conda update -y conda

2.2 Create a conda environment named aPhyloGeo and install all the dependencies in that environment.

# create a new environment with dependencies 
conda env create -n aPhyloGeo -f environment.yaml

2.3 Activate the environment.

conda activate aPhyloGeo

3. Configure the workflow.

  • config file:

    • config.yaml - analysis-specific settings (e.g., bootstrap_threshold, rf_threshold, step_size, window_size, data_type)
      Note: Set the parameter and threshold values in config.yaml according to your research needs. Change only the values; do not rename parameters or files.
    • Thresholds in config.yaml:
      • bootstrap_threshold: Only sliding windows with bootstrap values greater than the user-set bootstrap_threshold (a value from 0 to 1) are written to the output file.
      • rf_threshold: The tree distance is calculated for each combination of sliding window and environmental feature. Only sliding windows with a Robinson–Foulds (RF) distance below the user-set rf_threshold (a value from 0 to 100) are written to the output file.
    • params in config.yaml:
      • data_type: set to aa for an amino acid dataset (case-insensitive); any other value is treated as a nucleotide dataset (default).
      • step_size: the step by which the sliding window advances (bp)
      • window_size: the size of the sliding window (bp)
      • strategy: two alternative algorithms are provided for constructing the phylogenetic tree, RAxML-NG and FastTree. Set to fasttree for the FastTree strategy (case-insensitive); any other value is treated as the RAxML-NG strategy (default).
      • geo_file: the path to the input environmental data file (.csv)
      • seq_file: the path to the input multiple sequence alignment file (.fasta)
        Note: Give input file paths relative to the aPhyloGeo-pipeline directory (i.e., the working directory should be the workflow directory).
      • specimen_id: the name of the column in geo_file that contains the sample IDs
      • feature_names: the names of the columns in geo_file that correspond to the environmental factors to include in the analysis
        Note: Put each column name on a separate line and keep the "-" in front of it.
  • input files:

    • example data files for protein analysis:
      • align_p.fa - Multiple Sequence Alignment for protein sequences in FASTA format (5 samples).
      • geo_p.csv - Environmental data corresponding to sequencing samples (5 samples).
    • example data files for nucleotide analysis:
      • align.fa - Multiple Sequence Alignment for nucleotide sequences in FASTA format (5 samples).
      • geo.csv - Environmental data corresponding to sequencing samples (5 samples).
  • output files:

    • Filtered sliding windows, i.e., those with a Robinson–Foulds (RF) distance below the user-set threshold and a bootstrap value above the user-set threshold, in .csv (comma-separated values) format.
    • The .csv files and related metadata are stored in the 'results' directory.
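
As a rough illustration of how window_size, step_size, bootstrap_threshold and rf_threshold interact, the sketch below enumerates sliding windows over an alignment and keeps those that pass both thresholds. It is not the pipeline's own code: the function names sliding_windows and passes_thresholds, the 0-based end-exclusive coordinates, the choice to drop a trailing partial window, and all example values are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    alignment_length: int, window_size: int, step_size: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for each full-length window (0-based, end-exclusive).

    Assumption: a trailing window shorter than window_size is dropped.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    for start in range(0, alignment_length - window_size + 1, step_size):
        yield start, start + window_size


def passes_thresholds(
    bootstrap: float,
    rf_distance: float,
    bootstrap_threshold: float,
    rf_threshold: float,
) -> bool:
    """Keep a window only if bootstrap is above and RF distance is below its threshold."""
    return bootstrap > bootstrap_threshold and rf_distance < rf_threshold


# Hypothetical alignment of 10 columns with window_size=4 and step_size=3.
windows: List[Tuple[int, int]] = list(sliding_windows(10, 4, 3))
print(windows)  # [(0, 4), (3, 7), (6, 10)]

# Made-up (bootstrap, RF distance) scores, one pair per window.
scores = [(0.95, 50.0), (0.40, 20.0), (0.80, 100.0)]
kept = [
    window
    for window, (bootstrap, rf) in zip(windows, scores)
    if passes_thresholds(bootstrap, rf, bootstrap_threshold=0.7, rf_threshold=60.0)
]
print(kept)  # [(0, 4)]
```

With these made-up scores, the second window fails the bootstrap threshold and the third fails the RF threshold, so only the first window would be reported.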

4. Execute the workflow.

Locally

Run the workflow

# In a conda environment where all dependencies are already installed
# Specify the maximum number of CPU cores to be used at the same time.
# To use N cores: --cores N or -cN.

snakemake --cores all

Even if the conda environment from steps 2.2 and 2.3 has not been created and activated, the workflow can still be run with '--use-conda'. Snakemake will then create the conda environment it needs.

# To specify the maximum number of CPU cores to be used at the same time. 
# 	With N cores: --cores N or -cN. 
# 	For all cores in the system: --cores all. 

snakemake --use-conda --cores all

Other features available

# Dry run: only checks the input/output files, without executing any job

snakemake -n

# Dry run that also prints the shell commands

snakemake -np

# Force Snakemake to rerun all jobs. By default, Snakemake runs nothing if it considers the outputs up to date.

snakemake -F --cores all
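
When the workflow is launched from a script, the options shown above can be assembled programmatically. The following is a minimal sketch, assuming Snakemake is installed and the script runs from the workflow directory; build_snakemake_command is a hypothetical helper, not part of aPhyloGeo, and it uses only the flags already shown in this section (--cores, --use-conda, -n, -p, -F).

```python
import shlex
from typing import List, Union


def build_snakemake_command(
    cores: Union[int, str] = "all",
    use_conda: bool = False,
    dry_run: bool = False,
    print_shell_commands: bool = False,
    force_all: bool = False,
) -> List[str]:
    """Assemble a snakemake command line (hypothetical helper for illustration)."""
    command = ["snakemake", "--cores", str(cores)]
    if use_conda:
        command.append("--use-conda")
    if dry_run:
        command.append("-n")
    if print_shell_commands:
        command.append("-p")
    if force_all:
        command.append("-F")
    return command


# Preview the jobs first, then prepare the real run.
preview = build_snakemake_command(dry_run=True, print_shell_commands=True)
full_run = build_snakemake_command(use_conda=True)
print(shlex.join(preview))   # snakemake --cores all -n -p
print(shlex.join(full_run))  # snakemake --cores all --use-conda

# To actually execute a command (requires Snakemake on PATH):
#   import subprocess
#   subprocess.run(full_run, check=True)
```

Returning the command as a list rather than a string lets it be passed to subprocess.run without shell quoting issues.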

Citation

A manuscript for aPhyloGeo-pipeline is in preparation.

Contact

Please email us at Nadia.Tahiri@USherbrooke.ca with any questions or feedback.