Snakemake implementation of protein structure alignment software DaliLite v5

Doudna-lab/snakedali

OVERVIEW

1 What is Snakedali?

Snakedali is a Snakemake implementation of the multithreaded version of DaliLite v5 that aligns PDB queries against a pre-built AlphaFold database. It is designed to run on high-performance computing (HPC) clusters and works with the SGE workload manager out of the box. It adds automated input handling and a unified report that aggregates all queries and hits in a single .xlsx file.

2 Citation

    Yoon, P.H., Zhang, Z., Loi, K.J., Adler, B.A., Lahiri, A., Vohra, K., Shi, H., Rabelo, D.B., Trinidad, M., Boger, R.S. and Al-Shimary, M.J., 2024. Structure-guided discovery of ancestral CRISPR-Cas13 ribonucleases. Science, p.eadq0553.

GETTING STARTED

3 Dependencies
4 Installation
    4.1 Database Download
      - With the AWS CLI installed (see Section 3.4), download the pre-built database:
      aws s3 cp s3://snakedali.db/pdb_files_DAT.tar.gz <your_local_path>
      tar zxf <your_local_path>/pdb_files_DAT.tar.gz
      
    4.2 Standard Git
      • Clone repository files
      git clone https://github.com/Doudna-lab/nidali.git
      
    4.3 Git LFS
      • Two Singularity/Apptainer containers are provided in this repository.

      • These are support files that are not integrated into the pipeline, but they can be useful for users who have trouble installing DaliLite on unsupported machines.

      • On cloning, these large files are only indexed, so they take up little storage.

      • Users who need the containerized version can then download them with Git LFS.

      • 1.1 Install Git LFS to pull apptainer containers

      -1.1.1 Linux Install

      apt install git-lfs
      git lfs install
      

      -1.1.2 macOS Install

      brew install git-lfs
      git lfs install
      

      -1.1.3 Pull apptainer containers

      git lfs pull
      
5 Snakedali Pipeline Setup
    5.1 Run Configuration
      • Each Snakedali run is customized through a configuration file based on the template config/dali_template.yaml
      • The template can be copied; each modified copy of the yaml file corresponds to one Snakedali run.
      • In the configuration file, users are expected to set:
        • input and output paths for the run
        • pre-built database path
        • query name(s)
        • default DaliLite v5 binary folder path
    5.2 Create Environments
      • Some steps of Snakedali rely on Anaconda environments.

      • Because some HPCs might not be compatible with Anaconda, the conda environments are set up directly in the Snakemake shell.

      • To do that, first create the conda environment:

        conda env create -f envs/biopympi.yaml

    5.3 Snakemake Profile
      • Snakedali was designed to work with the Sun Grid Engine (SGE) job scheduler.

      • The Snakemake profile can be modified to accommodate other schedulers: profile/config.yaml

      • The default profile includes:

        • cluster job submission: qsub -l h_rt={cluster.time} -j y -pe smp 4 -cwd
        • cluster config path: config/cluster.yaml
        • rerun triggers: mtime
        • n jobs limit: 600
        • latency-wait: 120
        • reason: True
        • rerun-incomplete: True
        • show-failed-logs: True
        • keep-going: True
        • printshellcmds: True
        • jobname: {rule}.{jobid}
        • jobs: 600
      • Make sure to adjust the parameters above according to the house rules of your HPC.
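
The `{cluster.time}` placeholder in the submission command above is filled per job from the cluster config file. The sketch below imitates that substitution in plain Python to show how a default and a per-rule override would combine. It is not Snakemake's own code, and the dictionary contents and the rule name `example_rule` are assumptions made for illustration, not values taken from config/cluster.yaml.

```python
from types import SimpleNamespace

# Submission template as listed in the default profile.
SUBMIT_TEMPLATE = "qsub -l h_rt={cluster.time} -j y -pe smp 4 -cwd"

# Hypothetical stand-in for config/cluster.yaml, written as a dict.
# "__default__" applies to every rule; a rule-named entry overrides it.
CLUSTER_CONFIG = {
    "__default__": {"time": "12:00:00"},
    "example_rule": {"time": "48:00:00"},  # hypothetical rule name
}


def submit_command(rule, cluster_config=CLUSTER_CONFIG, template=SUBMIT_TEMPLATE):
    """Return the qsub command a job of `rule` would be submitted with."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule, {}))
    # str.format resolves "{cluster.time}" via attribute access on `cluster`.
    return template.format(cluster=SimpleNamespace(**settings))


print(submit_command("example_rule"))
print(submit_command("another_rule"))
```

When adapting the profile to another scheduler, the same idea applies: only the template string and the keys it references need to change.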

6 Run Snakedali
    • Once the necessary inputs have been set in the configuration file, run Snakedali with:
    snakemake --snakefile snakedali.smk --configfile config/dali_template.yaml --profile profile/
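
Since each configuration file corresponds to one run, several runs can be launched in sequence from a small driver script. The following is a minimal sketch, assuming the command shown above; the config file names and the helper names `snakedali_command` and `run_all` are hypothetical. With `dry_run=True` it only prints the commands.

```python
import subprocess


def snakedali_command(configfile, snakefile="snakedali.smk", profile="profile/"):
    """Build the Snakedali invocation for one configuration file."""
    return [
        "snakemake",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--profile", profile,
    ]


def run_all(configfiles, dry_run=True):
    """Print (dry run) or execute one Snakedali run per configuration file."""
    commands = [snakedali_command(cfg) for cfg in configfiles]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first run that fails.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical per-run copies of config/dali_template.yaml.
run_all(["config/run_a.yaml", "config/run_b.yaml"], dry_run=True)
```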
    
7 DALI + TCOFFEE Integration
    - The DALI + TCOFFEE workflow is broken down into two parts:
      1. The first script, dali_out_to_fasta.py, is a Python script that takes a DALI.txt output (that is, the search results for a DALI query against a database, in DALI alignment format) and converts it into individual pairwise alignment files in FASTA format.
      2. The second script is a wrapper that calls the first script on an entire directory of DALI.txt output files, converting each into a directory of FASTA-format alignments. It then calls TCOFFEE to merge the FASTA-format alignments into multiple sequence alignments: one alignment per DALI.txt output (that is, per DALI query searched against a database), so that every query has its own alignment. Finally, the script invokes TCOFFEE once more to merge all of these alignments into one final multiple sequence alignment.
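
To make the intermediate data of the first part concrete, here is a rough sketch of writing one pairwise alignment as a two-record FASTA file, and of mapping each DALI.txt file to its own output directory. It is not the code of dali_out_to_fasta.py: parsing of the DALI format is left out, and the function names, identifiers, and directory layout are assumptions made for illustration.

```python
from pathlib import Path


def pairwise_fasta(query_id, query_aln, hit_id, hit_aln, width=60):
    """Format one query/hit alignment as FASTA text (gaps written as '-')."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have equal length")
    lines = []
    for name, seq in ((query_id, query_aln), (hit_id, hit_aln)):
        lines.append(f">{name}")
        # Fixed-width slicing keeps gap characters exactly where they are.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def plan_fasta_dirs(dali_files, out_root):
    """Map each DALI.txt file to a per-query directory of FASTA alignments."""
    return {Path(f): Path(out_root) / Path(f).stem for f in dali_files}


# Hypothetical identifiers and aligned sequences.
print(pairwise_fasta("query1", "MKT-LLV", "hit_0001", "MKSALLV"), end="")
print(plan_fasta_dirs(["results/query1.txt"], "fasta"))
```

Keeping one directory per query mirrors the description above, where the pairwise files for a single DALI.txt output are merged into one alignment before the final merge across queries.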
