TurakhiaLab/panman

Pangenome Mutation-Annotated Network (PanMAN)

Introduction

Here we provide an overview of PanMAN and panmanUtils, including installation methods and usage. For more information, please see our Wiki.

What is a PanMAN?

PanMAN, or Pangenome Mutation-Annotated Network, is a novel data representation for pangenomes that provides massive leaps in both representative power and storage efficiency. Specifically, PanMANs are composed of mutation-annotated trees, called PanMATs, which, in addition to substitutions, also annotate inferred indels (Fig. 1b) and even structural mutations (Fig. 1a) on the different branches. Multiple PanMATs are connected by edges to form a network, the PanMAN (Fig. 1c). PanMAN's representative power is compared against existing pangenomic formats in Fig. 1d. PanMANs are the most compressible pangenomic format for the different microbial datasets (SARS-CoV-2, RSV, HIV, Mycobacterium tuberculosis, E. coli, and Klebsiella pneumoniae), providing 2.9 to 559-fold compression over standard pangenomic formats.

Figure 1: Overview of the PanMAN data structure

panmanUtils

panmanUtils includes multiple algorithms to construct PanMANs and to support various functionalities to modify and extract useful information from PanMANs (Fig. 2).

Figure 2: Overview of panmanUtils' functionalities

Installation

panmanUtils software can be installed using four different methods:

  1. Conda (Recommended)
  2. Docker Image
  3. Dockerfile
  4. Installation scripts

1. Using conda (recommended)

Users can install panmanUtils by installing the panman conda package, which is compatible with linux-64 and osx-64. On modern Macs with Apple silicon (arm64), you also need to install Rosetta 2.

i. Dependencies

  1. Conda

ii. Install panman conda package

# Create and activate a new environment for panman
conda create -n panman-env python=3.11 -y
conda activate panman-env

# Set up channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# On macOS ARM:
# conda config --env --set subdir osx-64

# Install the panman package
conda install panman -y
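The commented-out `subdir` line above only applies to Apple silicon Macs. As a minimal sketch, the check below reports whether the current machine is one; the function name `needs_osx64_subdir` is hypothetical and not part of panman or conda.

```shell
# Hypothetical helper (not part of panman): succeeds only on macOS arm64,
# where the osx-64 subdir override from the step above is needed.
needs_osx64_subdir() {
    os="${1:-$(uname -s)}"     # optional override, defaults to this machine
    arch="${2:-$(uname -m)}"
    [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]
}

if needs_osx64_subdir; then
    echo "Apple silicon detected: run 'conda config --env --set subdir osx-64' in panman-env"
else
    echo "No subdir override needed on this platform"
fi
```

The two optional arguments exist only so the check can be exercised for another platform, for example `needs_osx64_subdir Darwin arm64`.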

iii. Run panmanUtils

panmanUtils --help

2. Using Docker Image

To use panmanUtils in a Docker container, create a container from the Docker image by following these steps (compatible with linux-64 and osx-64).

i. Dependencies

  1. Docker

ii. Pull and run the PanMAN Docker image from DockerHub

## Note: If the Docker image already exists locally, make sure to pull the latest version using
## docker pull swalia14/panman:latest

## If the Docker image does not exist locally, the following command will pull and run the latest version
docker run -it swalia14/panman:latest

iii. Run panmanUtils

# Inside the Docker container
panmanUtils --help

3. Using Dockerfile

A Docker container with panmanUtils preinstalled can also be built from the Dockerfile by following these steps (compatible with linux-64 and osx-64).

i. Dependencies

  1. Docker
  2. Git

ii. Clone the repository and build a docker image

git clone https://github.com/TurakhiaLab/panman.git
cd panman/docker
docker build -t panman .

iii. Build and run the docker container

docker run -it panman

iv. Run panmanUtils

# Inside the Docker container
panmanUtils --help

4. Using installation script (Least recommended)

We provide scripts to install panmanUtils from source code (requires sudo access, compatible with Linux only). Mac users can use the macOS-specific installation script, which uses conda to install panmanUtils.

i. Dependencies

  1. Git

ii. Clone the repository

git clone https://github.com/TurakhiaLab/panman.git
cd panman

iii. Run the installation script

chmod +x install/installationUbuntu.sh
./install/installationUbuntu.sh

iv. Run panmanUtils

cd build
./panmanUtils --help

PanMAN Construction

Once the package is installed, panmanUtils can construct a PanMAN from a PanGraph (or GFA or MSA) file together with a tree topology in Newick format. Here we provide examples for constructing PanMANs from a PanGraph (JSON) file and from a custom dataset. For other methods, follow the instructions in the wiki.

Step 1: Check that the sars_20.json and sars_20.nwk files exist in the test directory.
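The check in Step 1 can be scripted. This is a minimal sketch, assuming `PANMAN_HOME` points at the root of the cloned repository; the function name `check_panman_inputs` is hypothetical.

```shell
# Hypothetical helper: report whether both example inputs are present.
# Falls back to ./test when PANMAN_HOME is unset and no directory is given.
check_panman_inputs() {
    dir="${1:-${PANMAN_HOME:-.}/test}"
    missing=0
    for f in sars_20.json sars_20.nwk; do
        if [ -f "$dir/$f" ]; then
            echo "found: $dir/$f"
        else
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 only when both files exist.
    return "$missing"
}
```

Calling `check_panman_inputs "$PANMAN_HOME/test"` before the build step stops early if either file is absent.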

Step 2: Run panmanUtils with the following command to build a panman from PanGraph:

panmanUtils -P $PANMAN_HOME/test/sars_20.json -N $PANMAN_HOME/test/sars_20.nwk -o sars_20

The above command runs panmanUtils and builds sars_20.panman in the $PANMAN_HOME/build/panman directory.

Building PanMAN from raw sequences or fragment assemblies using Snakemake Workflow

We provide a Snakemake workflow to construct PanMANs from raw sequences (FASTA format) or from fragment assemblies.

Note: The Snakemake workflow uses PanGraph, PGGB, MAFFT, and MashTree to build the input PanGraph, GFA, MSA, and tree topology files, respectively. It is designed to be run inside the Docker container built from either the provided Docker image or the Dockerfile (see the Docker installation instructions above).

Building PanMAN from raw genome sequences

Step 1: Run the following command to construct a panman from raw sequences.

  • Usage
cd $PANMAN_HOME/workflows
snakemake --use-conda --cores 8 --config RUNTYPE="pangraph/gfa/msa" FASTA="[user_input]" SEQ_COUNT="Number of sequences" ASSEM="NONE" REF="NONE" TARGET="NONE"
  • Example
cd $PANMAN_HOME/workflows
snakemake --use-conda --cores 8 --config RUNTYPE="pangraph" FASTA="$PANMAN_HOME/test/sars_20.fa" SEQ_COUNT="20" ASSEM="NONE" REF="NONE" TARGET="NONE"
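The usage line takes exactly one of `pangraph`, `gfa`, or `msa` as `RUNTYPE` and a sequence count as `SEQ_COUNT`. The sketch below derives the count from the FASTA file and prints the resulting command without running it. Both function names are hypothetical, and it assumes `SEQ_COUNT` equals the number of `>` header lines in the FASTA file.

```shell
# Hypothetical helper: count FASTA records by counting '>' header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: validate RUNTYPE, then print (not run) the snakemake
# command from the usage line above, for raw-sequence input.
panman_snakemake_cmd() {
    runtype="$1"; fasta="$2"; seq_count="$3"
    case "$runtype" in
        pangraph|gfa|msa) ;;
        *) echo "RUNTYPE must be pangraph, gfa, or msa (got: $runtype)" >&2
           return 1 ;;
    esac
    printf 'snakemake --use-conda --cores 8 --config RUNTYPE="%s" FASTA="%s" SEQ_COUNT="%s" ASSEM="NONE" REF="NONE" TARGET="NONE"\n' \
        "$runtype" "$fasta" "$seq_count"
}
```

For the example dataset this could be used as `panman_snakemake_cmd pangraph "$PANMAN_HOME/test/sars_20.fa" "$(count_fasta_records "$PANMAN_HOME/test/sars_20.fa")"`, which lets you inspect the command before running it from `$PANMAN_HOME/workflows`.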

Building PanMAN from fragment assemblies

Step 1: Run the following command to construct a panman from fragment assemblies.

cd $PANMAN_HOME/workflows
snakemake --use-conda --cores 8 --config RUNTYPE="pangraph/gfa/msa" FASTA="None" SEQ_COUNT="Number of sequences" ASSEM="frag" REF="reference_file" TARGET="target.txt"

Here, target.txt includes a list of files that contain the fragmented assemblies.
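One possible way to produce target.txt is sketched below. The layout (one file path per line) and the `.fa` extension are assumptions, not something this README specifies, and `make_target_list` is a hypothetical name; check the wiki for the exact format the workflow expects.

```shell
# Hypothetical helper: write the paths of all .fa files under a directory
# to a list file, one path per line (assumed layout), in sorted order.
make_target_list() {
    dir="$1"
    out="${2:-target.txt}"
    find "$dir" -type f -name '*.fa' | sort > "$out"
    # Fail when nothing was found, so the workflow is not started on an empty list.
    [ -s "$out" ]
}
```

For example, `make_target_list /path/to/assemblies target.txt` would write the list that is then passed as `TARGET="target.txt"`.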

panmanUtils functionalities

panmanUtils provides various functionalities, such as summary; extraction of raw sequences, MSA, VCF, and GFA; sub-network pruning; and many more. Please refer to the wiki for detailed information. Here we provide usage syntax and examples for summary and VCF extract.

Summary extract

The summary feature extracts node-level and tree-level statistics of a PanMAN, summarizing its geometric and parsimony information.

  • Usage Syntax
panmanUtils -I <path to PanMAN file> --summary --output-file=<prefix of output file> (optional)
  • Example
panmanUtils -I panman/sars_20.panman --summary --output-file=sars_20
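When a directory holds several PanMAN files, the same command can be applied to each one. This is a minimal sketch that uses only the flags shown above; `summarize_all` and the `PANMAN_BIN` override are hypothetical conveniences, not part of panmanUtils.

```shell
# Hypothetical wrapper: run --summary on every .panman file in a directory,
# using each file's base name as the output prefix.
# PANMAN_BIN is an assumed override; it defaults to panmanUtils on PATH.
summarize_all() {
    dir="$1"
    bin="${PANMAN_BIN:-panmanUtils}"
    for f in "$dir"/*.panman; do
        [ -e "$f" ] || continue      # skip the literal glob when nothing matches
        prefix=$(basename "$f" .panman)
        "$bin" -I "$f" --summary --output-file="$prefix" || return 1
    done
}
```

For example, `summarize_all panman` processes every file in the panman directory and stops at the first failure.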

Variant Call Format (VCF) extract

Extracts the variations of all sequences in any PanMAT of a PanMAN as a VCF file, relative to any reference sequence (ref) in that PanMAT.

  • Usage syntax
panmanUtils -I <path to PanMAN file> --vcf --reference=ref --output-file=<prefix of output file> (optional)
  • Example
panmanUtils -I panman/sars_20.panman --vcf --reference="Switzerland/SO-ETHZ-500145/2020|OU000199.2|2020-11-12" --output-file=sars_20
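Note that the reference name in the example contains `|` characters, so it must stay quoted in the shell. To sanity-check the result, the sketch below counts the variant records in a VCF file, relying on the VCF convention that header lines start with `#`. The function name is hypothetical, and because this README does not state the exact output file name, the path is passed in explicitly.

```shell
# Hypothetical helper: count variant records in a VCF file, i.e. non-empty
# lines that are not '#'-prefixed header lines.
count_vcf_records() {
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 1; }
    # grep -c prints 0 but exits non-zero when there are no records,
    # so mask that status to keep a count of 0 from looking like an error.
    grep -v '^#' "$1" | grep -c . || true
}
```

Call it with whatever VCF file panmanUtils wrote for the chosen prefix, for example `count_vcf_records <path to VCF file>`.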

Contribute

We welcome contributions from the community to enhance the capabilities of PanMAN and panmanUtils. If you encounter any issues or have suggestions for improvement, please open an issue on the PanMAN GitHub page. For general inquiries and support, reach out to our team.

Citing PanMAN

If you use PanMANs or panmanUtils in your research or publications, we kindly request that you cite the following paper:

  • Sumit Walia, Harsh Motwani, Kyle Smith, Russell Corbett-Detig, Yatish Turakhia, "Compressive Pangenomics Using Mutation-Annotated Networks", bioRxiv 2024.07.02.601807; doi: 10.1101/2024.07.02.601807