Protein2PAM: Protein Language Models for CRISPR-Cas PAM Prediction

Python bioRxiv License: Models (CC BY-NC 4.0) License: Code (Polyform Noncommercial 1.0.0)

This repository contains code for Protein2PAM, a tool that predicts CRISPR-Cas PAMs from protein sequences using protein language models. The models and their applications are described in Nayfach, S., Bhatnagar, A., Novichkov, A., et al. (2025). For a browser-based interface, try the Protein2PAM Webserver.

The figure below shows the overall Protein2PAM workflow.

flowchart LR
    A[CRISPR-Cas protein] --> B[Protein2PAM model] --> C[Predicted PAM motif]

    style A fill:#0f0f0f,stroke:#00ff99,stroke-width:2px,color:#00ff99
    style B fill:#0f0f0f,stroke:#00ff99,stroke-width:2px,color:#00ff99
    style C fill:#0f0f0f,stroke:#00ff99,stroke-width:2px,color:#00ff99

This repo has two Protein2PAM implementations:

  • a full implementation in bare PyTorch (under the protein2pam.models submodule), which includes uncertainty estimation and visualization utilities but requires users to manually download certain modeling resources.
  • a Huggingface implementation (under the protein2pam.huggingface submodule) that includes only the PAM prediction model, without uncertainty estimation. This implementation is intended for lower-level usage, but it is integrated with the Huggingface Hub for easier downloading of model weights. You can see the model collection here.


Prerequisites

Before installing Protein2PAM, ensure you have:

  • Python 3.8+: Available from python.org.
  • NVIDIA GPU with CUDA support:
    Refer to the official NVIDIA installation guide or see our step-by-step instructions here. Use nvidia-smi to confirm that GPU(s) are available on your system (a quick Python check is also shown after this list).
  • pip: If pip is not already installed, you can install it using:
    python3 -m ensurepip --upgrade
  • Mamba or Conda:
    You can install mamba using:
    wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
    bash Miniforge3-Linux-x86_64.sh
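
Once the environment from the Quickstart below has been created, a quick sanity check can be run from Python. This is a minimal sketch that only assumes PyTorch is installed as part of the protein2pam environment:

import sys
import torch  # assumed to be installed as part of the protein2pam environment

# Report the interpreter and PyTorch versions in use
print(f"Python {sys.version.split()[0]}, PyTorch {torch.__version__}")

# Confirm that PyTorch can see at least one CUDA-capable GPU
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.device_count()} device(s)")
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; check your driver installation with nvidia-smi")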

Quickstart

Download repo

git clone https://github.com/Profluent-AI/protein2pam
cd protein2pam

Create environment and install

mamba env create -f environment.yml
source activate protein2pam
pip install -e .

Download and unpack the database (only required for the protein2pam.models implementation; the protein2pam.huggingface implementation does not need it)

wget https://storage.googleapis.com/protein2pam-x83y9z7q4k/protein2pam_db_v1.0.tar
tar -xvf protein2pam_db_v1.0.tar

Usage

Python usage

Below are examples of how to run Protein2PAM locally through the Python API.

Import protein2pam and list available models:

import protein2pam
protein2pam.model_info()

For this example, we'll use the main cas9 model, the PAM-interacting domain (PID) sequence of SpCas9, and the database located at ./protein2pam_db_v1.0.

proteins  = ["VQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD"]
oracle = protein2pam.PAMOracle(
    model_name = "cas9", 
    data_dir = "../protein2pam_db_v1.0"
)
predictions = oracle.evaluate(proteins)
for prediction in predictions:
    print(prediction)

We can plot PAM logos using:

for pam in predictions:
    pam.plot_logo(
        file="example.png",
        side="downstream",
        title="example"
    )
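
If predictions contains more than one protein, a small variation of the loop above gives each logo its own filename so the images are not overwritten (a sketch using only the arguments shown above):

for i, pam in enumerate(predictions):
    pam.plot_logo(
        file=f"example_{i}.png",   # one image per input protein
        side="downstream",
        title=f"prediction {i}"
    )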

Alternatively, if you would like to use the Huggingface implementation of the PAM prediction model directly, you can call:

import torch
import torch.nn.functional as F
from protein2pam.huggingface import EsmForSequenceClassification, get_tokenizer

# Tokenize proteins so they can be consumed by the model
proteins  = ["VQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD"]
tokenizer = get_tokenizer()
encodings = tokenizer.encode_batch(proteins)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
input_batch = dict(
  input_ids=torch.tensor([encoding.ids for encoding in encodings], device=device),
  attention_mask=torch.tensor([encoding.attention_mask for encoding in encodings], device=device),
)

# Initialize model. You can use any of the supported model names (below) instead of cas9.
model = EsmForSequenceClassification.from_pretrained("Profluent-Bio/protein2pam-cas9", device_map=device)

# Get a batch of PAM probability matrices (as a torch tensor) from the model
# dimension 0 is batch size, dimension 1 is sequence position, and dimension 2 is nucleotide ID (ordered ACGT)
output = model(**input_batch)
pam_probability_matrix = F.softmax(output.logits, dim=-1)
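
The probability matrix can then be post-processed as needed. As one illustration (a sketch only, not part of the protein2pam API; the nucleotide ordering follows the comment above, and the 0.5 cutoff is an arbitrary choice), it can be collapsed into a simple consensus string with "N" at ambiguous positions:

# Illustrative post-processing of pam_probability_matrix from the snippet above
NUCLEOTIDES = "ACGT"   # ordering of dimension 2, per the comment above
CUTOFF = 0.5           # arbitrary threshold for calling a single base

for probs in pam_probability_matrix:   # probs has shape [positions, 4]
    top_prob, top_idx = probs.max(dim=-1)
    consensus = "".join(
        NUCLEOTIDES[i] if p >= CUTOFF else "N"
        for p, i in zip(top_prob.tolist(), top_idx.tolist())
    )
    print(consensus)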

Models

Models used by the Protein2PAM webserver

| Model Name | Input Protein/Domain | CRISPR Type | Samples | Literature Datasets |
|------------|----------------------|-------------|---------|---------------------|
| cas8       | Cas8 or Cas10d       | Type I      | 28,410  |                     |
| cas9       | Cas9 PI-domain       | Type II     | 15,843  | 1-9                 |
| cas12      | Cas12 protein        | Type V      | 1,720   | 10-21               |

These models are exposed on the Protein2PAM Webserver. See Data for details on training samples.
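
Assuming the model names in the table above are the same strings reported by protein2pam.model_info() and accepted by PAMOracle, switching between these models only requires changing the model_name argument. A minimal sketch:

import protein2pam

# Sketch only: confirm valid names with protein2pam.model_info() first
oracle = protein2pam.PAMOracle(
    model_name = "cas12",                  # the Type V model from the table above
    data_dir = "./protein2pam_db_v1.0"
)
# Then call oracle.evaluate(...) on Cas12 protein sequences,
# exactly as in the Usage section above.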

Additional models

In addition to the models deployed on the webserver, we also trained several variants for benchmarking and ablation studies, summarized below:

| Model Name      | Input Protein/Domain | CRISPR Type | Samples | Literature Datasets | Notes                                         |
|-----------------|----------------------|-------------|---------|---------------------|-----------------------------------------------|
| cas9_full       | Cas9 protein         | Type II     | 15,843  | 1-9                 | Full-length Cas9 trained with literature PAMs |
| cas9_full_nolit | Cas9 protein         | Type II     | 15,731  |                     | Full-length Cas9, no literature PAMs          |
| cas9_pid_nolit  | Cas9 PI-domain       | Type II     | 15,731  |                     | No literature PAMs                            |
| cas9_pid_nme    | Cas9 PI-domain       | Type II     | 15,843  | 1-9                 | Nme orthologs upweighted                      |
| cas12_no_lit    | Cas12 protein        | Type V      | 1,675   |                     | Trained without literature PAMs               |

Data

Sequences and PAM profiles used for training can be downloaded from Google Cloud:

wget https://storage.googleapis.com/protein2pam-x83y9z7q4k/protein2pam_train_seqs.tsv

The training set consists of protein:PAM pairs curated from evolutionary data and supplemented with experimental PAMs reported in the literature (see REFERENCES.md).
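
To take a first look at the file, a minimal sketch is shown below; it assumes pandas is available in your environment, and the exact column schema is whatever the file itself defines:

import pandas as pd  # assumed to be available in your environment

# Load the tab-separated training data downloaded above
train = pd.read_csv("protein2pam_train_seqs.tsv", sep="\t")

# Inspect the shape, column names, and first few protein:PAM records
print(train.shape)
print(train.columns.tolist())
print(train.head())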

An expanded set of PAM profiles associated with the CRISPR-Cas Atlas can be found at https://github.com/Profluent-AI/CRISPR-Cas-Atlas.

License

This repository contains both software and trained models under different licenses.

The full license texts are available in LICENSES.md.

For commercial licensing inquiries, please contact partnerships@profluent.bio.

Cite this work

If you use Protein2PAM in your research, please cite the following preprint:

BibTeX:

@article{nayfach2025protein2pam,
  title={Deep Learning Prediction and Customization of CRISPR-Cas PAMs},
  author={Stephen Nayfach and Aadyot Bhatnagar and Andrey Novichkov and Gabriella O. Estevam and Nahye Kim and Emily Hill and Jeffrey A. Ruffolo and Rachel Silverstein and Joseph Gallagher and Benjamin Kleinstiver and Alexander J. Meeske and Peter Cameron and Ali Madani},
  journal={bioRxiv},
  year={2025},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.01.06.631536},
  url={https://www.biorxiv.org/content/10.1101/2025.01.06.631536v1}
}
