This repository contains code for Protein2PAM, a tool that predicts CRISPR-Cas PAMs from protein sequences using protein language models. The models and their applications are described in Nayfach, S., Bhatnagar, A., Novichkov, A., et al. (2025). For a browser-based interface, try the Protein2PAM Webserver.
The figure below shows the overall Protein2PAM workflow.
flowchart LR
A[CRISPR-Cas protein] --> B[Protein2PAM model] --> C[Predicted PAM motif]
style A fill:#0f0f0f,stroke:#00ff99,stroke-width:2px,color:#00ff99
style B fill:#0f0f0f,stroke:#00ff99,stroke-width:2px,color:#00ff99
style C fill:#0f0f0f,stroke:#00ff99,stroke-width:2px,color:#00ff99
This repo has two Protein2PAM implementations:
- A full implementation in bare PyTorch (under the protein2pam.models submodule), which includes uncertainty estimation and visualization utilities but requires users to manually download certain modeling resources.
- A Hugging Face implementation (under the protein2pam.huggingface submodule) that includes only the PAM prediction model, without uncertainty estimation. This implementation is intended for lower-level usage, but it is integrated with the Hugging Face Hub for easier downloading of model weights. You can see the model collection here.
Before installing Protein2PAM, ensure you have:
- Python 3.8+: Available from python.org.
- NVIDIA GPU with CUDA support: Refer to the official NVIDIA installation guide or see our step-by-step instructions here. Use nvidia-smi to ensure GPU(s) are available on your system.
- pip: If pip is not already installed, you can install it using:

  python3 -m ensurepip --upgrade
- Mamba or Conda: You can install mamba using:

  wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
  bash Miniforge3-Linux-x86_64.sh
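
As a quick sanity check before installing, the prerequisites listed above can be probed from Python. This is a minimal sketch using only the standard library; the helper name `check_prerequisites` is ours and is not part of Protein2PAM. Note that finding `nvidia-smi` on the PATH only suggests a driver is installed; it does not confirm a usable GPU.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report which prerequisites appear to be available on this machine.

    Hypothetical helper (not part of Protein2PAM): it only looks for
    executables on the PATH and compares the interpreter version.
    """
    return {
        "python": sys.version_info >= min_python,
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "pip": any(shutil.which(tool) is not None for tool in ("pip", "pip3")),
        "mamba_or_conda": any(shutil.which(tool) is not None for tool in ("mamba", "conda")),
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Anything reported as missing can then be installed with the commands given in the list above.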
Download repo

git clone https://github.com/Profluent-AI/protein2pam
cd protein2pam

Create environment and install

mamba env create -f environment.yml
source activate protein2pam
pip install -e .

Download and unpack database (only necessary if using the protein2pam.models implementation, not the protein2pam.huggingface implementation)

wget https://storage.googleapis.com/protein2pam-x83y9z7q4k/protein2pam_db_v1.0.tar
tar -xvf protein2pam_db_v1.0.tar

Below are examples of how to run Protein2PAM locally through the Python API.
Import protein2pam and list available models:
import protein2pam
protein2pam.model_info()

For the example, we'll use the main cas9 model, the PI-domain (PID) sequence of SpCas9, and the database located at ./protein2pam_db_v1.0:
proteins = ["VQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD"]
oracle = protein2pam.PAMOracle(
    model_name="cas9",
    # Path to the unpacked database; adjust if you extracted it elsewhere
    data_dir="./protein2pam_db_v1.0",
)
predictions = oracle.evaluate(proteins)
for prediction in predictions:
    print(prediction)

We can plot PAM logos using:
for pam in predictions:
    pam.plot_logo(
        file="example.png",
        side="downstream",
        title="example"
    )

Alternatively, if you would like to use the Hugging Face implementation of the PAM prediction model directly, you can call:
import torch
import torch.nn.functional as F
from protein2pam.huggingface import EsmForSequenceClassification, get_tokenizer
# Tokenize proteins so they can be consumed by the model
proteins = ["VQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGD"]
tokenizer = get_tokenizer()
encodings = tokenizer.encode_batch(proteins)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
input_batch = dict(
    input_ids=torch.tensor([encoding.ids for encoding in encodings], device=device),
    attention_mask=torch.tensor([encoding.attention_mask for encoding in encodings], device=device),
)
# Initialize model. You can use any of the supported model names (below) instead of cas9.
model = EsmForSequenceClassification.from_pretrained("Profluent-Bio/protein2pam-cas9", device_map=device)
# Get a batch of PAM probability matrices (as a torch tensor) from the model
# dimension 0 is batch size, dimension 1 is sequence position, and dimension 2 is nucleotide ID (ordered ACGT)
output = model(**input_batch)
pam_probability_matrix = F.softmax(output.logits, dim=-1)

| Model Name | Input Protein/Domain | CRISPR Type | Samples | Literature Datasets |
|---|---|---|---|---|
| cas8 | Cas8 or Cas10d | Type I | 28,410 | |
| cas9 | Cas9 PI-domain | Type II | 15,843 | 1-9 |
| cas12 | Cas12 protein | Type V | 1,720 | 10-21 |
These models are exposed on the Protein2PAM Webserver. See Data for details on training samples.
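
The Hugging Face example above returns a probability matrix with one row per position and one column per nucleotide (ordered ACGT). The sketch below shows one way such a matrix could be summarized into a consensus string and a per-position information content, as commonly displayed in sequence logos. It is an illustration only: the function names, the `min_prob` threshold, and the example matrix are our assumptions, and this is not necessarily how Protein2PAM itself summarizes or plots predictions.

```python
import math

NUCLEOTIDES = "ACGT"  # column order stated in the Hugging Face example


def consensus_pam(matrix, min_prob=0.5):
    """Return one letter per position: the most likely base, or N if its
    probability is below min_prob (threshold chosen for illustration)."""
    letters = []
    for probs in matrix:
        best = max(range(len(NUCLEOTIDES)), key=lambda i: probs[i])
        letters.append(NUCLEOTIDES[best] if probs[best] >= min_prob else "N")
    return "".join(letters)


def information_content(matrix):
    """Per-position information content in bits: 2 minus the Shannon entropy."""
    bits = []
    for probs in matrix:
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        bits.append(2.0 - entropy)
    return bits


# Hypothetical 3-position matrix, not real model output
example = [
    [0.25, 0.25, 0.25, 0.25],
    [0.02, 0.02, 0.94, 0.02],
    [0.01, 0.01, 0.97, 0.01],
]
print(consensus_pam(example))  # prints NGG
print([round(b, 2) for b in information_content(example)])
```

To apply this to real output, a single prediction can be converted to nested lists with `pam_probability_matrix[0].detach().cpu().tolist()`.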
In addition to the models deployed on the webserver, we also trained several variants for benchmarking and ablation studies, summarized below:
| Model Name | Input Protein/Domain | CRISPR Type | Samples | Literature Datasets | Notes |
|---|---|---|---|---|---|
| cas9_full | Cas9 protein | Type II | 15,843 | 1-9 | Full-length Cas9 trained with literature PAMs |
| cas9_full_nolit | Cas9 protein | Type II | 15,731 | | Full-length Cas9, no literature PAMs |
| cas9_pid_nolit | Cas9 PI-domain | Type II | 15,731 | | No literature PAMs |
| cas9_pid_nme | Cas9 PI-domain | Type II | 15,843 | 1-9 | Nme orthologs upweighted |
| cas12_no_lit | Cas12 protein | Type V | 1,675 | | Trained without literature PAMs |
Sequences and PAM profiles used for training can be downloaded from Google Cloud:
wget https://storage.googleapis.com/protein2pam-x83y9z7q4k/protein2pam_train_seqs.tsv

The training set consists of protein:PAM pairs curated from evolutionary data and supplemented with experimental PAMs reported in the literature (see REFERENCES.md).
An expanded set of PAM profiles associated with the CRISPR-Cas Atlas can be found at https://github.com/Profluent-AI/CRISPR-Cas-Atlas
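
The column layout of the training TSV is not described in this README, so a reasonable first step after downloading is to inspect its header and size. The sketch below does this with the standard library. It assumes the file is tab-separated with a header row; if there is no header, the first data row will be reported as the column names. The helper name and the in-memory sample (including its column names) are hypothetical placeholders.

```python
import csv
import io


def summarize_tsv(handle, preview_rows=3):
    """Return (column names, row count, first few rows) for a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    preview = []
    count = 0
    for row in reader:
        if count < preview_rows:
            preview.append(row)
        count += 1
    return reader.fieldnames, count, preview


# Hypothetical in-memory sample; the real file's columns may differ
sample = "protein_id\tsequence\nex1\tMKV\nex2\tMAA\n"
columns, n_rows, preview = summarize_tsv(io.StringIO(sample), preview_rows=1)
print(columns, n_rows, preview)
```

For the downloaded file, the same function can be called inside `with open("protein2pam_train_seqs.tsv", newline="") as handle:`.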
This repository contains both software and trained models under different licenses.
- Code: Licensed under the Polyform Noncommercial License 1.0.0. The code may be used, modified, and shared for non-commercial research and academic purposes only. Commercial use is prohibited without prior written consent.
- Models and Data: Licensed under the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0). You may share and adapt the models for non-commercial use with appropriate attribution.
The full license texts are available in LICENSES.md.
For commercial licensing inquiries, please contact partnerships@profluent.bio.
If you use Protein2PAM in your research, please cite the following preprint:
BibTeX:
@article{nayfach2025protein2pam,
title={Deep Learning Prediction and Customization of CRISPR-Cas PAMs},
author={Stephen Nayfach and Aadyot Bhatnagar and Andrey Novichkov and Gabriella O. Estevam and Nahye Kim and Emily Hill and Jeffrey A. Ruffolo and Rachel Silverstein and Joseph Gallagher and Benjamin Kleinstiver and Alexander J. Meeske and Peter Cameron and Ali Madani},
journal={bioRxiv},
year={2025},
publisher={Cold Spring Harbor Laboratory},
doi={10.1101/2025.01.06.631536},
url={https://www.biorxiv.org/content/10.1101/2025.01.06.631536v1}
}