# ALLSites

A deep learning framework for predicting protein binding sites using a transformer-based architecture with convolutional encoders and attention mechanisms.

## Overview

This project implements a protein binding site prediction model that combines:

- **Convolutional Encoder**: extracts local protein features using 1D convolutions with GLU activation
- **Transformer Decoder**: processes features using multi-head attention mechanisms
- **RAdam + Lookahead Optimization**: an advanced optimization strategy for better convergence
- **ESM2 Embeddings**: pre-computed protein embeddings for feature representation
## Architecture

```
Input Protein  →  Encoder      →  Decoder         →  Classification
      ↓              ↓                ↓                   ↓
ESM2             Conv1D +        Multi-head           Binary
Embeddings       GLU             Attention            Classification
(2560D)          (Residual)      (Cross + Self)       (Binding/Non-binding)
```
## Project Structure

```
ALLSites/
├── src/
│   ├── models/
│   │   ├── model.py             # Main model architecture
│   │   ├── radam.py             # RAdam optimizer
│   │   └── lookahead.py         # Lookahead optimizer wrapper
│   ├── data/
│   │   └── data_generator.py    # Data loading and preprocessing
│   ├── utils/
│   │   ├── helpers.py           # Utility functions
│   │   └── metrics.py           # Evaluation metrics
│   └── optimizers/              # Alternative optimizer location
├── configs/
│   └── config.yaml              # Training configuration
├── dataset/                     # Data directory
├── dataset_processed/           # Pickle files generated by data preprocessing
├── logs/                        # Training logs
├── models/                      # Saved model checkpoints
├── results/                     # Evaluation results
├── train.py                     # Training script
├── preprocess.py                # Data preprocessing
└── README.md                    # This file
```
## Requirements

### Dependencies

```
# Core dependencies
torch>=1.12.0
numpy>=1.21.0
scikit-learn>=1.0.0
pyyaml>=6.0
pandas>=1.3.0

# For ESM2 embeddings
fair-esm
```

`pickle` and `pathlib` (used for data processing) are part of the Python standard library, and `torch.distributed` (used for optional distributed training) ships with `torch`, so none of them need a separate install.

### Hardware

- GPU: NVIDIA GPU with CUDA support (recommended)
- Memory: 16GB+ RAM, 8GB+ GPU memory
- Storage: 10GB+ for data and models
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd ALLSites
   ```

2. Create and activate the conda environment:

   ```bash
   conda create -n AllSites python=3.10
   conda activate AllSites
   ```

3. Install dependencies:

   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install numpy scikit-learn pyyaml pandas fair-esm
   ```

4. Verify the installation:

   ```bash
   python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
   ```
## Data Preprocessing

```bash
# Preprocess FASTA files (using CarbPI-site as an example)
python preprocess.py --input dataset/CarbPI-site/Carb-Train517.fa --output dataset_processed/CarbPI-site --split train
```

Expected file structure:

```
dataset_processed/
└── CarbPI-site/
    ├── train/
    │   ├── Carb-Train517-ESM2.pkl
    │   ├── Carb-Train517-label.pkl
    │   └── Carb-Train517-list.pkl
    ├── valid/
    │   └── ...
    └── test/
        └── ...
```
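For reference, per-residue 2560-D embeddings of this kind can be produced with the `fair-esm` package roughly as follows. This is a sketch of the general approach, not necessarily how `preprocess.py` is implemented; the example sequence is made up:

```python
import torch
import esm  # pip install fair-esm

# Load ESM2 (3B parameters); downloads the checkpoint on first use.
model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # made-up sequence
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[36])  # layer 36 = final layer of esm2_t36

# Strip the BOS/EOS tokens to get one 2560-D vector per residue.
embedding = out["representations"][36][0, 1 : len(data[0][1]) + 1]
print(embedding.shape)  # torch.Size([33, 2560])
```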
### ESM2 Checkpoint Download Issues

During the first `preprocess.py` run you should see messages like:

```
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt" to /home/<username>/.cache/torch/hub/checkpoints/esm2_t36_3B_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t36_3B_UR50D-contact-regression.pt" to /home/<username>/.cache/torch/hub/checkpoints/esm2_t36_3B_UR50D-contact-regression.pt
```

If the download takes too long or fails, you can manually fetch the files from the URLs above and place them in `/home/<username>/.cache/torch/hub/checkpoints/`. Make sure they retain their original filenames, then rerun the preprocessing script.

### Output Format

After running `preprocess.py`, three pickle files expected by the model are generated for each dataset split (a loading sketch follows the list):
- **ESM2 Embeddings** (`*-ESM2.pkl`): list of protein embedding arrays
  - Each protein: `numpy.ndarray` with shape `[seq_len, 2560]`, dtype `float32`
- **Labels** (`*-label.pkl`): list of binding site label arrays
  - Each protein: `numpy.ndarray` with shape `[seq_len]`, dtype `int32`
- **Index List** (`*-list.pkl`): protein metadata
  - Format: `[(count, id_idx, position, dataset, protein_id, seq_length), ...]`
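A minimal sketch for loading and sanity-checking these files, assuming the pickles are plain Python lists as described above (paths taken from the CarbPI-site example):

```python
import pickle
from pathlib import Path

split_dir = Path("dataset_processed/CarbPI-site/train")

def load_pickle(name):
    with open(split_dir / name, "rb") as f:
        return pickle.load(f)

embeddings = load_pickle("Carb-Train517-ESM2.pkl")  # list of [seq_len, 2560] float32 arrays
labels = load_pickle("Carb-Train517-label.pkl")     # list of [seq_len] int32 arrays
index = load_pickle("Carb-Train517-list.pkl")       # list of metadata tuples

# One embedding/label pair per protein, with matching sequence lengths.
assert len(embeddings) == len(labels) == len(index)
for emb, lab in zip(embeddings, labels):
    assert emb.shape == (lab.shape[0], 2560)

print(f"{len(embeddings)} proteins; first index entry: {index[0]}")
```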
## Configuration

Edit `configs/config.yaml` to customize training:

```yaml
data:
  train_path: "dataset_processed/CarbPI-site/train/"
  valid_path: "dataset_processed/CarbPI-site/valid/"
  test_path: "dataset_processed/CarbPI-site/test/"
  window_size: 0          # Context window (0 = no windowing)
  local_dim: 2560         # ESM2 embedding dimension
  protein_dim: 2560       # Protein feature dimension

model:
  hidden_dim: 128         # Hidden layer dimension
  n_layers: 3             # Number of encoder/decoder layers
  n_heads: 8              # Multi-head attention heads
  pf_dim: 256             # Feedforward dimension
  dropout: 0.1            # Dropout rate
  kernel_size: 7          # Convolution kernel size

training:
  batch_size: 32          # Batch size
  learning_rate: 0.0001   # Initial learning rate
  weight_decay: 0.0001    # L2 regularization
  epochs: 30              # Maximum epochs
  early_stopping: 10      # Early stopping patience
  decay_interval: 10      # LR decay frequency
  lr_decay: 0.9           # LR decay factor
  seed: 42                # Random seed

paths:
  model_dir: "models/"
  result_dir: "results/"
  experiment_name: "Carb-Train517-Val129-Test162"
```

## Training

```bash
python train.py --config configs/config.yaml
```
```bash
# Simple background execution
nohup python train.py --config configs/config.yaml > train.log 2>&1 &

# Multi-GPU training
torchrun --nproc_per_node=2 train.py --config configs/config.yaml --distributed
```

Monitor a run with:

```bash
# View training progress
tail -f train.log

# Monitor GPU usage
nvidia-smi

# View training metrics
tail -f results/output-*.txt
```

## Model Details

### Convolutional Encoder

- Input: ESM2 embeddings `[batch, seq_len, 2560]`
- Layers: multiple 1D convolutions with GLU activation and residual connections
- Output: encoded features `[batch, seq_len, hidden_dim]` (see the sketch after this list)
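A minimal sketch of one encoder block, assuming GLU gating over a doubled channel dimension and same-length padding; the class and parameter names are illustrative, not the actual API of `src/models/model.py`:

```python
import torch
import torch.nn as nn

class ConvGLUBlock(nn.Module):
    """One encoder block: Conv1D -> GLU -> residual connection (illustrative)."""

    def __init__(self, hidden_dim=128, kernel_size=7, dropout=0.1):
        super().__init__()
        # The conv outputs 2 * hidden_dim channels so GLU can gate them back to hidden_dim.
        self.conv = nn.Conv1d(hidden_dim, 2 * hidden_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.glu = nn.GLU(dim=1)  # gates along the channel axis
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim]; Conv1d expects channels first.
        h = x.transpose(1, 2)             # [batch, hidden_dim, seq_len]
        h = self.glu(self.conv(h))        # [batch, hidden_dim, seq_len]
        h = self.dropout(h.transpose(1, 2))
        return x + h                      # residual connection

x = torch.randn(32, 100, 128)   # [batch, seq_len, hidden_dim]
print(ConvGLUBlock()(x).shape)  # torch.Size([32, 100, 128])
```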
### Transformer Decoder

- Self-Attention: processes local features
- Cross-Attention: attends to the encoded protein features
- Output: classification logits `[batch, 2]` (a sketch follows this list)
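One decoder layer pairs self-attention with cross-attention to the encoder output. A sketch built on `nn.MultiheadAttention`, with illustrative names and post-norm residuals as an assumption:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Self-attention, then cross-attention to encoder features (illustrative)."""

    def __init__(self, hidden_dim=128, n_heads=8, pf_dim=256, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads,
                                                dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(hidden_dim, pf_dim), nn.ReLU(),
                                nn.Linear(pf_dim, hidden_dim))
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_dim) for _ in range(3)])

    def forward(self, x, enc):
        x = self.norms[0](x + self.self_attn(x, x, x)[0])       # self-attention
        x = self.norms[1](x + self.cross_attn(x, enc, enc)[0])  # cross-attention
        return self.norms[2](x + self.ff(x))                    # feedforward

x = torch.randn(32, 100, 128)        # local features
enc = torch.randn(32, 100, 128)      # encoder output
print(DecoderLayer()(x, enc).shape)  # torch.Size([32, 100, 128])
```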
### Optimization Strategy

- Primary: RAdam optimizer with adaptive learning rates
- Meta: Lookahead wrapper for improved convergence (wiring sketched below)
- Regularization: weight decay + dropout
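How the two optimizers compose, assuming the `RAdam` and `Lookahead` classes in `src/models/` follow the common reference implementations (the `k`/`alpha` keyword names are an assumption):

```python
import torch.nn as nn
from src.models.radam import RAdam
from src.models.lookahead import Lookahead

model = nn.Linear(2560, 2)  # stand-in for the real model

# Inner optimizer: RAdam with the learning rate / weight decay from config.yaml.
base_optimizer = RAdam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Outer wrapper: Lookahead keeps a slow copy of the weights and, every k steps,
# interpolates it toward the fast (RAdam-updated) weights by a factor alpha.
optimizer = Lookahead(base_optimizer, k=5, alpha=0.5)
```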
## Evaluation Metrics

The model reports comprehensive evaluation metrics (a scikit-learn sketch follows the list):
- ACC: Accuracy
- AUC: Area Under ROC Curve
- Rec: Recall (Sensitivity)
- Pre: Precision
- F1: F1-Score
- MCC: Matthews Correlation Coefficient
- PRC: Precision-Recall Curve AUC
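All of these can be computed with scikit-learn. A sketch, assuming binary labels and positive-class probabilities as NumPy arrays (how `src/utils/metrics.py` computes them may differ):

```python
import numpy as np
from sklearn import metrics

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the reported metrics from labels and positive-class probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "ACC": metrics.accuracy_score(y_true, y_pred),
        "AUC": metrics.roc_auc_score(y_true, y_prob),
        "Rec": metrics.recall_score(y_true, y_pred),
        "Pre": metrics.precision_score(y_true, y_pred),
        "F1":  metrics.f1_score(y_true, y_pred),
        "MCC": metrics.matthews_corrcoef(y_true, y_pred),
        "PRC": metrics.average_precision_score(y_true, y_prob),  # PR-curve AUC
    }

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.8, 0.4, 0.3, 0.9])
print(evaluate(y_true, y_prob))
```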
## Results

Training outputs are saved to:

- Models: `models/best_model.pth`
- Metrics: `results/output-*.txt`
- Logs: `logs/train_*.log`
Example results format:

```
Epoch  Time1(sec)  Time2(sec)  Loss_train  ACC_dev  AUC_dev  Rec_dev  Pre_dev  F1_dev  MCC_dev  PRC_dev  ACC_test  AUC_test  Rec_test  Pre_test  F1_test  MCC_test  PRC_test
1      1186.210    1298.588    410.146     0.949    0.948    0.698    0.388    0.498   0.496    0.539    0.947     0.947     0.727     0.401     0.517    0.516     0.571
```
## Troubleshooting

1. **CUDA Out of Memory**: reduce the batch size in `config.yaml`:

   ```yaml
   batch_size: 16  # or smaller
   ```

2. **Data Loading Errors**: check the data file format:

   ```bash
   python -c "import pickle; print(len(pickle.load(open('data.pkl', 'rb'))))"
   ```

3. **Import Errors**: add the project to the Python path:

   ```bash
   export PYTHONPATH="${PYTHONPATH}:/path/to/ALLSites"
   ```

4. **NumPy Version Issues**: `np.long` was removed in NumPy 1.24, so if you see `np.long` errors, pin an older release:

   ```bash
   pip install "numpy>=1.21.0,<1.24"
   ```
### Debug Mode

```bash
# Enable debug logging
python train.py --config configs/config.yaml --debug
```

## Performance Tips

- Data Loading: use `num_workers=4` for faster data loading
- Memory: enable gradient checkpointing for large models
- Speed: use mixed precision training with `autocast()` (see the sketch after this list)
- Distributed: scale the learning rate linearly with the number of GPUs
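A minimal mixed-precision training step with `torch.cuda.amp`, expanding on the Speed tip above; the model, optimizer, and loss are placeholders, not the project's actual training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(2560, 2).cuda()    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler() # rescales gradients to avoid fp16 underflow

x = torch.randn(32, 2560, device="cuda")
y = torch.randint(0, 2, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():      # run the forward pass in mixed precision
    loss = criterion(model(x), y)
scaler.scale(loss).backward()        # backward on the scaled loss
scaler.step(optimizer)               # unscales gradients, then steps
scaler.update()
```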
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request
## Acknowledgments

- ESM2 for protein embeddings
- PyTorch team for the deep learning framework
- Scientific Python community for tools and libraries