# ALLSites

A deep learning framework for predicting protein binding sites using a transformer-based architecture with convolutional encoders and attention mechanisms.

## Overview

This project implements a protein binding site prediction model that combines:

- **Convolutional Encoder**: extracts local protein features using 1D convolutions with GLU activation
- **Transformer Decoder**: processes features using multi-head attention mechanisms
- **RAdam + Lookahead Optimization**: an advanced optimization strategy for better convergence
- **ESM2 Embeddings**: pre-computed protein embeddings for feature representation
## Architecture

```
Input Protein  →  Encoder      →  Decoder         →  Classification
      ↓              ↓                ↓                   ↓
ESM2             Conv1D +        Multi-head           Binary
Embeddings       GLU             Attention            Classification
(2560D)          (Residual)      (Cross + Self)       (Binding/Non-binding)
```
## Project Structure

```
ALLSites/
├── src/
│   ├── models/
│   │   ├── model.py             # Main model architecture
│   │   ├── radam.py             # RAdam optimizer
│   │   └── lookahead.py         # Lookahead optimizer wrapper
│   ├── data/
│   │   └── data_generator.py    # Data loading and preprocessing
│   ├── utils/
│   │   ├── helpers.py           # Utility functions
│   │   └── metrics.py           # Evaluation metrics
│   └── optimizers/              # Alternative optimizer location
├── configs/
│   └── config.yaml              # Training configuration
├── dataset/                     # Data directory
├── dataset_processed/           # Pickle files generated by data preprocessing
├── logs/                        # Training logs
├── models/                      # Saved model checkpoints
├── results/                     # Evaluation results
├── train.py                     # Training script
├── preprocess.py                # Data preprocessing
└── README.md                    # This file
```
## Requirements

### Dependencies

```
# Core dependencies
torch>=1.12.0
numpy>=1.21.0
scikit-learn>=1.0.0
pyyaml>=6.0
pandas>=1.3.0

# For ESM2 embeddings
fair-esm
```

`pickle` and `pathlib` (used for data processing) are part of the Python standard library, and `torch.distributed` (used for optional distributed training) ships with `torch`, so none of them need a separate install.

### Hardware

- GPU: NVIDIA GPU with CUDA support (recommended)
- Memory: 16GB+ RAM, 8GB+ GPU memory
- Storage: 10GB+ for data and models
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd ALLSites
   ```

2. Create and activate the conda environment:

   ```bash
   conda create -n AllSites python=3.10
   conda activate AllSites
   ```

3. Install dependencies:

   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install numpy scikit-learn pyyaml pandas fair-esm
   ```

4. Verify the installation:

   ```bash
   python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"
   ```
## Data Preprocessing

```bash
# Preprocess FASTA files (using CarbPI-site as an example)
python preprocess.py --input dataset/CarbPI-site/Carb-Train517.fa --output dataset_processed/CarbPI-site --split train
```

Expected file structure:

```
dataset_processed/
└── CarbPI-site/
    ├── train/
    │   ├── Carb-Train517-ESM2.pkl
    │   ├── Carb-Train517-label.pkl
    │   └── Carb-Train517-list.pkl
    ├── valid/
    │   └── ...
    └── test/
        └── ...
```
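For reference, per-residue 2560-D embeddings of this kind can be produced with the `fair-esm` package roughly as follows. This is a sketch of the general approach, not necessarily how `preprocess.py` is implemented; the example sequence is made up:

```python
import torch
import esm  # pip install fair-esm

# Load ESM2 (3B parameters); downloads the checkpoint on first use.
model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # made-up sequence
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[36])  # layer 36 = final layer of esm2_t36

# Strip the BOS/EOS tokens to get one 2560-D vector per residue.
embedding = out["representations"][36][0, 1 : len(data[0][1]) + 1]
print(embedding.shape)  # torch.Size([33, 2560])
```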
### ESM2 Checkpoint Download Issues

During the first `preprocess.py` run you should see messages like:

```
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt" to /home/<username>/.cache/torch/hub/checkpoints/esm2_t36_3B_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t36_3B_UR50D-contact-regression.pt" to /home/<username>/.cache/torch/hub/checkpoints/esm2_t36_3B_UR50D-contact-regression.pt
```

If the download takes too long or fails, you can manually fetch the files from the URLs above and place them in `/home/<username>/.cache/torch/hub/checkpoints/`. Make sure they retain their original filenames, then rerun the preprocessing script.

### Output Format

After running `preprocess.py`, three pickle files expected by the model are generated for each dataset split (a loading sketch follows the list):
- **ESM2 Embeddings** (`*-ESM2.pkl`): list of protein embedding arrays
  - Each protein: `numpy.ndarray` with shape `[seq_len, 2560]`, dtype `float32`
- **Labels** (`*-label.pkl`): list of binding site label arrays
  - Each protein: `numpy.ndarray` with shape `[seq_len]`, dtype `int32`
- **Index List** (`*-list.pkl`): protein metadata
  - Format: `[(count, id_idx, position, dataset, protein_id, seq_length), ...]`
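A minimal sketch for loading and sanity-checking these files, assuming the pickles are plain Python lists as described above (paths taken from the CarbPI-site example):

```python
import pickle
from pathlib import Path

split_dir = Path("dataset_processed/CarbPI-site/train")

def load_pickle(name):
    with open(split_dir / name, "rb") as f:
        return pickle.load(f)

embeddings = load_pickle("Carb-Train517-ESM2.pkl")  # list of [seq_len, 2560] float32 arrays
labels = load_pickle("Carb-Train517-label.pkl")     # list of [seq_len] int32 arrays
index = load_pickle("Carb-Train517-list.pkl")       # list of metadata tuples

# One embedding/label pair per protein, with matching sequence lengths.
assert len(embeddings) == len(labels) == len(index)
for emb, lab in zip(embeddings, labels):
    assert emb.shape == (lab.shape[0], 2560)

print(f"{len(embeddings)} proteins; first index entry: {index[0]}")
```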
## Configuration

Edit `configs/config.yaml` to customize training:

```yaml
data:
  train_path: "dataset_processed/CarbPI-site/train/"
  valid_path: "dataset_processed/CarbPI-site/valid/"
  test_path: "dataset_processed/CarbPI-site/test/"
  window_size: 0          # Context window (0 = no windowing)
  local_dim: 2560         # ESM2 embedding dimension
  protein_dim: 2560       # Protein feature dimension

model:
  hidden_dim: 128         # Hidden layer dimension
  n_layers: 3             # Number of encoder/decoder layers
  n_heads: 8              # Multi-head attention heads
  pf_dim: 256             # Feedforward dimension
  dropout: 0.1            # Dropout rate
  kernel_size: 7          # Convolution kernel size

training:
  batch_size: 32          # Batch size
  learning_rate: 0.0001   # Initial learning rate
  weight_decay: 0.0001    # L2 regularization
  epochs: 30              # Maximum epochs
  early_stopping: 10      # Early stopping patience
  decay_interval: 10      # LR decay frequency
  lr_decay: 0.9           # LR decay factor
  seed: 42                # Random seed

paths:
  model_dir: "models/"
  result_dir: "results/"
  experiment_name: "Carb-Train517-Val129-Test162"
```

## Training

```bash
python train.py --config configs/config.yaml
```
```bash
# Simple background execution
nohup python train.py --config configs/config.yaml > train.log 2>&1 &

# Multi-GPU training
torchrun --nproc_per_node=2 train.py --config configs/config.yaml --distributed
```

Monitor a run with:

```bash
# View training progress
tail -f train.log

# Monitor GPU usage
nvidia-smi

# View training metrics
tail -f results/output-*.txt
```

## Model Details

### Convolutional Encoder

- Input: ESM2 embeddings `[batch, seq_len, 2560]`
- Layers: multiple 1D convolutions with GLU activation and residual connections
- Output: encoded features `[batch, seq_len, hidden_dim]` (see the sketch after this list)
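A minimal sketch of one encoder block, assuming GLU gating over a doubled channel dimension and same-length padding; the class and parameter names are illustrative, not the actual API of `src/models/model.py`:

```python
import torch
import torch.nn as nn

class ConvGLUBlock(nn.Module):
    """One encoder block: Conv1D -> GLU -> residual connection (illustrative)."""

    def __init__(self, hidden_dim=128, kernel_size=7, dropout=0.1):
        super().__init__()
        # The conv outputs 2 * hidden_dim channels so GLU can gate them back to hidden_dim.
        self.conv = nn.Conv1d(hidden_dim, 2 * hidden_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.glu = nn.GLU(dim=1)  # gates along the channel axis
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim]; Conv1d expects channels first.
        h = x.transpose(1, 2)             # [batch, hidden_dim, seq_len]
        h = self.glu(self.conv(h))        # [batch, hidden_dim, seq_len]
        h = self.dropout(h.transpose(1, 2))
        return x + h                      # residual connection

x = torch.randn(32, 100, 128)   # [batch, seq_len, hidden_dim]
print(ConvGLUBlock()(x).shape)  # torch.Size([32, 100, 128])
```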
### Transformer Decoder

- Self-Attention: processes local features
- Cross-Attention: attends to the encoded protein features
- Output: classification logits `[batch, 2]` (a sketch follows this list)
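One decoder layer pairs self-attention with cross-attention to the encoder output. A sketch built on `nn.MultiheadAttention`, with illustrative names and post-norm residuals as an assumption:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Self-attention, then cross-attention to encoder features (illustrative)."""

    def __init__(self, hidden_dim=128, n_heads=8, pf_dim=256, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads,
                                                dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(hidden_dim, pf_dim), nn.ReLU(),
                                nn.Linear(pf_dim, hidden_dim))
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_dim) for _ in range(3)])

    def forward(self, x, enc):
        x = self.norms[0](x + self.self_attn(x, x, x)[0])       # self-attention
        x = self.norms[1](x + self.cross_attn(x, enc, enc)[0])  # cross-attention
        return self.norms[2](x + self.ff(x))                    # feedforward

x = torch.randn(32, 100, 128)        # local features
enc = torch.randn(32, 100, 128)      # encoder output
print(DecoderLayer()(x, enc).shape)  # torch.Size([32, 100, 128])
```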
### Optimization Strategy

- Primary: RAdam optimizer with adaptive learning rates
- Meta: Lookahead wrapper for improved convergence (wiring sketched below)
- Regularization: weight decay + dropout
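How the two optimizers compose, assuming the `RAdam` and `Lookahead` classes in `src/models/` follow the common reference implementations (the `k`/`alpha` keyword names are an assumption):

```python
import torch.nn as nn
from src.models.radam import RAdam
from src.models.lookahead import Lookahead

model = nn.Linear(2560, 2)  # stand-in for the real model

# Inner optimizer: RAdam with the learning rate / weight decay from config.yaml.
base_optimizer = RAdam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Outer wrapper: Lookahead keeps a slow copy of the weights and, every k steps,
# interpolates it toward the fast (RAdam-updated) weights by a factor alpha.
optimizer = Lookahead(base_optimizer, k=5, alpha=0.5)
```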
## Evaluation Metrics

The model reports comprehensive evaluation metrics (a scikit-learn sketch follows the list):
- ACC: Accuracy
- AUC: Area Under ROC Curve
- Rec: Recall (Sensitivity)
- Pre: Precision
- F1: F1-Score
- MCC: Matthews Correlation Coefficient
- PRC: Precision-Recall Curve AUC
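All of these can be computed with scikit-learn. A sketch, assuming binary labels and positive-class probabilities as NumPy arrays (how `src/utils/metrics.py` computes them may differ):

```python
import numpy as np
from sklearn import metrics

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the reported metrics from labels and positive-class probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "ACC": metrics.accuracy_score(y_true, y_pred),
        "AUC": metrics.roc_auc_score(y_true, y_prob),
        "Rec": metrics.recall_score(y_true, y_pred),
        "Pre": metrics.precision_score(y_true, y_pred),
        "F1":  metrics.f1_score(y_true, y_pred),
        "MCC": metrics.matthews_corrcoef(y_true, y_pred),
        "PRC": metrics.average_precision_score(y_true, y_prob),  # PR-curve AUC
    }

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.8, 0.4, 0.3, 0.9])
print(evaluate(y_true, y_prob))
```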
## Results

Training outputs are saved to:

- Models: `models/best_model.pth`
- Metrics: `results/output-*.txt`
- Logs: `logs/train_*.log`
Example results format:

```
Epoch  Time1(sec)  Time2(sec)  Loss_train  ACC_dev  AUC_dev  Rec_dev  Pre_dev  F1_dev  MCC_dev  PRC_dev  ACC_test  AUC_test  Rec_test  Pre_test  F1_test  MCC_test  PRC_test
1      1186.210    1298.588    410.146     0.949    0.948    0.698    0.388    0.498   0.496    0.539    0.947     0.947     0.727     0.401     0.517    0.516     0.571
```
## Troubleshooting

1. **CUDA Out of Memory**: reduce the batch size in `config.yaml`:

   ```yaml
   batch_size: 16  # or smaller
   ```

2. **Data Loading Errors**: check the data file format:

   ```bash
   python -c "import pickle; print(len(pickle.load(open('data.pkl', 'rb'))))"
   ```

3. **Import Errors**: add the project to the Python path:

   ```bash
   export PYTHONPATH="${PYTHONPATH}:/path/to/ALLSites"
   ```

4. **NumPy Version Issues**: `np.long` was removed in NumPy 1.24, so if you see `np.long` errors, pin an older release:

   ```bash
   pip install "numpy>=1.21.0,<1.24"
   ```
### Debug Mode

```bash
# Enable debug logging
python train.py --config configs/config.yaml --debug
```

## Performance Tips

- Data Loading: use `num_workers=4` for faster data loading
- Memory: enable gradient checkpointing for large models
- Speed: use mixed precision training with `autocast()` (see the sketch after this list)
- Distributed: scale the learning rate linearly with the number of GPUs
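A minimal mixed-precision training step with `torch.cuda.amp`, expanding on the Speed tip above; the model, optimizer, and loss are placeholders, not the project's actual training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(2560, 2).cuda()    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler() # rescales gradients to avoid fp16 underflow

x = torch.randn(32, 2560, device="cuda")
y = torch.randint(0, 2, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():      # run the forward pass in mixed precision
    loss = criterion(model(x), y)
scaler.scale(loss).backward()        # backward on the scaled loss
scaler.step(optimizer)               # unscales gradients, then steps
scaler.update()
```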
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request
## Acknowledgments

- ESM2 for protein embeddings
- PyTorch team for the deep learning framework
- Scientific Python community for tools and libraries