
Accurate Identification of Protein Binding Sites for All Drug Modalities Using ALLSites

A deep learning framework for predicting protein binding sites using transformer-based architecture with convolutional encoders and attention mechanisms.

Overview

Model Architecture
This project implements a protein binding site prediction model that combines:

  • Convolutional Encoder: Extracts local protein features using 1D convolutions with GLU activation
  • Transformer Decoder: Processes features using multi-head attention mechanisms
  • RAdam + Lookahead Optimization: Advanced optimization strategy for better convergence
  • ESM2 Embeddings: Uses pre-computed protein embeddings for feature representation

Architecture

Input Protein  →  Encoder  →  Decoder  →  Classification
      ↓              ↓            ↓              ↓
    ESM2         Conv1D +    Multi-head       Binary
 Embeddings        GLU       Attention     (Binding /
   (2560D)      (Residual)  (Cross+Self)   Non-binding)

Project Structure

ALLSites/
├── src/
│   ├── models/
│   │   ├── model.py           # Main model architecture
│   │   ├── radam.py           # RAdam optimizer
│   │   └── lookahead.py       # Lookahead optimizer wrapper
│   ├── data/
│   │   └── data_generator.py  # Data loading and preprocessing
│   ├── utils/
│   │   ├── helpers.py         # Utility functions
│   │   └── metrics.py         # Evaluation metrics
│   └── optimizers/            # Alternative optimizer location
├── configs/
│   └── config.yaml            # Training configuration
├── dataset/                   # Data directory
├── dataset_processed/         # Pickle files generated by data preprocessing
├── logs/                      # Training logs
├── models/                    # Saved model checkpoints
├── results/                   # Evaluation results
├── train.py                   # Training script
├── preprocess.py              # Data preprocessing
└── README.md                  # This file

Conda Environment

conda create -n AllSites python=3.10

Dependencies

# Core dependencies
torch>=1.12.0
numpy>=1.21.0
scikit-learn>=1.0.0
pyyaml>=6.0
pandas>=1.3.0

# For ESM2 embedding
fair-esm

# Note: torch.distributed (used for distributed training) ships with PyTorch,
# and pickle / pathlib are part of the Python standard library — none of these
# require a separate install.

Hardware

  • GPU: NVIDIA GPU with CUDA support (recommended)
  • Memory: 16GB+ RAM, 8GB+ GPU memory
  • Storage: 10GB+ for data and models

Installation

  1. Clone the repository:
git clone <repository-url>
cd ALLSites
  2. Create and activate the conda environment:
conda create -n AllSites python=3.10
conda activate AllSites
  3. Install dependencies:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install numpy scikit-learn pyyaml pandas fair-esm
  4. Verify the installation:
python -c "import torch; print(f'PyTorch version: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}')"

Data Format

Data Preparation

# Preprocess FASTA files (take CarbPI-site as an example)
python preprocess.py --input dataset/CarbPI-site/Carb-Train517.fa --output dataset_processed/CarbPI-site --split train

# Expected file structure:
dataset_processed/
└── CarbPI-site/
    ├── train/
    │   ├── Carb-Train517-ESM2.pkl
    │   ├── Carb-Train517-label.pkl
    │   └── Carb-Train517-list.pkl
    ├── valid/
    │   └── ...
    └── test/
        └── ...

ESM2 Checkpoint Download Issues

During the first `preprocess.py` run, you should see download messages like:
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt" to /home/<username>/.cache/torch/hub/checkpoints/esm2_t36_3B_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t36_3B_UR50D-contact-regression.pt" to /home/<username>/.cache/torch/hub/checkpoints/esm2_t36_3B_UR50D-contact-regression.pt
If the download takes too long or fails, you can manually fetch the files from the URLs above and place them in `/home/<username>/.cache/torch/hub/checkpoints/`.
Make sure they retain their original filenames, then rerun the preprocessing script.
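
For reference, the fair-esm calls that produce the 2560-dimensional per-residue embeddings look roughly like the sketch below; the exact logic lives in `preprocess.py`, and the sequence here is a placeholder.

import torch
import esm  # pip install fair-esm

# ESM2 3B checkpoint: 36 layers, 2560-dim embeddings. Downloaded to
# ~/.cache/torch/hub/checkpoints/ on first use (see the note above).
model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Placeholder (id, sequence) pair; preprocess.py reads these from the FASTA file.
data = [("protein_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[36], return_contacts=False)

# Strip the BOS/EOS tokens to get one 2560-dim vector per residue.
emb = out["representations"][36][0, 1 : len(strs[0]) + 1]
print(emb.shape)  # torch.Size([33, 2560])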

Input Files

Running preprocess.py generates, for each dataset split, the three pickle files the model expects:

  1. ESM2 Embeddings (*-ESM2.pkl):

    • List of protein embedding arrays
    • Each protein: numpy.ndarray with shape [seq_len, 2560], dtype float32
  2. Labels (*-label.pkl):

    • List of binding site label arrays
    • Each protein: numpy.ndarray with shape [seq_len], dtype int32
  3. Index List (*-list.pkl):

    • Protein metadata
    • Format: [(count, id_idx, position, dataset, protein_id, seq_length), ...]
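
To sanity-check a processed split, the three files can be loaded directly. A minimal sketch, assuming the CarbPI-site training split produced above:

import pickle
from pathlib import Path

split_dir = Path("dataset_processed/CarbPI-site/train")

with open(split_dir / "Carb-Train517-ESM2.pkl", "rb") as f:
    embeddings = pickle.load(f)  # list of [seq_len, 2560] float32 arrays
with open(split_dir / "Carb-Train517-label.pkl", "rb") as f:
    labels = pickle.load(f)      # list of [seq_len] int32 arrays
with open(split_dir / "Carb-Train517-list.pkl", "rb") as f:
    index = pickle.load(f)       # [(count, id_idx, position, dataset, protein_id, seq_length), ...]

assert len(embeddings) == len(labels) == len(index)
print(f"{len(embeddings)} proteins; first embedding shape: {embeddings[0].shape}")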

Configuration

Edit configs/config.yaml to customize training:

data:
  train_path: "dataset_processed/CarbPI-site/train/"
  valid_path: "dataset_processed/CarbPI-site/valid/"
  test_path: "dataset_processed/CarbPI-site/test/"
  window_size: 0           # Context window (0 = no windowing)
  local_dim: 2560          # ESM2 embedding dimension
  protein_dim: 2560        # Protein feature dimension

model:
  hidden_dim: 128          # Hidden layer dimension
  n_layers: 3              # Number of encoder/decoder layers
  n_heads: 8               # Multi-head attention heads
  pf_dim: 256              # Feedforward dimension
  dropout: 0.1             # Dropout rate
  kernel_size: 7           # Convolution kernel size

training:
  batch_size: 32           # Batch size
  learning_rate: 0.0001    # Initial learning rate
  weight_decay: 0.0001     # L2 regularization
  epochs: 30               # Maximum epochs
  early_stopping: 10       # Early stopping patience
  decay_interval: 10       # LR decay frequency
  lr_decay: 0.9            # LR decay factor
  seed: 42                 # Random seed

paths:
  model_dir: "models/"
  result_dir: "results/"
  experiment_name: "Carb-Train517-Val129-Test162"
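
The file is plain YAML, so it can be inspected (or patched programmatically) with PyYAML:

import yaml

with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["model"]["hidden_dim"])     # 128
print(config["training"]["batch_size"])  # 32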

Training

Basic Training

python train.py --config configs/config.yaml

Background Training

# Simple background execution
nohup python train.py --config configs/config.yaml > train.log 2>&1 &

Distributed Training

# Multi-GPU training
torchrun --nproc_per_node=2 train.py --config configs/config.yaml --distributed
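
With torchrun, each process receives its rank through environment variables; the standard PyTorch setup that the --distributed flag is expected to perform looks like the sketch below (the usual pattern, not the repository's exact code; build_model() is a placeholder).

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)       # build_model() is a placeholder
model = DDP(model, device_ids=[local_rank])  # gradients sync across GPUs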

Monitor Training

# View training progress
tail -f train.log

# Monitor GPU usage
nvidia-smi

# View training metrics
tail -f results/output-*.txt

Model Architecture Details

Encoder (Convolutional)

  • Input: ESM2 embeddings [batch, seq_len, 2560]
  • Layers: Multiple 1D Conv + GLU + Residual connections
  • Output: Encoded features [batch, seq_len, hidden_dim]
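
A minimal PyTorch sketch of the Conv1D + GLU + residual pattern described above (illustrative only; the actual encoder lives in src/models/model.py):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUBlock(nn.Module):
    """One encoder block: 1D convolution -> GLU gate -> residual connection."""

    def __init__(self, hidden_dim: int, kernel_size: int = 7, dropout: float = 0.1):
        super().__init__()
        # 2 * hidden_dim channels: one half carries values, the other the GLU gate.
        self.conv = nn.Conv1d(hidden_dim, 2 * hidden_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim]; Conv1d expects channels first.
        h = self.conv(self.dropout(x).transpose(1, 2))
        h = F.glu(h, dim=1).transpose(1, 2)  # gating halves channels back to hidden_dim
        return x + h                         # residual connection

proj = nn.Linear(2560, 128)                   # project ESM2 features to hidden_dim
block = ConvGLUBlock(hidden_dim=128)
out = block(proj(torch.randn(4, 100, 2560)))  # -> [4, 100, 128]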

Decoder (Transformer)

  • Self-Attention: Processes local features
  • Cross-Attention: Attends to encoded protein features
  • Output: Classification logits [batch, 2]
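
Schematically, one decoder block chains the two attention types. A sketch using PyTorch's built-in attention (the feedforward sublayer controlled by pf_dim is omitted for brevity):

import torch.nn as nn

class DecoderBlock(nn.Module):
    """Self-attention over decoder features, then cross-attention to the encoder."""

    def __init__(self, hidden_dim: int = 128, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, n_heads,
                                               dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads,
                                                dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, x, enc_out):
        h, _ = self.self_attn(x, x, x)               # attend within the sequence
        x = self.norm1(x + h)
        h, _ = self.cross_attn(x, enc_out, enc_out)  # attend to encoded protein features
        return self.norm2(x + h)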

Optimization

  • Primary: RAdam optimizer with adaptive learning rates
  • Meta: Lookahead wrapper for improved convergence
  • Regularization: Weight decay + dropout
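
Putting the two together, Lookahead wraps RAdam as the inner optimizer. A sketch assuming the common open-source constructor signatures (k and alpha are the Lookahead paper's defaults, and model is whatever nn.Module you are training; check src/models/lookahead.py for the actual interface):

from src.models.radam import RAdam
from src.models.lookahead import Lookahead

# Inner optimizer: RAdam with the learning rate / weight decay from config.yaml.
base_optimizer = RAdam(model.parameters(), lr=1e-4, weight_decay=1e-4)
# Every k fast steps, Lookahead pulls slow weights toward the fast weights by alpha.
optimizer = Lookahead(base_optimizer, k=5, alpha=0.5)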

Evaluation Metrics

The model reports comprehensive evaluation metrics:

  • ACC: Accuracy
  • AUC: Area Under ROC Curve
  • Rec: Recall (Sensitivity)
  • Pre: Precision
  • F1: F1-Score
  • MCC: Matthews Correlation Coefficient
  • PRC: Precision-Recall Curve AUC
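
All seven are available in scikit-learn; a sketch of computing them from per-residue predictions:

import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score, matthews_corrcoef,
                             average_precision_score)

def report(y_true, y_score, threshold=0.5):
    """y_true: binary labels per residue; y_score: predicted binding probabilities."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
        "Rec": recall_score(y_true, y_pred),
        "Pre": precision_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "PRC": average_precision_score(y_true, y_score),  # PR-curve AUC
    }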

Results

Training outputs are saved to:

  • Models: models/best_model.pth
  • Metrics: results/output-*.txt
  • Logs: logs/train_*.log

Example results format:

Epoch	Time1(sec)	Time2(sec)	Loss_train	ACC_dev	AUC_dev	Rec_dev	Pre_dev	F1_dev	MCC_dev	PRC_dev	ACC_test	AUC_test	Rec_test	Pre_test	F1_test	MCC_test	PRC_test
1	1186.210	1298.588	410.146	0.949	0.948	0.698	0.388	0.498	0.496	0.539	0.947	0.947	0.727	0.401	0.517	0.516	0.571

Troubleshooting

Common Issues

  1. CUDA Out of Memory:

    # Reduce batch size in config.yaml
    batch_size: 16  # or smaller
  2. Data Loading Errors:

    # Check data file format
    python -c "import pickle; print(len(pickle.load(open('data.pkl', 'rb'))))"
  3. Import Errors:

    # Add project to Python path
    export PYTHONPATH="${PYTHONPATH}:/path/to/ALLSites"
  4. NumPy Version Issues:

    # Pin NumPy below 1.24 if you see np.long errors (the alias was removed
    # in NumPy 1.24); quote the spec so the shell doesn't treat ">" as a redirect
    pip install "numpy>=1.21.0,<1.24"

Debug Mode

# Enable debug logging
python train.py --config configs/config.yaml --debug

Performance Tips

  1. Data Loading: Use num_workers=4 for faster data loading
  2. Memory: Enable gradient checkpointing for large models
  3. Speed: Use mixed precision training with autocast() (see the sketch after this list)
  4. Distributed: Scale learning rate linearly with number of GPUs
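
For tip 3, the standard PyTorch mixed-precision pattern looks like this (a sketch; model, loader, criterion, and optimizer are placeholders, and train.py may or may not already do this):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch, labels in loader:
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in reduced precision
        loss = criterion(model(batch), labels)
    scaler.scale(loss).backward()      # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)             # unscale gradients, then apply the step
    scaler.update()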

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit changes (git commit -am 'Add new feature')
  4. Push to branch (git push origin feature/new-feature)
  5. Create a Pull Request

Acknowledgments

  • ESM2 for protein embeddings
  • PyTorch team for the deep learning framework
  • Scientific Python community for tools and libraries
