ESMDisPred

ESMDisPred: A Structure-Aware CNN–Transformer Framework for Intrinsically Disordered Protein (IDP) Prediction.


Overview

ESMDisPred is a deep learning framework that integrates convolutional and Transformer-based architectures to predict intrinsically disordered regions in proteins. By combining sequence embeddings, evolutionary features, and structural information, ESMDisPred achieves high predictive accuracy and generalization across diverse protein families.


Model Variants

  • ESMDisPred-1
    Utilizes sequence-based features from DisPredict3.0 and ESM-1 embeddings.

  • ESMDisPred-2
    Extends ESMDisPred-1 by incorporating ESM-2 embeddings, providing richer contextual protein representations.

  • ESMDisPred-2PDB
    Builds upon ESMDisPred-2 by integrating structure-related features derived from PDB data, enhancing structural context awareness.

  • ESMDisPred-DNN
    A comprehensive CNN–Transformer hybrid model trained using all feature types from DisPredict3.0, ESM-1, ESM-2, and PDB-derived structural descriptors.
    This variant captures both local residue patterns (via CNNs) and long-range dependencies (via Transformers), resulting in superior predictive performance.


Requirements

  • OS: Ubuntu 20.04 (tested)
  • Python: via pyenv (recommended). Python 3.9–3.10 typically works best with PyTorch.
  • Hardware: CPU works; CUDA GPU strongly recommended for speed (CUDA 11+).
  • Disk: ≥ 20 GB free (models + features cache).
  • Tools (optional): Docker 24+ or Singularity 3.9+.

Tip: If you plan to run ESM models on GPU, ensure nvidia-driver, nvidia-container-toolkit (for Docker), and a CUDA-enabled PyTorch build are available.
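
The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the repository: the function name preflight and the report keys are hypothetical, and the 20 GB threshold is simply taken from the Disk requirement (computed here in GiB). It only reports what it finds; it does not install or configure anything.

```python
import importlib.util
import shutil
import sys


def preflight(path=".", min_free_gb=20):
    """Collect a few environment facts relevant to the requirements above."""
    # Free space on the filesystem holding `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check also runs without PyTorch
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is reported as "no CUDA" rather than crashing.
            pass
    return report


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a machine with a GPU, the usual suspects are a CPU-only PyTorch build or a missing NVIDIA driver.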


Install & Data

# 1) Get the code
git clone https://github.com/wasicse/ESMDisPred.git
cd ESMDisPred

# 2) Download pre-trained model bundles (large)
./run_downloadLargeModels.sh

Dataset examples are under dataset/. A demo FASTA is provided at example/sample.fasta.


Quickstart (Local OS)

Warning: Running locally without Docker may fail due to missing system libraries (e.g., libidn.so.11) or Python package version conflicts (e.g., scikit-learn, lightgbm). We strongly recommend using Docker or Singularity for consistent and reproducible results. See the Run with Docker and Run with Singularity sections below.

A) Install Dependencies (one command)

# from repo root
./install_dependencies.sh

The script will set up Python, packages, and expected folders (including largeModels/).

B) Run a Prediction

# from repo root
./run_ESMDisPred.sh "$(pwd)/example/sample.fasta" outputs

  • Input: path to a FASTA file (may contain one or more sequences)
  • Output: results in the outputs/ directory
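
Because the input may hold several sequences, it can help to list the records in a FASTA file before starting a long run. The sketch below is an illustration only: read_fasta and check_fasta are hypothetical helpers, not part of ESMDisPred, and they check structure (headers, non-empty sequences) without assuming which residue letters the pipeline accepts.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def check_fasta(path):
    """Return [(header, length), ...] and fail early on empty input or records."""
    with open(path) as handle:
        records = list(read_fasta(handle))
    if not records:
        raise ValueError(f"{path} contains no FASTA records")
    for header, seq in records:
        if not seq:
            raise ValueError(f"record '{header}' has an empty sequence")
    return [(header, len(seq)) for header, seq in records]
```

For example, check_fasta("example/sample.fasta") returns one (header, length) pair per sequence in the demo file.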

Run with Docker

You can build the image or pull it from the registry.

Option 1: Build from source

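# Build directly from the GitHub repository (master branch)
docker build -t wasicse/esmdispred https://github.com/wasicse/ESMDisPred.git#master

# Build the version2 image from a local clone using Dockerfile2 (run from the repo root)
docker build -t wasicse/esmdispred:version2 -f ./Dockerfile2 .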

Option 2: Pull prebuilt image

docker pull wasicse/esmdispred:version2

Run

The helper script mounts your input FASTA, largeModels/, and outputs/ into the container:

# Interactive mode - will prompt for model selection
./run_ESMDisPred_Docker.sh "$(pwd)/example/sample.fasta" outputs

# Non-interactive mode - specify model as 3rd parameter
./run_ESMDisPred_Docker.sh "$(pwd)/example/sample.fasta" outputs 3
./run_ESMDisPred_Docker.sh "$(pwd)/example/sample.fasta" outputs ESMDisPred-2PDB
./run_ESMDisPred_Docker.sh "$(pwd)/example/sample.fasta" outputs all

Model options (3rd parameter):

  • 1 or ESMDisPred-1 - DisPredict3.0 + ESM-1
  • 2 or ESMDisPred-2 - DisPredict3.0 + ESM-1 + ESM-2
  • 3 or ESMDisPred-2PDB - DisPredict3.0 + ESM-1 + ESM-2 + PDB features
  • 4 or ESMDisPred-DNN - CNN–Transformer hybrid
  • 5 or all - Run ALL models

If the 3rd parameter is omitted, the script will prompt you interactively to select a model.
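
When the helper script is called from another program, the number-or-name convention above can be validated before launching a container. This is a minimal sketch under stated assumptions: MODEL_CHOICES, resolve_model, and docker_command are hypothetical names, the mapping is copied from the list above, and matching is assumed to be case-sensitive because the README does not say otherwise. The sketch only builds the argument list; it does not run Docker.

```python
import os
import shlex

# Number -> name mapping, copied from the model options listed above.
MODEL_CHOICES = {
    "1": "ESMDisPred-1",
    "2": "ESMDisPred-2",
    "3": "ESMDisPred-2PDB",
    "4": "ESMDisPred-DNN",
    "5": "all",
}


def resolve_model(choice):
    """Map a number or name from the list above to its name; reject anything else."""
    choice = str(choice).strip()
    if choice in MODEL_CHOICES:
        return MODEL_CHOICES[choice]
    if choice in MODEL_CHOICES.values():
        return choice
    raise ValueError(f"unknown model option: {choice!r}")


def docker_command(fasta, out_dir, model=None):
    """Build the argument list for run_ESMDisPred_Docker.sh.

    The FASTA path is made absolute, mirroring the "$(pwd)/..." usage above.
    Leaving `model` as None omits the 3rd parameter, so the script prompts.
    """
    cmd = ["./run_ESMDisPred_Docker.sh", os.path.abspath(fasta), str(out_dir)]
    if model is not None:
        cmd.append(resolve_model(model))
    return cmd


if __name__ == "__main__":
    print(shlex.join(docker_command("example/sample.fasta", "outputs", 3)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.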


Run with Singularity

Build from definition

sudo singularity build ESMDispS.sif ESMDispS.def

sudo singularity run --writable-tmpfs \
  -B "$(pwd)/example/sample.fasta:/opt/ESMDisPred/example/sample.fasta" \
  -B "$(pwd)/largeModels:/opt/ESMDisPred/largeModels" \
  -B "$(pwd)/outputs:/opt/ESMDisPred/outputs:rw" \
  ESMDispS.sif

# Inside the container:
cd /opt/ESMDisPred && ./run_ESMDisPred.sh "$(pwd)/example/sample.fasta" outputs

Build from Docker image

singularity pull esmdispred.sif docker://wasicse/esmdispred:latest

sudo singularity run --writable-tmpfs \
  -B "$(pwd)/example/sample.fasta:/opt/ESMDisPred/example/sample.fasta" \
  -B "$(pwd)/largeModels:/opt/ESMDisPred/largeModels" \
  -B "$(pwd)/outputs:/opt/ESMDisPred/outputs:rw" \
  esmdispred.sif

# Inside the container:
cd /opt/ESMDisPred && ./run_ESMDisPred.sh "$(pwd)/example/sample.fasta" outputs

Output

Inside outputs/ you’ll find:

  • PROTEINID.caid — per-residue disorder probabilities (one file per sequence), tab- or space-delimited.
  • timings.csv — wall-clock timings per stage/model.
  • Subfolders by model variant (e.g., ESMDisPred-1/, ESMDisPred-2/, …) when applicable.

File format (*.caid)

# Example (columns may include: residue_index, residue, probability)
1   M   0.034
2   A   0.071
...
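
To work with these files downstream, the rows can be read into tuples and grouped into contiguous disordered segments. The sketch below assumes the three-column layout shown in the example (residue_index, residue, probability); the actual files may carry extra columns or header lines, which the parser ignores. parse_caid and disordered_regions are hypothetical helpers, and the 0.5 cutoff is an arbitrary illustration, not a threshold recommended by the authors.

```python
def parse_caid(lines):
    """Parse rows of 'index residue probability', skipping comments and headers."""
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", ">")):
            continue
        fields = line.split()  # handles both tab- and space-delimited files
        if len(fields) < 3:
            continue
        try:
            rows.append((int(fields[0]), fields[1], float(fields[2])))
        except ValueError:
            continue  # not a data row
    return rows


def disordered_regions(rows, threshold=0.5):
    """Return (start, end) residue-index pairs where probability >= threshold."""
    regions, start, prev = [], None, None
    for index, _residue, prob in rows:
        if prob >= threshold:
            if start is None:
                start = index  # a new segment begins here
            prev = index
        elif start is not None:
            regions.append((start, prev))  # the segment ended at the previous residue
            start = None
    if start is not None:
        regions.append((start, prev))
    return regions
```

Typical use is to open a PROTEINID.caid file, pass the file handle to parse_caid, and then call disordered_regions on the result with a cutoff suited to the analysis.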

Citation

If you use ESMDisPred, please cite:

  1. Md Wasi Ul Kabir, Ayon Dey, Farzeen Nafees, and Md Tamjidul Hoque. "ESMDisPred: A Structure-Aware CNN-Transformer Architecture for Intrinsically Disordered Protein Prediction." bioRxiv (2026). https://doi.org/10.64898/2026.01.22.701204.

  2. Md Wasi Ul Kabir and Md Tamjidul Hoque. "DisPredict3.0: Prediction of Intrinsically Disordered Regions/Proteins Using Protein Language Model." Applied Mathematics and Computation 472 (July 2024): 128630. https://doi.org/10.1016/j.amc.2024.128630.

BibTeX
@article{Kabir2026ESMDisPred,
  author = {Kabir, Md Wasi Ul and Dey, Ayon and Nafees, Farzeen and Hoque, Md Tamjidul},
  title = {ESMDisPred: A Structure-Aware CNN-Transformer Architecture for Intrinsically Disordered Protein Prediction},
  year = {2026},
  doi = {10.64898/2026.01.22.701204},
  publisher = {Cold Spring Harbor Laboratory},
  journal = {bioRxiv}
}

@article{Kabir2024DisPredict3,
  title = {DisPredict3.0: Prediction of Intrinsically Disordered Regions/Proteins Using Protein Language Model},
  author = {Kabir, Md Wasi Ul and Hoque, Md Tamjidul},
  journal = {Applied Mathematics and Computation},
  volume = {472},
  pages = {128630},
  year = {2024},
  doi = {10.1016/j.amc.2024.128630}
}

Authors & Contact

Md Wasi Ul Kabir, Md Tamjidul Hoque
Questions/Issues: Md Tamjidul Hoque (thoque@uno.edu)

