This repository provides the code to reproduce the experiments presented in the paper *Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences*. The code is organized across the following repositories:
- Prot-xLSTM (current repository)
- DNA-xLSTM
- Chem-xLSTM
## Installation

```bash
git clone https://github.com/ml-jku/Prot-xLSTM.git
cd Prot-xLSTM
conda env create -f prot_xlstm_env.yaml
conda activate prot_xlstm
pip install -e .
```
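As a quick sanity check, the package should now be importable (a minimal check; `protxlstm` is the package installed by the editable install above):

```bash
python -c "import protxlstm"
```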
This package also supports the use of ProtMamba and ProtTransformer++ models. If you want to enable flash attention for the transformer model, install `flash-attn` separately.
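For example (a sketch; the exact command depends on your CUDA/PyTorch setup, see the flash-attn documentation):

```bash
pip install flash-attn --no-build-isolation
```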
Model weights and the processed dataset can be downloaded here. To reproduce the results, place the model weights in a `checkpoints/` folder and copy the dataset to the `data/` folder.
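For instance, assuming the downloads were unpacked somewhere locally (the source paths below are placeholders):

```bash
mkdir -p checkpoints data
mv /path/to/downloaded/weights/* checkpoints/
mv /path/to/downloaded/dataset/* data/
```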
## Getting Started

For an easy start with Prot-xLSTM applications, we provide two sample notebooks:

- `examples/generation.ipynb`: This notebook demonstrates how to generate and evaluate novel protein sequences based on a set of context sequences.
- `examples/variant_fitness.ipynb`: This notebook enables you to assess the mutational effects of amino acid substitutions on a target sequence, with the option to include context proteins as well.
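To open a notebook, for example (assumes Jupyter is available in the environment):

```bash
jupyter notebook examples/generation.ipynb
```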
## Repository Structure

- `configs/`: Configuration files for model training.
- `data/`: Train, validation and test splits of the dataset.
- `evaluation/`: Scripts to reproduce experiments and figures from the paper.
  - `evaluation/generation/`: Homology-conditioned sequence generation.
  - `evaluation/pretraining/`: Learning curves and test set performance.
  - `evaluation/proteingym/`: ProteinGym DMS substitution benchmark.
- `examples/`: Example notebooks for Prot-xLSTM applications.
- `protxlstm/`: Implementation of Prot-xLSTM.
## Training

### Dataset

Download the preprocessed data from here, or download the raw multiple-sequence alignment files from the OpenProteinSet dataset using (replace `data/a3m_files/` with your preferred local destination):

```bash
aws s3 cp s3://openfold/uniclust30/data/a3m_files/ data/a3m_files/ --recursive --no-sign-request --exclude "*" --include "*.a3m"
```
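To verify the download before preprocessing (assuming the `data/a3m_files/` destination used above):

```bash
find data/a3m_files -name "*.a3m" | wc -l
```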
Then preprocess the data with (this takes several hours!):

```bash
python protxlstm/data.py
```
### Model Training

To train a Prot-xLSTM model, set the desired model parameters in `configs/xlstm_default_config.yaml` and the dataset/training parameters in `configs/train_default_config.yaml`, and run:

```bash
python protxlstm/train.py
```

ProtMamba and Transformer++ models can be trained using the following:

```bash
python protxlstm/train.py --model_config_path=configs/mamba_default_config.yaml
python protxlstm/train.py --model_config_path=configs/llama_default_config.yaml
```
## Evaluation

### Test Set Performance

To evaluate a model on the test set, provide the name of the checkpoint folder (located in `checkpoints/`) and the context length you want to evaluate on, and run:

```bash
python evaluation/pretraining/evaluate.py --model_name protxlstm_102M_60B --model_type xlstm --context_len 131072
```
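To sweep several context lengths for the same checkpoint, the call can be looped (a sketch reusing only the flags shown above; the specific lengths are illustrative):

```bash
for ctx in 2048 16384 131072; do
  python evaluation/pretraining/evaluate.py --model_name protxlstm_102M_60B --model_type xlstm --context_len $ctx
done
```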
### Sequence Generation

To reproduce the results on the sequence generation downstream task, set the checkpoint path in `evaluation/generation/run_sample_sequences.py` and run the script using:

```bash
python evaluation/generation/run_sample_sequences.py
```

To score the generated sequences, run:

```bash
python evaluation/generation/run_score_sequences.py
```

This will generate a dataframe for each selected protein cluster, containing all generated sequences and their relevant metrics.
### ProteinGym

To evaluate the model on the ProteinGym DMS Substitutions benchmark, first download the DMS data directly from the ProteinGym website. Additionally, download the ColabFold MSAs used as context from this link. Place both the DMS data and the ColabFold MSA files into a `data/proteingym` directory, and then run:

```bash
python evaluation/proteingym/run.py
```
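The expected placement is roughly as follows (a sketch; the exact archive and folder names depend on how the downloads are packaged):

```bash
mkdir -p data/proteingym
mv /path/to/DMS_substitutions data/proteingym/   # hypothetical folder name
mv /path/to/colabfold_msas data/proteingym/      # hypothetical folder name
```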
## Acknowledgements

The underlying code was adapted from the ProtMamba repository and includes original code from the xLSTM repository.
## Citation

```bibtex
@article{schmidinger2024bio-xlstm,
  title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
  author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
  journal={arXiv},
  year={2024}
}
```