This project implements a Variational Autoencoder (VAE) to encode and generate protein sequences using a one-hot encoded representation of amino acids.
- Supports multi-sequence FASTA files for input.
- Uses a one-hot encoding scheme for amino acids, including a padding token (see the sketch after this list).
- Trains a VAE to learn a latent representation of protein sequences.
- Generates new protein-like sequences by sampling from the latent space.
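The exact encoding lives in the project's source code; as a rough illustration only, a minimal one-hot scheme over the 20 standard amino acids plus a padding symbol could look like the sketch below. The alphabet, padding character, and fixed sequence length are assumptions for the example, not necessarily what ProtVAE uses.

```python
import numpy as np

# Hypothetical alphabet: 20 standard amino acids plus a padding symbol "-".
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_TO_IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(sequence: str, max_len: int) -> np.ndarray:
    """Pad/truncate `sequence` to `max_len` and return a (max_len, len(ALPHABET)) one-hot matrix."""
    padded = sequence[:max_len].ljust(max_len, "-")
    encoding = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(padded):
        encoding[pos, AA_TO_IDX[aa]] = 1.0
    return encoding

# Example: a short peptide becomes a fixed-size matrix the VAE can consume.
print(one_hot_encode("MKTAYIAK", max_len=10).shape)  # (10, 21)
```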
- Clone this repository:
git clone https://github.com/annadiarov/ProtVAE
cd ProtVAE
- Install the required dependencies (we recommend using a virtual environment and installing PyTorch following the instructions on the official PyTorch website):
pip install -r requirements.txt
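To confirm the environment is set up, you can check that PyTorch imports correctly and whether a GPU is visible. This is a generic sanity check, not a script shipped with the repository:

```python
import torch

# Print the installed PyTorch version and whether CUDA is available.
print(torch.__version__)
print(torch.cuda.is_available())
```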
To train the VAE, run the following command:
python src/training.py --fasta_file data/example.fasta --epochs 50 --batch_size 32
This will train the VAE on the sequences in data/example.fasta for 50 epochs using a batch size of 32.
The trained weights will be saved as vae_weights.pth by default but can be
changed using the --output-weights argument.
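The --fasta_file argument expects a standard multi-sequence FASTA file. The contents of data/example.fasta are not reproduced here; an input file of the same form might look like this (the identifiers and sequences below are made up for illustration):

```
>seq1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>seq2
MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS
```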
After training the VAE, you can generate new sequences by running:
python src/sampling.py --weights vae_weights.pth --num_samples 10
This will generate 10 new sequences using the trained VAE weights and write them to generated_sequences.fasta by default; the output path can be changed using the --output-file argument.
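Assuming the output follows the standard FASTA layout (a header line per record followed by its sequence), the generated sequences can be loaded without extra dependencies as sketched below; the `read_fasta` helper is illustrative and not part of the repository:

```python
def read_fasta(path: str) -> dict:
    """Parse a FASTA file into a {header: sequence} dictionary."""
    records, header, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

# Example: print the length of each generated sequence.
for name, seq in read_fasta("generated_sequences.fasta").items():
    print(name, len(seq))
```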
This project is licensed under the MIT License - see the LICENSE file for details.