omics-lab/ProtSEC: Protein Sequence Embedding in Complex Space

ProtSEC architecture for protein sequence embedding

ProtSEC (Protein Sequence Embedding in Complex Space) is an ultrafast method for embedding protein sequences using the discrete Fourier transform. Unlike large protein language models (PLMs), ProtSEC requires no training on sequence data. It is 20,000× faster and uses 85× less memory than popular models such as esm2_3B, esm2_35M, prot_t5 and prot_bert. ProtSEC is lightweight enough to run on a personal computer or laptop, even when processing large protein sequence datasets.
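As a rough illustration of the idea, the sketch below maps each residue to a complex number and takes a discrete Fourier transform to obtain a fixed-length complex vector. It is not ProtSEC's implementation: the unit-circle encoding, the `dft_embed` name, and the truncate-or-zero-pad rule are assumptions made for this example, and the repository's own `amino_acid_encoder.py` and `embedder.py` may work differently. The transform is written with the standard library only; `numpy.fft.fft(signal, n=dim)` computes the same coefficients much faster.

```python
import cmath

# Hypothetical encoding: the 20 standard amino acids are spread evenly on the
# unit circle. ProtSEC's real encoder may use a different mapping.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ENCODING = {
    aa: cmath.exp(2j * cmath.pi * i / len(AMINO_ACIDS))
    for i, aa in enumerate(AMINO_ACIDS)
}


def dft_embed(sequence, dim=16):
    """Return a list of `dim` complex DFT coefficients for a protein sequence."""
    # Encode residues as complex numbers; unknown symbols map to 0.
    signal = [ENCODING.get(aa, 0j) for aa in sequence.upper()]
    # Truncate or zero-pad so every sequence yields exactly `dim` samples.
    signal = (signal + [0j] * dim)[:dim]
    # Naive O(dim^2) discrete Fourier transform.
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / dim) for t, x in enumerate(signal))
        for k in range(dim)
    ]


embedding = dft_embed("MKTAYIAKQR", dim=16)
print(len(embedding), type(embedding[0]).__name__)  # 16 complex
```

Because every sequence is mapped to the same number of coefficients, sequences of different lengths become directly comparable vectors, and no model weights are involved.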

1. Requirements

Python >= 3.10
Linux or macOS >= 13.5
Minimum 10GB disk space
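A quick preflight check along these lines can confirm the requirements before installing. `check_requirements` is a hypothetical helper, not part of the repository, and it checks only the operating system family, not the macOS version.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 10)  # from the requirements listed above
MIN_FREE_GB = 10      # from the requirements listed above


def check_requirements(path="."):
    """Return a list of unmet requirements (empty if everything is met)."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append(f"Python {platform.python_version()} is older than 3.10")
    if platform.system() not in ("Linux", "Darwin"):  # Darwin is macOS
        problems.append(f"unsupported operating system: {platform.system()}")
    # Free space on the filesystem that holds `path`, in decimal gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.1f} GB free, need {MIN_FREE_GB} GB")
    return problems


for problem in check_requirements():
    print("WARNING:", problem)
```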

2. Installation

Clone the repository and navigate to the project directory

git clone https://github.com/omics-lab/ProtSEC/
cd ProtSEC/

Create a virtual environment and activate it

python3 -m venv venv
source venv/bin/activate

Upgrade pip and install the required dependencies
Time: ~2 minutes on a personal computer

pip install --upgrade pip
pip install -r requirements.txt

3. Run ProtSEC

Generate complex embeddings from a FASTA file
Time: ~1 second on a personal computer using the example data
Output file: protein_embedding_ProtSEC.pkl (size ~80MB)

python3 protsec.py \
    --fasta_path ./analysis/uniprot_data/uniprot_sprot_5000.fasta
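The layout of the object stored in the output file is not described here, so a reasonable first step is to load it and look at its top-level type. The sketch below does only that; `describe_pickle` is a hypothetical helper, and the file name is the documented default for `--out_file`. Run it inside the same virtual environment so that any classes the pickle refers to can be imported.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and summarise its top-level object."""
    # Only unpickle files you created yourself: pickle can execute code.
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"type={type(obj).__name__}, len={size}"


out_file = Path("protein_embedding_ProtSEC.pkl")  # default --out_file
if out_file.exists():
    print(describe_pickle(out_file))
else:
    print(f"{out_file} not found; run protsec.py first")
```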

Usage

python3 protsec.py -h
usage: protsec.py [-h] --fasta_path FASTA_PATH [--dim DIM] [--num_threads NUM_THREADS] [--dim_reduct {UMAP,t-SNE,MDS}] [--dist_func {SMS,ASMP,SNN}]
                  [--out_file OUT_FILE]

Build protein vector database from FASTA file

options:
  -h, --help            show this help message and exit
  --fasta_path FASTA_PATH
                        Path to input FASTA file (default: None)
  --dim DIM             Dimensionality of the embeddings (default: 1024)
  --num_threads NUM_THREADS
                        Number of worker threads to use (default: 1/4 of CPU cores) (default: 6)
  --dim_reduct {UMAP,t-SNE,MDS}
                        Algorithm for dimensionality reduction (default: MDS)
  --dist_func {SMS,ASMP,SNN}
                        Distance function for computing distance (default: ASMP)
  --out_file OUT_FILE   Output file path for the embeddings (default: protein_embedding_ProtSEC.pkl)

Generate pairwise phase correlation matrix using ProtSEC

To select the optimal embedding dimension d, please check the recommendations here
The default value of -d is 1024
Time: ~1 minute for a total of 12,497,500 unique pairs using 8 threads on a personal computer

fasta=./analysis/uniprot_data/uniprot_sprot_5000.fasta
python3 get_phase_dist_mat_optimized.py -d 1024 -i $fasta -o ProtSEC_matrix.csv
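The pair count quoted above is the number of unordered pairs among n sequences, n(n-1)/2. A short check, assuming the example FASTA file holds exactly 5,000 records (as its name suggests):

```python
import math

n_sequences = 5000  # assumed record count of uniprot_sprot_5000.fasta
unique_pairs = math.comb(n_sequences, 2)  # n * (n - 1) / 2
print(unique_pairs)  # 12497500
```

Since the count grows quadratically, doubling the number of sequences roughly quadruples the number of pairs to compute.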

Usage

python3 get_phase_dist_mat_optimized.py -h
usage: get_phase_dist_mat_optimized.py [-h] --input INPUT --output OUTPUT [--dim DIM] [--processes PROCESSES]
                                       [--method {symmetric}]

Compute pairwise distances of protein sequences from phase correlation (Optimized)

options:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        Input FASTA file
  --output OUTPUT, -o OUTPUT
                        Output CSV file path
  --dim DIM, -d DIM     FFT dimension (default: 1024)
  --processes PROCESSES, -p PROCESSES
                        Number of processes (default: auto)
  --method {symmetric}, -m {symmetric}
                        Optimization method: symmetric (always optimal)
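For readers unfamiliar with phase correlation, the sketch below shows the generic signal-processing technique: normalise the cross-power spectrum of two signals to unit magnitude, transform back, and read off the peak. It is an assumption that this resembles what the script computes; the function names, the `eps` guard, and the use of the peak height as a similarity score are illustrative only, and the definitions of the SMS, ASMP and SNN distance functions are not given here. The code uses a naive standard-library DFT to stay self-contained.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) DFT (or inverse DFT) of a list of numbers."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [
        sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def phase_correlation_peak(spec_a, spec_b, eps=1e-12):
    """Peak height of the phase correlation between two equal-length spectra."""
    # Cross-power spectrum, normalised so only the phase difference remains.
    cross = [a * b.conjugate() for a, b in zip(spec_a, spec_b)]
    cross = [c / max(abs(c), eps) for c in cross]
    # In the signal domain, a sharp peak means the inputs match up to a shift.
    surface = dft(cross, inverse=True)
    return max(abs(v) for v in surface)


x = [1, 2, 0, -1, 3, 0.5, -2, 1]
shifted = x[3:] + x[:3]  # circular shift of the same signal
print(round(phase_correlation_peak(dft(x), dft(shifted)), 6))  # 1.0
```

A circularly shifted copy of a signal scores 1.0 because the shift changes only the phase of each coefficient; unrelated signals give a lower, more diffuse peak.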

4. Contact

Rashedul Islam, PhD
📧 rashedul.gen@gmail.com

5. Citation

Raju RS and Rashedul I. ProtSEC: Ultrafast Protein Sequence Embedding in Complex Space Using Fast Fourier Transform. (2025).

6. License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

