ProtSEC (Protein Sequence Embedding in Complex Space) is an ultrafast method for embedding protein sequences using the discrete Fourier transform. Unlike large protein language models (PLMs), ProtSEC requires no training on sequence data. It is 20,000× faster and uses 85× less memory than popular models such as esm2_3B, esm2_35M, prot_t5 and prot_bert. ProtSEC is lightweight enough to run on a personal computer or laptop, even when processing large protein sequence datasets.
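To illustrate the general idea of a Fourier-based sequence embedding, here is a minimal sketch. It is not the ProtSEC implementation: the integer encoding `AA_CODE`, the function name `dft_embed`, and the pad-or-truncate behaviour are all assumptions made for illustration only.

```python
import numpy as np

# Hypothetical encoding: each of the 20 standard amino acids gets an integer 1..20.
# ProtSEC's actual numeric encoding may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def dft_embed(seq: str, dim: int = 1024) -> np.ndarray:
    """Return a complex vector of length `dim` for one protein sequence."""
    # Residues outside the 20 standard letters (e.g. X) map to 0 in this sketch.
    signal = np.array([AA_CODE.get(aa, 0) for aa in seq.upper()], dtype=float)
    # np.fft.fft zero-pads or truncates the input to length n=dim.
    return np.fft.fft(signal, n=dim)


emb = dft_embed("MKTAYIAKQR", dim=16)
print(emb.shape, emb.dtype)
```

Because the transform needs no learned parameters, a sketch like this runs without any model weights, which is consistent with the training-free design described above.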
- Python >= 3.10
- Linux or macOS >= 13.5
- Minimum 10GB disk space
- Clone the repository and navigate to the project directory
git clone https://github.com/omics-lab/ProtSEC/
cd ProtSEC/
- Create a virtual environment and activate
python3 -m venv venv
source venv/bin/activate
- Upgrade `pip` and install the required dependencies
- Time: ~2 minutes on a personal computer
pip install --upgrade pip
pip install -r requirements.txt
- Generate complex embedding using a FASTA file
- Time: ~1 second on a personal computer using the example data
- Output file: `protein_embedding_ProtSEC.pkl` (size ~80MB)
python3 protsec.py \
  --fasta_path ./analysis/uniprot_data/uniprot_sprot_5000.fasta

python protsec.py -h
usage: protsec.py [-h] --fasta_path FASTA_PATH [--dim DIM] [--num_threads NUM_THREADS] [--dim_reduct {UMAP,t-SNE,MDS}] [--dist_func {SMS,ASMP,SNN}]
[--out_file OUT_FILE]
Build protein vector database from FASTA file
options:
-h, --help show this help message and exit
--fasta_path FASTA_PATH
Path to input FASTA file (default: None)
--dim DIM Dimensionality of the embeddings (default: 1024)
--num_threads NUM_THREADS
Number of worker threads to use (default: 1/4 of CPU cores) (default: 6)
--dim_reduct {UMAP,t-SNE,MDS}
Algorithm for dimensionality reduction (default: MDS)
--dist_func {SMS,ASMP,SNN}
Distance function for computing distance (default: ASMP)
--out_file OUT_FILE Output file path for the embeddings (default: protein_embedding_ProtSEC.pkl)
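Since the embeddings are written as a `.pkl` file, they can presumably be read back with Python's standard `pickle` module. The sketch below makes no assumption about the object stored inside the file, whose layout is not documented here. The helper name `load_embeddings` is hypothetical, and the demo round-trips a small stand-in dictionary rather than real ProtSEC output.

```python
import pickle
import tempfile
from pathlib import Path


def load_embeddings(path):
    """Load whatever object was pickled at `path`."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Demo with a stand-in object; the real file's structure may differ.
demo = {"example_protein": [1 + 2j, 3 - 1j]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_embedding.pkl"
    with open(demo_path, "wb") as fh:
        pickle.dump(demo, fh)
    loaded = load_embeddings(demo_path)

print(type(loaded).__name__)
```

For the real output, calling `load_embeddings("protein_embedding_ProtSEC.pkl")` and inspecting the type of the result is a reasonable first step to discover its layout.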
- To select the optimum embedding dimension `d`, please check the recommendations here
- Default `-d` is 1024
- Time: ~1 minute for a total of 12,497,500 unique pairs using 8 threads on a personal computer
fasta=./analysis/uniprot_data/uniprot_sprot_5000.fasta
python3 get_phase_dist_mat_optimized.py -d 1024 -i $fasta -o ProtSEC_matrix.csv

python3 get_phase_dist_mat_optimized.py -h
usage: get_phase_dist_mat_optimized.py [-h] --input INPUT --output OUTPUT [--dim DIM] [--processes PROCESSES]
[--method {symmetric}]
Compute pairwise distances of protein sequences from phase correlation (Optimized)
options:
-h, --help show this help message and exit
--input INPUT, -i INPUT
Input FASTA file
--output OUTPUT, -o OUTPUT
Output CSV file path
--dim DIM, -d DIM FFT dimension (default: 1024)
--processes PROCESSES, -p PROCESSES
Number of processes (default: auto)
--method {symmetric}, -m {symmetric}
Optimization method: symmetric (always optimal)
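The figure of 12,497,500 unique pairs is consistent with n(n-1)/2 for n = 5,000 sequences, and a `symmetric` method plausibly computes each unordered pair once and mirrors the result. The sketch below shows that pattern. The distance used here (one minus the peak of the inverse FFT of the normalized cross-power spectrum) is a textbook phase-correlation measure chosen for illustration; it is an assumption, not necessarily the formula in `get_phase_dist_mat_optimized.py`, and the names `phase_distance` and `pairwise_symmetric` are hypothetical.

```python
import math

import numpy as np


def phase_distance(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Illustrative distance from the phase correlation of two complex spectra."""
    cross = a * np.conj(b)
    # Keep only the phase of the cross-power spectrum.
    cross = cross / (np.abs(cross) + eps)
    # The inverse FFT peaks near 1 when one signal is a circular shift of the other.
    peak = np.abs(np.fft.ifft(cross)).max()
    return float(1.0 - peak)


def pairwise_symmetric(spectra: np.ndarray) -> np.ndarray:
    """Fill an n x n matrix by computing each unordered pair once."""
    n = spectra.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = phase_distance(spectra[i], spectra[j])
    return dist


# Number of unique pairs for 5,000 sequences.
n_pairs = math.comb(5000, 2)
print(n_pairs)  # 12497500

# Toy data: row 1 is a circular shift of row 0, so their distance should be near 0.
rng = np.random.default_rng(0)
signals = rng.normal(size=(4, 64))
signals[1] = np.roll(signals[0], 5)
spectra = np.fft.fft(signals, axis=1)
dist = pairwise_symmetric(spectra)
```

Computing only the upper triangle halves the work compared with filling every cell, which is why a symmetric scheme is a natural default for a distance matrix.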
Rashedul Islam, PhD
📧 rashedul.gen@gmail.com
Raju RS and Rashedul I. ProtSEC: Ultrafast Protein Sequence Embedding in Complex Space Using Fast Fourier Transform. (2025).
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

