The primary goal of FEpi is to accurately identify peptidases among multiple predicted protein sequences. Currently, there is a lack of one-stop tools for specifically mining peptidases. We apply a fine-tuned ESM2 model to develop a dedicated peptidase identification model that is broadly applicable for users who need it.
(1) Download the FEpi software from GitHub
git clone https://github.com/Stevenleizheng/FEpi.git
(2) Go to the directory of FEpi, for example:
cd FEpi
(3) Create a new conda environment, for example:
conda create -n fepi python=3.9.18
(4) Enter the conda environment
conda activate fepi
(5) Install the following software
a. pytorch: If you want to use the CPU version, please run conda install pytorch torchvision torchaudio cpuonly -c pytorch.
If you want to use the GPU version, please go to https://pytorch.org/get-started and get the conda or pip install command that matches your device and requirements.
b. fair-esm: pip install fair-esm==2.0.0
c. pandas: pip install pandas==1.4.2
d. biopython: conda install -c bioconda biopython=1.78
e. numpy: conda install numpy=1.26.2
or pip install numpy==1.22.3
f. tqdm: pip install tqdm
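After installing the packages above, a quick import check can confirm that the environment is complete. The following is a minimal sketch and not part of FEpi. It assumes the usual import names for each package (`esm` for fair-esm, `Bio` for biopython), and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names assumed for the packages listed above
# (fair-esm is imported as "esm", biopython as "Bio").
REQUIRED_MODULES = ["torch", "esm", "pandas", "Bio", "numpy", "tqdm"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated `fepi` environment. It only checks that the modules can be located, not that their versions match the ones pinned above.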
(1) Download the model (The working path is still 'FEpi'). The model parameter file is saved at: https://zenodo.org/records/14878564. The file size is 1.6 GB.
wget -c https://zenodo.org/records/14878564/files/model_param.tar.gz
or
wget -c -O model_param.tar.gz 'https://zenodo.org/records/14878564/files/model_param.tar.gz?download=1'
(2) Unpack the file
tar xzvf model_param.tar.gz
Run this command (a test prediction with 103 proteins) to check whether the software is installed correctly on CPU.
python main.py -i Testset/demo.fasta
If the software is installed correctly and completely, this step will finish in less than 10 minutes (it may take longer on older hardware) without any error. The results of the test prediction will be saved in the result folder.
Run this command (a test prediction with 103 proteins) to check whether the software is installed correctly on GPU (e.g. NVIDIA A40).
python main.py -i Testset/demo.fasta -g 0
If the software is installed correctly and completely, this step will finish in less than 3 minutes (it may take longer on older hardware) without any error. The results of the test prediction will be saved in the result folder.
(1) Preparations
Prepare the following before running a prediction:
a. Your proteins in a FASTA file (path: ???.fa).
b. A directory to save the output files (path: ???/).
c. If you want to use GPU(s), the IDs of the GPU(s) to use. On a single-GPU machine, the ID is 0; on a multi-GPU machine using only one GPU, it is x (x is the ID of the GPU used); on a multi-GPU machine using multiple GPUs, it is x1,x2,... (x1,x2,... are the IDs of the GPUs you want to use).
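Before running a prediction, it can help to check the input FASTA file. The sketch below is not part of FEpi and uses only the standard library. It counts the records and reports how many sequences end with `*`, which is the case the `-a` option described later in this document is meant for. The function name `summarize_fasta` and the demo records are hypothetical.

```python
def summarize_fasta(lines):
    """Return (number of records, number of records whose sequence ends with '*')."""
    n_records = 0
    n_star = 0
    last_sequence_line = ""
    in_record = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if in_record and last_sequence_line.endswith("*"):
                n_star += 1
            n_records += 1
            in_record = True
            last_sequence_line = ""
        elif in_record:
            last_sequence_line = line
    # Close the final record.
    if in_record and last_sequence_line.endswith("*"):
        n_star += 1
    return n_records, n_star


if __name__ == "__main__":
    # Hypothetical demo input; for a real file, pass an open file handle instead,
    # e.g. `with open("example.fa") as handle: summarize_fasta(handle)`.
    demo = [">protein_a", "MKV", "LL*", ">protein_b", "MKT"]
    records, with_star = summarize_fasta(demo)
    print(f"{records} records, {with_star} ending with '*'")
```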
(2) Prediction
CPU: python main.py -i ???.fa
single GPU machine: python main.py -i ???.fa -g '0'
multi GPU machine, using one GPU: python main.py -i ???.fa -g 'x'
multi GPU machine, using multi GPUs: python main.py -i ???.fa -g 'x1,x2,...'
-o sets the output directory, -d sets the directory for intermediate process data, and -g sets the IDs of the GPUs you want to use (if -g is not given, the CPU is used).
To change the batch size (default is 32), use -b. Please note that the batch size cannot be negative and should not be smaller than the number of GPUs used.
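The constraints on GPU IDs and batch size can be checked before launching a long run. The following is a minimal sketch of such a check, not FEpi's own argument handling. The helper names are hypothetical, and rejecting a batch size of zero and duplicate GPU IDs are assumptions that go slightly beyond the rule stated above.

```python
def parse_gpu_ids(gpu_arg):
    """Turn a -g style string such as '0' or '0,1,2' into a list of integer IDs."""
    if gpu_arg is None or gpu_arg.strip() == "":
        return []  # no -g given: CPU is used
    ids = [int(part) for part in gpu_arg.split(",")]
    if any(gpu_id < 0 for gpu_id in ids):
        raise ValueError("GPU IDs must be non-negative integers")
    if len(set(ids)) != len(ids):
        raise ValueError("GPU IDs must not repeat")  # assumption, not from FEpi
    return ids


def check_batch_size(batch_size, gpu_ids):
    """Raise ValueError if the batch size violates the rules described above."""
    if batch_size <= 0:
        # The text only rules out negative values; zero is rejected here as an assumption.
        raise ValueError("batch size must be a positive integer")
    if gpu_ids and batch_size < len(gpu_ids):
        raise ValueError("batch size should not be smaller than the number of GPUs")
    return batch_size


if __name__ == "__main__":
    gpus = parse_gpu_ids("0,1,2,3,4,5,6,7")
    print(check_batch_size(16, gpus), "is a valid batch size for", len(gpus), "GPUs")
```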
To change the threshold of the binary classification task (default is 0.5), use -t with a value between 0 and 1.
Example commands:
Predict proteins in 'example.fa' and save the results to 'result/' with a batch size of 64 (the eight-GPU example uses a batch size of 16). The intermediate process data is saved in the 'data/' directory.
CPU: python main.py -i example.fa -b 64 -d data/ -o result/
single GPU machine: python main.py -i example.fa -g '0' -b 64 -d data/ -o result/
multi GPU machine, using one GPU (ID:2): python main.py -i example.fa -g '2' -b 64 -d data/ -o result/
multi GPU machine, using eight GPUs (ID:0-7): python main.py -i example.fa -g '0,1,2,3,4,5,6,7' -b 16 -d data/ -o result/
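When predictions are launched from a script, for example over many FASTA files, the command line can be assembled programmatically. The sketch below builds the argument list from the options documented in this README (`-i`, `-g`, `-b`, `-d`, `-o`, `-t`, `-a`). The function name `build_fepi_command` and its keyword names are hypothetical, and it assumes `python` on the PATH is the interpreter of the `fepi` environment.

```python
def build_fepi_command(fasta, gpus=None, batch_size=None, data_dir=None,
                       out_dir=None, threshold=None, asterisk=None):
    """Assemble the argument list for one FEpi run; options left as None are omitted."""
    cmd = ["python", "main.py", "-i", fasta]
    optional = [
        ("-g", gpus),        # GPU IDs, e.g. "0" or "0,1"; omit to use the CPU
        ("-b", batch_size),  # batch size
        ("-d", data_dir),    # directory for intermediate process data
        ("-o", out_dir),     # output directory
        ("-t", threshold),   # binary classification threshold
        ("-a", asterisk),    # set to 1 for sequences ending in '*'
    ]
    for flag, value in optional:
        if value is not None:
            cmd.extend([flag, str(value)])
    return cmd


if __name__ == "__main__":
    command = build_fepi_command("example.fa", gpus="2", batch_size=64,
                                 data_dir="data/", out_dir="result/")
    print(" ".join(command))
    # To actually run it from the FEpi directory:
    #   import subprocess
    #   subprocess.run(command, check=True)
```

Passing the list directly to `subprocess.run` avoids shell quoting issues with GPU ID strings such as `0,1,2`.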
The descriptions for the result files are in the 'binary_result.txt' file of the output directory.
Additionally, our tool provides two further parameters. The first is -a: set it to 1 (default is 0) when processing FASTA files whose sequences end with *.
The second is -t, which controls the threshold of the binary classification model (ranging from 0 to 1, default is 0.5). A higher value increases the confidence in the selected enzymes, while a lower value allows more enzymes to be detected but may also result in a higher rate of false positives.
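The effect of the threshold can be illustrated with a small example. This is a sketch only: the scores are made-up values, the function name `select_by_threshold` is hypothetical, and treating a score equal to the threshold as positive is an assumption rather than FEpi's documented behavior.

```python
def select_by_threshold(scores, threshold=0.5):
    """Return the IDs whose predicted probability reaches the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption: a score equal to the threshold counts as positive.
    return [name for name, score in scores.items() if score >= threshold]


if __name__ == "__main__":
    # Made-up probabilities for three hypothetical proteins.
    demo_scores = {"protein_a": 0.95, "protein_b": 0.62, "protein_c": 0.40}
    for t in (0.3, 0.5, 0.9):
        print(t, select_by_threshold(demo_scores, t))
```

Raising the threshold from 0.5 to 0.9 keeps only the highest-scoring protein, while lowering it to 0.3 also admits the low-scoring one.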