MiPepid is a software specifically for predicting the coding capabilities of sORFs.
The corresponding paper "MiPepid: Micropeptide identification tool using machine learning"[*] is now submitted to BMC Bioinformatics.
sORFs / smORFs are short open reading frames with length <= 303 bp (including the stop codon), and if translated, they encode micropeptides that are <= 100 amino acids.
Micropeptides were traditionally ignored due to their size but are now gaining increasing attentions because they have been shown to play critical roles in many vital biological activities.
Given a fasta file containing DNA fasta sequences, for each sequence, MiPepid will find all the sORFs (length <= 303 bp) present in all the 3 translation frames of the sequence, and for each sORF it will return the predicted class label (coding or noncoding) as well as the probability of being in that class. All the results will be written in an output .csv file.
Language dependency:
Python 3 (Please do not use Python 2 to run the code.)
Library dependency:
NumpypandasBio(the Biopython package: https://biopython.org)
cd MiPepid
python3 ./src/mipepid.py input_fasta_file_path output_fasta_file_pathNote:
- The
input_fasta_file_pathis the path of your input fasta file.- Please make sure it is a fasta file with only DNA sequences (not RNA sequences or protein sequences).
- Please make sure that each record in your fasta file has a unique ID, as the IDs will be used to name the found ORFs.
- Please also make sure that your sequences only contain the four letters:
A,T,C,G. Other DNA letters, such asN,R,Y, etc. are currently not supported.
- The
output_fasta_file_pathis the path of the output.csvfile.- It is optional.
- If you choose to specify it, it is suggested that you name this output file with
.csvextension. - If you do not specify the output file, a file with name
Mipepid_results.csvwill be automatically created under the same./MiPepiddirectory as the output file.
The output .csv file contains the following columns:
sORF_ID: the ID of an sORF.sORF_seq: the DNA sequence of the sORF (containing the stop codon with the start codon asATG).transcript_DNA_sequence_ID: the ID of the original DNA sequence in your input fasta file where this sORF is extracted.start_at: the 1-based starting position of this sORF on the original DNA sequence.end_at: the 1-based ending position of this sORF on the original DNA sequence.classification: the predicted class of the sORF, eithercodingornoncoding.probability: the probability that this sORF is in the assigned class.
There is a sample DNA sequence file sample_seqs.fasta under the directory ./demo_files/. You can try to run MiPepid on this file:
cd MiPepid
python3 ./src/mipepid.py ./demo_files/sample_seqs.fasta ./demo_files/MiPepid_results_on_sample_seqs.csvThis will output a file MiPepid_results_on_sample_seqs.csv under the same directory (./demo_files/).
datasets.tar.gz contains all major datasets used in the paper.
[*]: Mengmeng Zhu, Michael Gribskov. MiPepid: Micropeptide identification tool using machine learning. BMC Bioinformatics 20, 559 (2019) doi:10.1186/s12859-019-3033-9