This document describes the full workflow for building a function that predicts the subcellular location of user-supplied protein sequences.
Download repository:
First navigate to the directory where you want the repository to be saved:
git clone https://github.com/Theofili/Subcellular-Localization-Prediction
Create a conda environment:
conda create --name SLP python==3.12.1
conda activate SLP
cd Subcellular-Localization-Prediction
Install required packages:
pip install -r requirements.txt
Create folders:
cd data
mkdir fastas proteins splits model_data
cd ..
python data/data_functions/average.py
When executed, this program:
- Collects human protein sequences from the Entrez-protein database.
- Calculates the lengths of the sequences.
- Prints statistics such as the mean, standard deviation, and distribution of sequence lengths.
- Prints the lower (5%) and upper (95%) bounds for sequence lengths.
These numbers are used to filter out extremely short or long protein sequences, because they are often biological artifacts (fragments, chimeras, or annotation errors) and don’t represent typical proteins. From a computational side, very short or long sequences create inefficiencies in batching, increase memory cost, and can destabilize training. By restricting lengths to a reasonable range, the model focuses on biologically meaningful proteins and trains more efficiently.
The length statistics are calculated over human proteins in general, not only over the proteins used for training. If the training data were used instead, the upper bound would be higher, because extracellular protein sequences are generally longer. Which of the two sets of bounds should be used is still an open question.
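As a rough illustration of the statistics described above, the following minimal sketch computes the mean, standard deviation, and 5%/95% length bounds for a list of sequences, then filters by those bounds. It is not the code in `average.py`: the function names `length_bounds` and `filter_by_length` are hypothetical, and the toy input stands in for sequences fetched from Entrez.

```python
import statistics


def length_bounds(sequences):
    """Return mean, sd, and 5%/95% percentile bounds of sequence lengths."""
    lengths = [len(seq) for seq in sequences]
    # quantiles(n=20) yields 19 cut points; the first is the 5th percentile
    # and the last is the 95th percentile.
    cuts = statistics.quantiles(lengths, n=20, method="inclusive")
    return {
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths),
        "lower": cuts[0],
        "upper": cuts[-1],
    }


def filter_by_length(sequences, lower, upper):
    """Keep only sequences whose length lies within [lower, upper]."""
    return [seq for seq in sequences if lower <= len(seq) <= upper]


# Toy data (assumption): sequences of lengths 1..101 instead of real proteins.
toy_sequences = ["A" * n for n in range(1, 102)]
bounds = length_bounds(toy_sequences)
kept = filter_by_length(toy_sequences, bounds["lower"], bounds["upper"])
print(bounds["lower"], bounds["upper"], len(kept))
```

Percentile bounds are used here rather than mean plus or minus a multiple of the standard deviation, because length distributions tend to be skewed by a few very long sequences.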
python data/data_functions/model_dataframes.py
When executed, this program:
- Creates dataframes of proteins pulled from Entrez whose titles contain a particular word, in this case a cell location (nuclear, membrane, mitochondrial, etc.).
- Removes duplicates and filters sequences by the length bounds computed by the previous program.
- Creates 50:50 binary classification dataframes for each cell location, with sample sizes adjusted to the number of target sequences.
- Saves the dataframes in the `data/model_data` folder (naming: `model_{cell_location}_data.csv`).
Each cell location contains a different number of sequences, so the total number of sequences differs between dataframes. Each dataframe contains a `sequence` column and a `type` column. When `type==1`, the sequence is found in the location named in the dataframe's title; when `type==0`, the sequence is found in any other location in the cell.
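The 50:50 labeling scheme can be sketched as follows. This is a dependency-free illustration, not the code in `model_dataframes.py` (which works with dataframes): the helper names `balanced_rows` and `rows_to_csv` are hypothetical, and the short strings are placeholders rather than real proteins.

```python
import csv
import io
import random


def balanced_rows(target_seqs, other_seqs, seed=0):
    """Build a 50:50 list of (sequence, type) rows; type is 1 for targets."""
    targets = list(dict.fromkeys(target_seqs))  # drop duplicates, keep order
    target_set = set(targets)
    others = [s for s in dict.fromkeys(other_seqs) if s not in target_set]
    # Sample size follows the number of target sequences (capped by supply).
    n = min(len(targets), len(others))
    rng = random.Random(seed)
    rows = [(s, 1) for s in rng.sample(targets, n)]
    rows += [(s, 0) for s in rng.sample(others, n)]
    rng.shuffle(rows)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text with `sequence` and `type` columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sequence", "type"])
    writer.writerows(rows)
    return buffer.getvalue()


# Toy sequences (assumption): placeholders, not real proteins.
nuclear = ["MKKA", "MKKA", "MPLR", "MSTV"]
elsewhere = ["MAAA", "MCCC", "MDDD", "MEEE", "MFFF"]
rows = balanced_rows(nuclear, elsewhere)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Three unique target sequences yield six rows in total: three with `type` 1 and three with `type` 0.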
This step does not use user input; it collects data from 9 cellular locations (Extracellular, Golgi, Lysosome, Membrane, Mitochondria, Nuclear, Peroxisome, Reticulum, Ribosome).
NOTE: When running 1.1 and 1.2, you might come across `HTTP Error 500: Internal Server Error` due to network problems or busy NCBI servers. If so, try again in a few minutes.
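If retrying by hand becomes tedious, a small wrapper can do it automatically. The sketch below assumes the failure surfaces as `urllib.error.HTTPError`; the wrapper name `fetch_with_retry` and the stand-in `flaky_fetch` are hypothetical and are not part of the repository.

```python
import time
import urllib.error


def fetch_with_retry(fetch, attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call `fetch()` and retry on HTTP 5xx errors, waiting between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as error:
            # Only server-side errors are worth retrying; re-raise the rest,
            # and give up once the last attempt has failed.
            if error.code < 500 or attempt == attempts:
                raise
            sleep(delay_seconds)


# Stand-in for an Entrez request (assumption): fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise urllib.error.HTTPError(
            "https://example.invalid", 500, "Internal Server Error", None, None
        )
    return "FASTA data"


waits = []  # record the requested waits instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, calls["count"], waits)
```

In real use, the `sleep` argument would be left at its default so the script actually pauses between attempts.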
All created dataframes are saved as CSV files, along with the FASTA files downloaded from Entrez.
python models/Model_Nuclear.py --acc
This argument determines whether bfloat16 is enabled; use `True` or `False` (default).
bfloat16 is used to speed up training and reduce memory usage, but is not compatible with all computers.
If you get the error `your setup doesn't support bf16/gpu`, or an accelerator error, set the argument to `False`.
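One way such a flag could be parsed is sketched below. This is an assumption about how a `--acc` option might be wired up with `argparse`, not the parsing code used in the model scripts; `str_to_bool` and `build_parser` are hypothetical names.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'False' (case-insensitive) into a real boolean."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Train a localization model")
    # nargs="?" with const=True lets a bare `--acc` mean True, while
    # `--acc False` or omitting the flag keeps bf16 disabled.
    parser.add_argument(
        "--acc", type=str_to_bool, nargs="?", const=True, default=False,
        help="enable bfloat16 training (True or False)",
    )
    return parser


parser = build_parser()
print(parser.parse_args(["--acc", "True"]).acc)
print(parser.parse_args(["--acc", "False"]).acc)
print(parser.parse_args([]).acc)
```

A custom converter is used because `type=bool` would treat the string `"False"` as true: any non-empty string is truthy in Python.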
Repeat this command for each cellular location. Model names are (Reticulum, Extracellular, Golgi, Membrane, Mitochondria, Nuclear, Peroxisome, Ribosome).
When executed, each of these programs:
- Initializes training parameters.
- Splits the corresponding dataframe into training (80%), validation (10%), and testing (10%) sets, saved as CSV files.
- Trains the `RaphaelMourad/Mistral-Peptide-v1-15M` model on the datasets.
- Saves results each epoch.
- Saves the model in its own folder.
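The 80/10/10 split can be illustrated with a short sketch. The function name `split_rows`, the fixed seed, and the toy rows are assumptions for illustration; the model scripts may split the data differently (for example, with stratification).

```python
import random


def split_rows(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever remains after train and validation becomes the test set.
    test = shuffled[n_train + n_val:]
    return train, val, test


# Toy rows (assumption): 100 placeholder (sequence, type) pairs.
toy_rows = [(f"SEQ{i}", i % 2) for i in range(100)]
train, val, test = split_rows(toy_rows)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so the test set stays the same across training runs.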
Each program needs to be run individually, and because the training datasets differ in size, some models perform better than others.
After filtering:
Type | Sequences Count |
---|---|
Extracellular | 160 |
Golgi | 456 |
Mitochondrial | 2749 |
Membrane | 3698 |
Nuclear | 2561 |
Peroxisome | 236 |
Reticulum | 431 |
Ribosome | 164 |
Most important parameters:
- Learning rate: 2e-5
- Number of epochs: 20
- Max length: 1000
- Early stopping patience: 3
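To show how an early stopping patience of 3 interacts with the 20-epoch limit, here is a minimal sketch of the stopping rule. It assumes that validation loss is the monitored metric, which the list above does not state, and the scripts may rely on a library callback rather than a hand-written loop; `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            # Stop once `patience` consecutive epochs failed to improve.
            if epochs_without_improvement >= patience:
                return epoch
    # Patience never ran out, so training used every available epoch.
    return len(val_losses)


# Made-up validation losses (assumption): best value is reached at epoch 3.
toy_losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.55]
print(stopping_epoch(toy_losses))
```

With these toy losses, training stops after epoch 6, so the later improvement at epoch 7 is never seen. This is the trade-off a small patience value makes.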
Some other parameters (batch_size, number of epochs, bf16) can be changed so that the model trains faster on better-performing hardware.
The Lysosome dataframe did not contain enough sequences, so no Lysosome model was trained.
python models/Localization.py
When executed, this program:
- Puts a user-supplied sequence through all models.
- Prints the model that yields the highest score.
- Prints all model scores.
- Prints `'Not Matched'` if no model reaches a score above 0.5.
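The decision rule described in this list can be sketched in a few lines. The function name `predict_location` is hypothetical and the code is not taken from `Localization.py`; the score dictionaries are copied from the example executions in this document.

```python
def predict_location(scores, threshold=0.5):
    """Return the best-scoring label, or 'Not Matched' below the threshold."""
    best_label = max(scores, key=scores.get)
    if scores[best_label] > threshold:
        return best_label
    return "Not Matched"


# Scores copied from the two example executions in this document.
first_scores = {
    "extracellular": 0.410167396068573,
    "golgi": 0.38762950897216797,
    "membrane": 0.3010360598564148,
    "mitochondria": 0.6791390776634216,
    "nuclear": 0.2867254316806793,
    "ribosome": 0.44316619634628296,
}
second_scores = {
    "extracellular": 0.31158125400543213,
    "golgi": 0.3205016851425171,
    "membrane": 0.1591329723596573,
    "mitochondria": 0.3411959707736969,
    "nuclear": 0.42463913559913635,
    "ribosome": 0.3570682108402252,
}
print(predict_location(first_scores))
print(predict_location(second_scores))
```

Because each model is an independent binary classifier, the scores do not sum to 1, so the 0.5 threshold is applied to each score separately.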
## Example 2.2 execution
# Models were trained on very little data, for output example purposes.
# Peroxisome and Reticulum models were not trained or used in the localization function.
(SLP) ~\Subcellular-Localization-Prediction>python models/Localization.py
Please provide protein sequence: MAALRRLLWPPPRVSPPLCAHQPLLGPWGRPAVTTLGLPGRPFSSREDEERAVAEAAWRRRRRWGELSVAAAAGGGLVGLVCYQLYGDPRAGSPATGRPSKSAATEPEDPPRGRGMLPIPVAAAKETVAIGRTDIEDLDLYATSRERRFRLFASIECEGQLFMTPYDFILAVTTDEPKVAKTWKSLSKQELNQMLAETPPVWKGSSKLFRNLKEKEPHAGFRIAFNMFDTDGNEMVDKKEFLVLQEIFRKKNEKREIKGDEEKRAMLRLQLYGYHSPTNSVLKTDAEELVSRSYWDTLRRNTSQALFSDLAERADDITSLVTDTTLLVHFFGKKGKAELNFEDFYRFMDNLQTEVLEIEFLSYSNGMNTISEEDFAHILLRYTNVENTSVFLENVRYSIPEEKGITFDEFRSFFQFLNNLEDFAIALNMYNFASRSIGQDEFKRAVYVATGLKFSPHLVNTVFKIFDVDKDDQLSYKEFIGIMKDRLHRGFRGYKTVQKYPTFKSCLKKELHSR
Predicted type: mitochondria
All model scores: {'extracellular': 0.410167396068573, 'golgi': 0.38762950897216797, 'membrane': 0.3010360598564148, 'mitochondria': 0.6791390776634216, 'nuclear': 0.2867254316806793, 'ribosome': 0.44316619634628296}
(SLP) ~\Subcellular-Localization-Prediction>python models/Localization.py
Please provide protein sequence: GSHMESADLRALAKHLYDSYIKSFPLTKAKARAILTGKTTDKSPFVIYDMNSLMMGEDKIKFKHITPLQEQSKEVAIRIFQGCQFRSVEAVQEITEYAKSIPGFVNLDLNDQVTLLKYGVHEIIYTMLASLMNKDGVLISEGQGFMTREFLKSLRKPFGDFMEPKFEFAVKFNALELDDSDLAIFIAVIILSGDRPGLLNVKPIEDIQDNLLQALELQLKLNHPESSQLFAKLLQKMTDLRQIVTEHVQLLQVIKKTETDMSLHPLLQEIYKDLY
Predicted type: Not Matched
All model scores: {'extracellular': 0.31158125400543213, 'golgi': 0.3205016851425171, 'membrane': 0.1591329723596573, 'mitochondria': 0.3411959707736969, 'nuclear': 0.42463913559913635, 'ribosome': 0.3570682108402252}