Subcellular Localization Prediction Using Pre-trained Mistral Models

This document describes the full workflow for building a function that predicts the subcellular location of user-provided protein sequences.

0. Installations

Download repository:

First, navigate to the directory where you want the repository to be saved:

git clone https://github.com/Theofili/Subcellular-Localization-Prediction

Create a conda environment:

conda create --name SLP python==3.12.1
conda activate SLP
cd Subcellular-Localization-Prediction

Install required packages:

pip install -r requirements.txt

Create folders:

cd data
mkdir fastas proteins splits model_data
cd ..

1. Collecting Data

From the repository root, run the scripts in the `data/data_functions` folder:

1.1 `average.py`

python data/data_functions/average.py

When executed, this program:

Collects human protein sequences from the Entrez protein database.
Calculates the lengths of the sequences.
Prints statistics such as the mean, standard deviation, and distribution of sequence lengths.
Prints the lower (5%) and upper (95%) bounds for sequence lengths.

These bounds are used to filter out extremely short or long protein sequences, which are often artifacts (fragments, chimeras, or annotation errors) and do not represent typical proteins. Computationally, very short or very long sequences make batching inefficient, increase memory cost, and can destabilize training. Restricting lengths to a reasonable range lets the model focus on biologically meaningful proteins and train more efficiently.
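As a rough illustration of the length statistics and filtering described above, the sketch below summarizes a list of sequences and keeps only those inside the bounds. It is not the code in `average.py`: the helper names `length_bounds` and `filter_by_length` are hypothetical, and treating the 5% and 95% bounds as the 5th and 95th percentiles of the length distribution is an assumption.

```python
import statistics


def length_bounds(sequences, lower_pct=5, upper_pct=95):
    """Summarize sequence lengths (hypothetical helper, not from the repo)."""
    lengths = [len(seq) for seq in sequences]
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths),
        "lower": cuts[lower_pct - 1],
        "upper": cuts[upper_pct - 1],
    }


def filter_by_length(sequences, lower, upper):
    """Keep only sequences whose length falls inside [lower, upper]."""
    return [seq for seq in sequences if lower <= len(seq) <= upper]


if __name__ == "__main__":
    toy = ["M" * n for n in range(1, 102)]  # toy sequences of length 1..101
    stats = length_bounds(toy)
    kept = filter_by_length(toy, stats["lower"], stats["upper"])
    print(stats, len(kept))
```

The bounds printed by such a summary could then be passed to the filtering step of the next script.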

The length statistics are calculated on human proteins in general, not only on the proteins used for training. If the training data were used instead, the upper bound would be higher, because extracellular protein sequences are generally longer. ~~Don't know which bounds should be used.~~

1.2 `model_dataframes.py`

python data/data_functions/model_dataframes.py

When executed, this program:

Creates dataframes with proteins pulled from Entrez that have a particular word in their title. Here the words are cell locations (nuclear, membrane, mitochondrial, etc.).
Removes duplicates and filters sequences by length, based on the length statistics computed by the previous program.
Creates 50:50 binary classification dataframes for each cell location, adjusting the sample size to the number of target sequences.
Saves the dataframes in the `data/model_data` folder, named `model_{cell_location}_data.csv`.

Each cell location has a different number of sequences, so the total number of sequences differs between dataframes. Each dataframe contains a sequence column and a type column. When type==1, the sequence belongs to the location named in the dataframe's title; when type==0, the sequence belongs to any other location in the cell.
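The balanced labelling described above can be sketched as follows. This is only an illustration in plain Python rather than the repository's dataframe code: `build_binary_rows`, its arguments, and the fixed random seed are all hypothetical, and the input is assumed to be (sequence, location) pairs.

```python
import random


def build_binary_rows(records, target, lower, upper, seed=0):
    """Build balanced rows for one location (hypothetical helper).

    records: iterable of (sequence, location) pairs.
    Returns a list of {"sequence": ..., "type": 0 or 1} dicts.
    """
    seen = set()
    positives, negatives = [], []
    for sequence, location in records:
        # Drop duplicates and sequences outside the length bounds.
        if sequence in seen or not (lower <= len(sequence) <= upper):
            continue
        seen.add(sequence)
        (positives if location == target else negatives).append(sequence)

    # 50:50 balance: sample as many negatives as there are positives.
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    rows = [{"sequence": s, "type": 1} for s in rng.sample(positives, n)]
    rows += [{"sequence": s, "type": 0} for s in rng.sample(negatives, n)]
    rng.shuffle(rows)
    return rows
```

Because the negative class is sampled down to the size of the positive class, locations with few sequences produce small dataframes.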

This program does not take user input. It collects data for 9 cellular locations (Extracellular, Golgi, Lysosome, Membrane, Mitochondria, Nuclear, Peroxisome, Reticulum, Ribosome).

NOTE: When running 1.1 and 1.2, you might come across `HTTP Error 500: Internal Server Error` due to network problems or busy NCBI servers. Try again in a few minutes.
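If retrying by hand becomes tedious, a small wrapper can do it automatically. The sketch below is an assumption-laden example, not part of the repository: `fetch_with_retry` is a hypothetical helper, and it assumes the download call raises `urllib.error.HTTPError`, which is the exception whose message reads "HTTP Error 500: Internal Server Error".

```python
import time
import urllib.error


def fetch_with_retry(fetch, retries=3, delay=60.0):
    """Call fetch() and retry on HTTP 5xx errors (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Give up on client errors or when attempts are exhausted.
            if err.code < 500 or attempt == retries:
                raise
            time.sleep(delay * attempt)  # wait longer after each failure
```

Here `fetch` would be any zero-argument callable that performs a single Entrez request.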

All dataframes are saved as CSV files, and the FASTA files downloaded from Entrez are saved as well.

2. Fine-tuning the models

In the `models` folder run:

2.1 `Model_{cellular_location}.py`

python models/Model_Nuclear.py --acc

--acc (True, False)

This argument determines whether bfloat16 is enabled. Use True or False (default).

bfloat16 (bf16) speeds up training and reduces memory usage, but it is not supported on all hardware.

If you get an error saying that your setup doesn't support bf16/GPU, or an accelerator error, set the argument to False.
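For readers curious how a True/False flag like `--acc` could be handled, here is a minimal sketch using `argparse`. It does not reproduce the repository's scripts: `str_to_bool`, `bf16_supported`, `parse_args`, and the `use_bf16` attribute are hypothetical, and falling back to full precision when bf16 is unavailable is an assumed design choice.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'False' style strings from the command line."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def bf16_supported():
    """Best-effort check; False when PyTorch or a capable GPU is missing."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available() and torch.cuda.is_bf16_supported())


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # nargs="?" with const=True lets a bare `--acc` mean True.
    parser.add_argument("--acc", type=str_to_bool, nargs="?",
                        const=True, default=False)
    args = parser.parse_args(argv)
    # Fall back to full precision instead of failing later in training.
    args.use_bf16 = args.acc and bf16_supported()
    return args
```

With this sketch, `--acc`, `--acc True`, and `--acc False` are all accepted, and omitting the flag leaves bf16 disabled.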

Repeat this command for each cellular location. Model names are (Reticulum, Extracellular, Golgi, Membrane, Mitochondria, Nuclear, Peroxisome, Ribosome).

When executed, these programs:

Initialize the training parameters.
Split the corresponding dataframe into training (80%), validation (10%), and testing (10%) sets, which are saved as CSV files.
Train the RaphaelMourad/Mistral-Peptide-v1-15M model on the datasets.
Save results after each epoch.
Save the model in its own folder.
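The 80/10/10 split mentioned in the steps above can be sketched as below. This is a plain-Python illustration under stated assumptions: `split_rows` is a hypothetical helper, the seed is arbitrary, and the split is a simple shuffle that does not stratify by the type column, which may differ from what the training scripts do.

```python
import random


def split_rows(rows, train=0.8, val=0.1, seed=42):
    """Shuffle rows and split them into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder goes to the test set
    )
```

For example, 100 rows would yield 80 training, 10 validation, and 10 test rows.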

Each program needs to be run individually and, because the training datasets differ in size, some models perform better than others.

After filtering:

| Type | Sequence count |
|---|---|
| Extracellular | 160 |
| Golgi | 456 |
| Mitochondrial | 2749 |
| Membrane | 3698 |
| Nuclear | 2561 |
| Peroxisome | 236 |
| Reticulum | 431 |
| Ribosome | 164 |

Most important parameters:

Learning rate: 2e-5
Number of epochs: 20
Max length: 1000
Early stopping patience: 3
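To make the role of the patience value concrete, the sketch below gathers the listed parameters in a small config object and shows one common way early stopping is decided. `TrainingConfig` and `stopping_epoch` are hypothetical names, and monitoring validation loss is an assumption; the training scripts may track a different metric or rely on a library callback with slightly different rules.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # Values taken from the parameter list above.
    learning_rate: float = 2e-5
    num_epochs: int = 20
    max_length: int = 1000
    early_stopping_patience: int = 3


def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch where training stops, or None if it never does."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

With a patience of 3, training stops once three consecutive epochs fail to improve on the best validation loss seen so far.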

Some other parameters (batch_size, number of epochs, bf16) can be changed so that the model trains faster on better-performing hardware.

The Lysosome dataframe did not contain enough sequences, so there is no Lysosome model.

2.2 `Localization.py`

python models/Localization.py

The user is asked to provide a protein sequence for classification.

When executed, this program:

Runs the user-provided sequence through all models.
Prints the model that yields the highest score.
Prints all model scores.
Prints 'Not Matched' if no model reaches a score above 0.5.
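The decision rule in the steps above (highest score wins, unless nothing exceeds 0.5) can be sketched in a few lines. `pick_location` is a hypothetical helper name and the dictionary of scores is assumed to map model names to values between 0 and 1; the actual code in `Localization.py` may be organized differently.

```python
def pick_location(scores, threshold=0.5):
    """Return the best-scoring location, or 'Not Matched' if none exceeds the threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "Not Matched"


if __name__ == "__main__":
    example = {"golgi": 0.39, "membrane": 0.30, "mitochondria": 0.68}
    print("Predicted type:", pick_location(example))
```

Each model is a separate binary classifier, so the scores are independent and need not sum to 1.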

Output example:

## Example 2.2 execution
# Models were trained on very little data, for output example purposes only.
# Peroxisome and Reticulum models were not trained or used in the localization function.

(SLP) ~\Subcellular-Localization-Prediction>python models/Localization.py
Please provide protein sequence: MAALRRLLWPPPRVSPPLCAHQPLLGPWGRPAVTTLGLPGRPFSSREDEERAVAEAAWRRRRRWGELSVAAAAGGGLVGLVCYQLYGDPRAGSPATGRPSKSAATEPEDPPRGRGMLPIPVAAAKETVAIGRTDIEDLDLYATSRERRFRLFASIECEGQLFMTPYDFILAVTTDEPKVAKTWKSLSKQELNQMLAETPPVWKGSSKLFRNLKEKEPHAGFRIAFNMFDTDGNEMVDKKEFLVLQEIFRKKNEKREIKGDEEKRAMLRLQLYGYHSPTNSVLKTDAEELVSRSYWDTLRRNTSQALFSDLAERADDITSLVTDTTLLVHFFGKKGKAELNFEDFYRFMDNLQTEVLEIEFLSYSNGMNTISEEDFAHILLRYTNVENTSVFLENVRYSIPEEKGITFDEFRSFFQFLNNLEDFAIALNMYNFASRSIGQDEFKRAVYVATGLKFSPHLVNTVFKIFDVDKDDQLSYKEFIGIMKDRLHRGFRGYKTVQKYPTFKSCLKKELHSR

Predicted type: mitochondria
All model scores: {'extracellular': 0.410167396068573, 'golgi': 0.38762950897216797, 'membrane': 0.3010360598564148, 'mitochondria': 0.6791390776634216, 'nuclear': 0.2867254316806793, 'ribosome': 0.44316619634628296}

(SLP) ~\Subcellular-Localization-Prediction>python models/Localization.py
Please provide protein sequence: GSHMESADLRALAKHLYDSYIKSFPLTKAKARAILTGKTTDKSPFVIYDMNSLMMGEDKIKFKHITPLQEQSKEVAIRIFQGCQFRSVEAVQEITEYAKSIPGFVNLDLNDQVTLLKYGVHEIIYTMLASLMNKDGVLISEGQGFMTREFLKSLRKPFGDFMEPKFEFAVKFNALELDDSDLAIFIAVIILSGDRPGLLNVKPIEDIQDNLLQALELQLKLNHPESSQLFAKLLQKMTDLRQIVTEHVQLLQVIKKTETDMSLHPLLQEIYKDLY

Predicted type: Not Matched
All model scores: {'extracellular': 0.31158125400543213, 'golgi': 0.3205016851425171, 'membrane': 0.1591329723596573, 'mitochondria': 0.3411959707736969, 'nuclear': 0.42463913559913635, 'ribosome': 0.3570682108402252}
