Prediction of the effect of a mutation in an amino acid sequence

Generate the Data files

Main datasets

First, download the folder 'datasets' contained in https://drive.google.com/drive/folders/19PUzDqEg_kl-8nWG_Z7C7NDC9BaZTwPw?usp=sharing and put it in the cloned repo.

N.B.: You can skip all of the data generation steps below by downloading the folder 'data' contained in the drive https://drive.google.com/drive/folders/19PUzDqEg_kl-8nWG_Z7C7NDC9BaZTwPw?usp=sharing

Due to limited computational power, we use the train set provided by the lab to create our own train, validation and test sets. We create one small set composed of sequences with a maximum length of 215 amino acids and a bigger one with sequences with a maximum length of 600 amino acids. To create those datasets, make sure that the provided files 'mut_effect_train.csv' and 'uniprot_fasta_sequences.csv' are in the 'datasets' folder, and run the following:

mkdir data

python data_creation/create_data_sets.py
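The idea of filtering by maximum length and then splitting can be sketched as follows. This is only an illustration, not the code in create_data_sets.py: the record layout, the split fractions, the random seed and the function name split_by_length are all assumptions.

```python
import random


def split_by_length(records, max_len, val_frac=0.1, test_frac=0.1, seed=0):
    """Keep sequences of at most max_len residues, then shuffle and split.

    The fractions and the seed are illustrative assumptions, not values
    taken from the repository.
    """
    kept = [r for r in records if len(r["sequence"]) <= max_len]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_val = int(len(kept) * val_frac)
    n_test = int(len(kept) * test_frac)
    val = kept[:n_val]
    test = kept[n_val:n_val + n_test]
    train = kept[n_val + n_test:]
    return train, val, test


# Toy records standing in for the rows of the lab-provided train set.
records = [{"sequence": "A" * n, "label": n % 2} for n in range(1, 1001)]
small = split_by_length(records, max_len=215)  # the "215" datasets
large = split_by_length(records, max_len=600)  # the "600" datasets
```

Fixing the seed keeps the three sets identical between runs, which matters when the ProtBert features are generated separately from the splits.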

Cut datasets

To analyse in more depth how our models perform, we create new cut datasets from the 215 ones. We cut every sequence around the mutation and keep X amino acids (X = 5, 10, 20 or 50, matching the cut_5, cut_10, cut_20 and cut_50 folders used below). To recreate this data, run the following:

mkdir data/cut

python data_creation/create_cut_data_sets.py
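Cutting a sequence around the mutation could look like the sketch below. It assumes a 0-based mutation position and keeps a fixed number of residues on each side; whether create_cut_data_sets.py counts X per side or in total is not stated here, so the function cut_around_mutation and its flank parameter are hypothetical.

```python
def cut_around_mutation(sequence, position, flank):
    """Return the window around a mutation and the mutation's new index.

    position is assumed to be 0-based; flank is the number of residues
    kept on each side (an assumption about what "X" means).
    """
    if not 0 <= position < len(sequence):
        raise ValueError("mutation position is outside the sequence")
    start = max(0, position - flank)                   # clip at the N-terminus
    end = min(len(sequence), position + flank + 1)     # clip at the C-terminus
    return sequence[start:end], position - start


window, new_position = cut_around_mutation("ABCDEFGHIJ", position=4, flank=2)
```

Returning the shifted position alongside the window keeps the mutation locatable in the shortened sequence, including when the window is clipped at either end.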

Generate sequence features using ProtBert

N.B.: Generating sequence features using ProtBert is a very long and expensive process. We invite you to download the necessary features directly by downloading the 'data' folder contained in https://drive.google.com/drive/folders/19PUzDqEg_kl-8nWG_Z7C7NDC9BaZTwPw?usp=sharing and putting it in the cloned repo.

Generating sequence features using the ProtBert from Amazon takes a long time, so we decided to create all the features needed in one go and store them in csv files.

Due to limited computational power, we had to save the features every 10 sequences and couldn't create just one big file. To recreate those files, follow the steps below.

If the following folders do not already exist:

mkdir data/data_bert_600 data/data_bert_215 data/cut/cut_bert

mkdir data/data_bert_600/train data/data_bert_600/val data/data_bert_215/train data/data_bert_215/val data/cut/cut_bert/cut_5 data/cut/cut_bert/cut_10 data/cut/cut_bert/cut_20 data/cut/cut_bert/cut_50
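As an alternative to the mkdir commands, the same folder tree can be created from Python in a way that is safe to re-run. This is a minimal sketch: the helper make_feature_dirs is hypothetical, and the last entry assumes that cut_50 belongs under data/cut/cut_bert like the other cut folders.

```python
import tempfile
from pathlib import Path

# Folder layout taken from the mkdir commands above; the cut_50 path is
# assumed to sit next to cut_5, cut_10 and cut_20.
FEATURE_DIRS = [
    "data/data_bert_600/train",
    "data/data_bert_600/val",
    "data/data_bert_215/train",
    "data/data_bert_215/val",
    "data/cut/cut_bert/cut_5",
    "data/cut/cut_bert/cut_10",
    "data/cut/cut_bert/cut_20",
    "data/cut/cut_bert/cut_50",
]


def make_feature_dirs(root="."):
    """Create every feature folder under root, including missing parents."""
    created = []
    for relative in FEATURE_DIRS:
        path = Path(root) / relative
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demonstration in a throwaway directory; use make_feature_dirs(".") in the repo.
with tempfile.TemporaryDirectory() as tmp:
    created = make_feature_dirs(tmp)
    all_exist = all(path.is_dir() for path in created)
```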

Then run :

python3 sequence_to_bert_features.py
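The "one csv file per 10 sequences" scheme can be sketched as below. This is not the code in sequence_to_bert_features.py: fake_embed is a stand-in for the ProtBert forward pass, and the file naming pattern features_<n>.csv is an assumption made for the example.

```python
import csv
import tempfile
from pathlib import Path


def fake_embed(sequence):
    """Stand-in for the ProtBert forward pass; returns a tiny feature vector."""
    return [len(sequence), sequence.count("A"), sequence.count("G")]


def save_features_in_chunks(sequences, out_dir, embed=fake_embed, chunk_size=10):
    """Write one csv file per chunk so that memory use stays bounded."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(sequences), chunk_size):
        path = out_dir / f"features_{start // chunk_size}.csv"  # hypothetical name
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            for sequence in sequences[start:start + chunk_size]:
                writer.writerow(embed(sequence))
        paths.append(path)
    return paths


def load_features(paths):
    """Read the chunk files back, in order, into one list of float vectors."""
    rows = []
    for path in paths:
        with open(path, newline="") as handle:
            rows.extend([float(value) for value in row] for row in csv.reader(handle))
    return rows


sequences = ["MKT" + "A" * i for i in range(25)]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_features_in_chunks(sequences, tmp)
    features = load_features(paths)
```

Writing each chunk as soon as it is computed also means an interrupted run loses at most the chunk in progress.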

Recreate our results

To use our models and recreate our results, follow the notebooks method_1-2.ipynb and method_3.ipynb, which are largely self-explanatory.

Requirements

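Make sure that the following libraries are installed:

pytorch, pandas, scikit-learn (imported as sklearn), numpy, seaborn, matplotlib

The re and csv modules are also used; they are part of the Python standard library and need no installation.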
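A quick way to check the environment before opening the notebooks is sketched below. The helper missing_packages is hypothetical; the names listed are the import names, so pytorch appears as torch and scikit-learn as sklearn.

```python
import importlib.util

# Import names of the third-party libraries listed above.
REQUIRED = ["torch", "pandas", "sklearn", "numpy", "seaborn", "matplotlib"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```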
