Thema 09 Machine learning

Dennis Haandrikman, 347655, HanzeHogeschool Groningen

Introduction

This experiment is a means to explore machine learning practices, while trying to contribute to the current day relevant topic of vaccine prediction for diseases akin to COVID-19. If successful, this practice could be used to help vaccine creation based on peptide/protein sequences henceforth.

As such, the experiment will recreate the experiment/project that was done with this dataset. Which is to calculate how many antigen binding sites are on COVID-19 peptide/protein sequences, based on the peptide/protein characteristics.

As such the following research question has been formulated:

Is it possible to predict if an amino-acid peptide promotes antibody-inducing activities, based on the peptide/protein structure and properties? And is it possible to try to expand or improve on the experiment and predictions that have been done previously on the same dataset regarding COVID-19?

The paper for this experiment can be read on the following link: https://www.biorxiv.org/content/10.1101/2020.07.27.224121v1

Data

The experiment makes use of a dataset used for epitome prediction for COVID-19 peptide sequences, available on the following website: https://www.kaggle.com/datasets/futurecorporation/epitope-prediction

It consists of 3 different csv files, which can be found in the /Data folder, they are the following:

input_bcell.csv; The main training data, consisting 14387 rows for all combinations of 14362 peptides and 757 proteins.
input_sars.csv; Additional main training data, consisting of 520 rows.
input_covid.csv; The target data, there is no label data in the columns.

In all three the datasets there is information of the protein and peptide:

parent_protein_id; parent protein ID
protein_seq; parent protein sequence
start_position; start position of peptide
end_position; end position of peptide
peptide_seq; peptide sequence
chou_fasman; peptide feature, ß turn
emini; peptide feature, relative surface accessibility
kolaskar_tongaonkar; peptide feature, antigenicity
parker; peptide feature, hydrophobicity
isoelectric_point; protein feature
aromacity; protein feature
hydrophobicity; protein feature
stability; protein feature

Further information about the variables is located in the peptide_protein_properties_codebook.csv file.

Usage

The usage for the EDA and the Java_wrapper are located in both sub-folders.

Tree of the project

│   .gitattributes
│   .gitignore
│   README.md
│
├───EDA_analysis
│   │   peptide_protein_properties_codebook.csv
│   │   README.md
│   │
│   ├───Data
│   │   │   input_bcell.csv
│   │   │   input_covid.csv
│   │   │   input_sars.csv
│   │   │
│   │   ├───Images
│   │   │       Figure1.png
│   │   │       Figure2.png
│   │   │       Figure3.png
│   │   │       Figure4_heatmap.png
│   │   │       Figure5.png
│   │   │       Figure6.png
│   │   │       Figure7.png
│   │   │
│   │   └───Weka
│   │       │   bcel_sars_data.arff
│   │       │   bcel_sars_log_transform.arff
│   │       │   Cost_sensitive_test_machine_learning.arff
│   │       │   exploratory_weka_algorythm_analysis.arff
│   │       │   exploratory_weka_algorythm_analysis.exp
│   │       │
│   │       ├───Models
│   │       │       RandomForest_500_iterations.model
│   │       │       RandomForest_500_iterations_randomtiebreaks.model
│   │       │
│   │       └───ROC_curve
│   │               Non-promoting_ROC-curve.arff
│   │               Promoting_ROC-curve.arff
│   │
│   ├───EDA_files
│   │       Thema-09_EDA_log.pdf
│   │       Thema-09_EDA_log.rmd
│   │
│   └───EDA_report
│           EDA_report_Dennis_Haandrikman.pdf
│           EDA_report_Dennis_Haandrikman.rmd
│
└───Java_wrapper
    │   README.md
    │
    └───ML_Java_Wrapper
        │   build.gradle
        │   settings.gradle
        │
        ├───src
        │   └───main
        │       ├───java
        │       │   └───nl
        │       │       └───bioinf
        │       │           └───dhaandrikman
        │       │               └───Java_wrapper
        │       │                       WekaRunner.java
        │       │
        │       └───resources
        │               RandomForest_500_iterations.model
        │
        ├───test_data
        │       bcel_sars_log_transform_test_data.arff
        │
        └───Usable_Build
                MLModelWrapper_with_dep-1.0-all.jar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Thema 09 Machine learning

Introduction

Data

Usage

Tree of the project

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
EDA_analysis		EDA_analysis
Java_wrapper		Java_wrapper
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md

License

Kyreria/Thema_09_Machine_learning

Folders and files

Latest commit

History

Repository files navigation

Thema 09 Machine learning

Introduction

Data

Usage

Tree of the project

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages