Bio-Informatics Year 3, Period 9: Project Machine Learning [Part 1: Analysis] (BFVH3TH9) (2022-2023)

Vincent-Talen/Project-Machine-Learning-Part1_Analysis

Introduction to Machine Learning (Part 1: Analysis)

Hanzehogeschool Groningen: Bioinformatics Project Year 3, Period 9

Performing an Exploratory Data Analysis on a chosen data set and creating a Machine Learning Algorithm that is able to predict class labels for instances.

About the project

The goal of this project is to learn about and create Machine Learning Algorithms. Multiple topics and competences are involved in this process, including an initial Exploratory Data Analysis on the data set, keeping a research log to make sure everything is reproducible, and the creation of a Java wrapper around the final machine learning algorithm.
At the end, a scientific report is written that discusses the data exploration and the strategy and methodology of benchmarking, performance and optimization in relation to machine learning algorithms. Another important aspect of the project is the ability to look critically at work that has been done, which is practiced by giving and receiving peer feedback.

Chosen Data Set and Research

The Breast Cancer Wisconsin (Diagnostic) data set has been chosen as the subject of this project. It was created to easily and accurately diagnose breast masses with just a fine needle aspiration (FNA). Ten visually assessed characteristics were identified as relevant to diagnosis.
Data was gathered by scanning a sample from the FNA after staining, so that the cell nuclei were highlighted. The boundaries of all individual nuclei were then marked, isolating the nuclei from each other. A program computed values for each of the ten characteristics of each nucleus, measuring size, shape and texture. After that, the mean, standard error and worst (extreme) values of these characteristics were computed, resulting in a total of 30 nuclear features for each sample.
The full data set consists of 569 cases and was used to train an algorithm that can classify and differentiate between benign and malignant FNA samples of breast masses.
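
The feature count follows from the description above: ten characteristics times three summary statistics gives 30 features per sample. A minimal R sketch of that layout, assuming the column order documented for the public data set (an ID, the diagnosis, then all means, all standard errors and all worst values). The names themselves are illustrative and are not taken from this repository's codebook:

```r
# Ten characteristics computed for each cell nucleus (as listed in wdbc.names)
characteristics <- c("radius", "texture", "perimeter", "area", "smoothness",
                     "compactness", "concavity", "concave_points", "symmetry",
                     "fractal_dimension")
# Three summary statistics computed per characteristic for every sample
statistics <- c("mean", "se", "worst")

# The statistic varies slowest: all means first, then all SEs, then all worst values
feature_names <- paste(rep(characteristics, times = length(statistics)),
                       rep(statistics, each = length(characteristics)),
                       sep = "_")
# Illustrative names for the two leading, non-feature columns
column_names <- c("id", "diagnosis", feature_names)

length(feature_names)  # 30
```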

Repository File Structure

Project Tree

Project-Machine-Learning-Part1_Analysis
├── research_log.pdf
├── report.pdf
├── LICENSE
├── README.md
├── data
│   ├── raw
│   │   ├── codebook.txt
│   │   ├── wdbc.data
│   │   └── wdbc.names
│   └── processed
│       └── data.arff
├── output
│   ├── algorithm_performances
│   │   └── *
│   ├── figures
│   │   └── *
│   ├── models
│   │   └── *
│   └── roc_data
│       └── *
└── src
    ├── rmd
    │   ├── research_log.Rmd
    │   └── report.Rmd
    ├── scripts
    │   ├── weka_analyser_custom.R
    │   └── split_violin_plot.R
    └── report_subfiles
        ├── abbreviations.tex
        ├── abstract.tex
        ├── after_body.Rmd
        ├── before_body.tex
        ├── import.tex
        └── references.bib

research_log.pdf

This research log has all the steps taken during the entire process, from the initial data exploration up to the creation of the final machine learning model. The project is easily reproducible because of this, so it's a great opportunity to learn the steps taken for creating a machine learning algorithm.

report.pdf

To properly report on all findings of the project, this report has been written. The report gets into all the results and discusses any complications encountered or what could be improved when reproducing this project. A project proposal is also included for a possible project for the minor Application Design from the Bioinformatics Bachelor at the Hanze University of Applied Sciences Groningen.

data/

The publicly available data set was downloaded and placed in the raw data subdirectory for easy access; these are the wdbc.data and wdbc.names files. Because the data file does not have a header line, a codebook was created to use in tandem with the data, making it easy to set column names, graph titles or axis labels.
The processed subdirectory contains the data that is generated during this project, such as data.arff, which is the data set to be used for the machine learning algorithms in Weka.
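
As a rough illustration of how a headerless data file can be combined with externally supplied column names, the sketch below reads such a file and prepares it for export to ARFF. The function name `read_wdbc` and the `col_names` argument are hypothetical, the layout of codebook.txt is not shown here, and the sketch assumes one of the columns is named `diagnosis`. It is not necessarily how data.arff was actually produced.

```r
# Read a headerless, comma-separated file such as wdbc.data.
# `col_names` is a hypothetical character vector with one name per column
# (for example derived from the codebook); it must contain "diagnosis".
read_wdbc <- function(path, col_names) {
  data <- read.csv(path, header = FALSE, col.names = col_names)
  # Weka expects a nominal class attribute, so store the diagnosis as a factor
  data$diagnosis <- factor(data$diagnosis, levels = c("B", "M"))
  data
}

# Usage, run from the repository root (write.arff is part of the 'foreign' package):
# wdbc <- read_wdbc("data/raw/wdbc.data", col_names)
# foreign::write.arff(wdbc, file = "data/processed/data.arff")
```

Storing the class label as a factor matters because `write.arff` exports factors as nominal attributes, which is what Weka classifiers expect for a class label.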

output/

In this directory lie four subdirectories:

  • algorithm_performances: contains the performance data of runs of different algorithms with different settings
  • figures: holds all the generated figures from the research log
  • models: exported models from Weka
  • roc_data: ROC curve visualization data files

src/

The src/ directory houses three subdirectories where files are placed so the (root of the) repository is not as cluttered with loose files.

  • report_subfiles/: To keep the report text file itself as clean as possible, everything other than the actual article text is split into separate files located in the subdirectory report_subfiles. This way the formatting and code do not clutter the text, and everything is easier to find.
  • rmd/: The research_log and report RMarkdown files are placed here to keep the root directory cleaner, since they are looked at less often than the rendered PDFs.
  • scripts/: The same principle applies to all .R functions and scripts used in this project, which are placed in the scripts directory.

Installation

This project was written on macOS in RStudio (version 2022.02.0) with R version 4.1.3 for Apple silicon (arm64).

First, a working R environment is needed, which can be installed from the CRAN website by carefully following the instructions there.
Second, either RStudio or another editor of choice should be installed. Note that LaTeX is needed to be able to knit the documents.
Finally, the required packages listed below should be installed, after which the project should be reproducible.

Make sure that before running or knitting any files the working directory is that of this repository!
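
A small sanity check along these lines can catch a wrong working directory before anything is run. The helper name `is_repo_root` is hypothetical and the paths it checks are taken from the project tree above. The commented `rmarkdown::render()` call is only one possible way to knit from the root and assumes the rmarkdown package is installed (it is not in the package list below).

```r
# Return TRUE when `dir` looks like the root of this repository,
# judged by a few paths from the project tree.
is_repo_root <- function(dir = getwd()) {
  expected <- c("data/raw/wdbc.data", "src/rmd", "src/scripts")
  all(file.exists(file.path(dir, expected)))
}

# Usage: stop early instead of failing later on a relative path
# if (!is_repo_root()) stop("Set the working directory to the repository root")
# rmarkdown::render("src/rmd/research_log.Rmd", knit_root_dir = getwd())
```

Passing `knit_root_dir` makes the code chunks evaluate relative to the repository root even though the .Rmd files live in src/rmd.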

Required Packages

The following R packages are required for this project and should be installed through an R console using the install.packages() function:

  • data.table
  • ggpubr
  • kableExtra
  • sass
  • tidyverse
  • pander
  • ggplot2
  • ggbiplot (might need to download from GitHub using 'remotes' package)

To easily install any missing packages, the code below can be used instead; paste it into an R console and run it:

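required_packages <- c("tidyverse", "pander", "ggplot2", "data.table", "ggpubr", "remotes", "kableExtra", "sass")
# Install only the CRAN packages that are not installed yet
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) install.packages(missing_packages)
# ggbiplot is installed from GitHub, and only when it is missing
if (!requireNamespace("ggbiplot", quietly = TRUE)) remotes::install_github("vqv/ggbiplot")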

Useful links

Contact

Vincent Talen
v.k.talen@st.hanze.nl
