HatetoRush/SC1015_DSAI_Project_Stroke_Prediction
SC1015 DSAI Project Stroke Prediction BCF1 Group 7


Objective

Singapore has an ageing population, and the incidence of stroke has been rising steadily. If a stroke can be predicted before it happens, many deaths could be prevented. However, traditional stroke prediction models often rely on medical data that a typical individual cannot easily obtain, so the prediction process can be very cumbersome. What if easily accessible data such as marital status, BMI, age, gender and work type could be used for stroke prediction? If successful, then perhaps we could explore other indirect factors to build even simpler stroke prediction models that the general public can use without medical knowledge or extensive personal medical records!

YouTube Link

DSAI Mini-Project Video

Dataset Folder

  • healthcare-dataset-stroke-data.csv - Original CSV file
  • stroke_cat.csv - Preprocessed data
Column Description

| Column | Description |
| --- | --- |
| id | Patient ID |
| gender | Male / Female |
| age | Age |
| hypertension | 1 / 0 |
| heart_disease | 1 / 0 |
| ever_married | Yes / No |
| work_type | Private sector / Government job / Self-employed / Never worked / Children |
| Residence_type | Rural / Urban |
| avg_glucose_level | Average glucose level |
| bmi | weight (kg) / height (m)² |
| smoking_status | Never smoked / Formerly smoked / Smokes / Unknown |
| stroke (label) | 1 / 0 |
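As a note on the `bmi` column: BMI is weight divided by the square of height, not the square of the ratio. A minimal sketch (the function name is illustrative, not from the notebooks):

```python
# BMI = weight (kg) / height (m) squared
def bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2

print(round(bmi(70, 1.75), 2))  # → 22.86
```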

Notebooks Folder

  • Exploratory_Data_Analysis_&_Data_Preprocessing.ipynb

    • EDA
    • Fill missing BMI values with mean BMI
    • One-hot encoding
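The two preprocessing steps above can be sketched in pandas as follows, using a tiny synthetic frame that mirrors two of the dataset's columns (the values are illustrative, not from the real data):

```python
import pandas as pd

# Toy frame standing in for the stroke dataset
df = pd.DataFrame({
    "bmi": [22.0, None, 30.0],
    "smoking_status": ["never smoked", "smokes", "Unknown"],
})

# 1. Fill missing BMI values with the column mean
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# 2. One-hot encode the categorical column
df = pd.get_dummies(df, columns=["smoking_status"])

print(df.columns.tolist())
```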
  • Machine_Learning_Models.ipynb

    • SMOTE
    • 4 Machine learning models implemented
      1. Artificial Neural Network
      2. Logistic Regression
      3. Random Forest
      4. XGBoost
    • Comparison of models by F1 score and accuracy
    • Findings: Random Forest achieved the highest prediction accuracy (94.09%) and F1 score (96.95%)

Libraries

  • os
  • numpy
  • pandas
  • scipy
  • seaborn
  • matplotlib
  • keras
  • xgboost
  • sklearn
  • imblearn
  • tensorflow

Directions

  1. Open Exploratory_Data_Analysis_&_Data_Preprocessing.ipynb in Google Colab and run the whole notebook
  2. At the end, the preprocessed data will be saved as stroke_cat.csv
  3. Download stroke_cat.csv and upload it to Google Drive before running Machine_Learning_Models.ipynb


What we learnt in this project

  • Additional methods of data visualisation:
    • 2-dimensional KDE plots: effects of two different predictors on the probability of the response variable
    • Stacked bar charts with y-offset: comparison between the two categories of the response variable on a highly imbalanced dataset
  • Concepts to handle highly imbalanced data (SMOTE)
  • Concepts and implementation of 4 new machine learning models with their respective libraries required
    • Artificial Neural Network
    • Logistic Regression
    • Random Forest
    • XGBoost
  • Concepts on Model Comparison (F1 Score, recall, area under ROC curve). In the case of imbalanced datasets, F1 Score is a better indicator of predictive power than prediction accuracy.
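The point about F1 score versus accuracy can be seen with a tiny hand-made example: on a 95/5 imbalanced label set, a model that always predicts the majority class scores 95% accuracy but 0 F1 on the minority class (the labels here are illustrative, not from the dataset):

```python
from sklearn.metrics import accuracy_score, f1_score

# 95 negatives, 5 positives; the "model" always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # → 0.95
print(f1_score(y_true, y_pred))        # → 0.0
```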

Individual Contributions

  • Si Ming Zhou

    • EDA
    • Logistic Regression
    • Slides and Presentation
  • Jeremy Lim

    • Random Forest
    • Slides and Presentation
  • Siah Wee Hung

    • EDA
    • ANN and XGBoost Models

Conclusion

| | Logistic Regression | Random Forest | ANN | XGBoost |
| --- | --- | --- | --- | --- |
| Accuracy | 0.93 | 0.94 | 0.81 | 0.93 |
| F1 Score | 0.96 | 0.97 | 0.89 | 0.96 |
