Singapore has an ageing population, and the incidence of stroke has been rising steadily. If a stroke could be predicted before it happens, many deaths could be prevented. However, traditional stroke prediction models often rely on medical data that a typical individual cannot easily obtain, which makes the prediction process cumbersome. What if easily accessible data such as marital status, BMI, age, gender, and work type could be used for stroke prediction instead? If successful, we could then explore other indirect factors to build even simpler stroke prediction models that the general public can use without medical knowledge or extensive personal medical data!
- healthcare-dataset-stroke-data.csv - Original CSV file
- stroke_cat.csv - Preprocessed data
Column | Description |
---|---|
id | ID |
gender | Male / Female |
age | Age |
hypertension | 1 / 0 |
heart_disease | 1 / 0 |
ever_married | Yes / No |
work_type | Private sector / Government Job / Self-employed / Never worked / Children |
Residence_type | Rural / Urban |
avg_glucose_level | Average glucose level |
bmi | Body mass index: weight (kg) / height (m)² |
smoking_status | Never smoked / Formerly smoked / Smokes / Unknown |
stroke (Label) | 1 / 0 |
- Exploratory_Data_Analysis_&_Data_Preprocessing.ipynb
- EDA
- Fill missing BMI values with mean BMI
- One-hot encoding
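The two preprocessing steps above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the actual notebook code; the column names follow the data dictionary above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for healthcare-dataset-stroke-data.csv (hypothetical values)
df = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.5, np.nan],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job"],
})

# Fill missing BMI values with the mean of the observed BMIs
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# One-hot encode the categorical column into indicator columns
df = pd.get_dummies(df, columns=["work_type"])
print(df.columns.tolist())
```

With the toy values above, the two missing BMIs are replaced by the mean of 22.0 and 30.5, and `work_type` expands into one indicator column per category.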
- Machine_Learning_Models.ipynb
- SMOTE
- 4 Machine learning models implemented
- Artificial Neural Network
- Logistic Regression
- Random Forest
- XGBoost
- Comparison of models by F1 score and accuracy
- Findings: Random Forest achieved the highest prediction accuracy (94.09%) and F1 score (96.95%)
- os
- numpy
- pandas
- scipy
- seaborn
- matplotlib
- keras
- xgboost
- sklearn
- imblearn
- tensorflow
- Open Exploratory_Data_Analysis_&_Data_Preprocessing.ipynb in Google Colab and run the whole notebook
- At the end, the preprocessed data will be saved as stroke_cat.csv
- Download stroke_cat.csv and upload it to Google Drive before running Machine_Learning_Models.ipynb
- Additional methods of data visualisation:
- 2-dimensional KDE plots: effects of two different predictors on the probability of the response variable
- Stacked bar charts with a y-axis offset: comparison between the two categories of the response variable on a highly imbalanced dataset
- Concepts to handle highly imbalanced data (SMOTE)
- Concepts and implementation of 4 new machine learning models with their respective libraries required
- Artificial Neural Network
- Logistic Regression
- Random Forest
- XGBoost
- Concepts of model comparison (F1 score, recall, area under the ROC curve). For imbalanced datasets, the F1 score is a better indicator of predictive power than accuracy.
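The point about F1 score versus accuracy can be demonstrated with a hypothetical classifier that always predicts the majority class on a 95/5 imbalanced label distribution:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95% negatives, similar to the stroke label distribution
y_true = np.array([0] * 95 + [1] * 5)
# A useless classifier that always predicts "no stroke"
y_pred = np.zeros(100, dtype=int)

# Accuracy looks great, but F1 on the positive class exposes the model
print(accuracy_score(y_true, y_pred))                    # 0.95
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

Accuracy rewards the trivial majority-class predictor, while the F1 score on the minority class drops to zero because no strokes are ever detected.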
- Si Ming Zhou
- EDA
- Logistic Regression
- Slides and Presentation
- Jeremy Lim
- Random Forest
- Slides and Presentation
- Siah Wee Hung
- EDA
- ANN and XGBoost Models
Metric | Logistic Regression | Random Forest | ANN | XGBoost |
---|---|---|---|---|
Accuracy | 0.93 | 0.94 | 0.81 | 0.93 |
F1 Score | 0.96 | 0.97 | 0.89 | 0.96 |