Singapore has an ageing population, and the incidence of stroke has been rising steadily. If a stroke could be predicted before it happens, many deaths could be prevented. However, traditional stroke prediction models often rely on medical data that a typical individual cannot easily obtain, which makes the prediction process cumbersome. What if easily accessible data such as marital status, BMI, age, gender, and work type could be used for stroke prediction instead? If successful, we could then explore other indirect factors to build even simpler stroke prediction models that the general public can use without medical knowledge or extensive personal medical data!
- healthcare-dataset-stroke-data.csv - Original CSV file
- stroke_cat.csv - Preprocessed data
Column | Description |
---|---|
id | ID |
gender | Male / Female |
age | Age |
hypertension | 1 / 0 |
heart_disease | 1 / 0 |
ever_married | Yes / No |
work_type | Private sector / Government Job / Self-employed / Never worked / Children |
Residence_type | Rural / Urban |
avg_glucose_level | Average glucose level |
bmi | Body mass index: weight (kg) / height (m)² |
smoking_status | Never smoked / Formerly smoked / Smokes / Unknown |
stroke (Label) | 1 / 0 |
- Exploratory_Data_Analysis_&_Data_Preprocessing.ipynb
- EDA
- Fill missing BMI values with mean BMI
- One-hot encoding
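The two preprocessing steps above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the actual notebook code; the column names follow the data dictionary above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for healthcare-dataset-stroke-data.csv (hypothetical values)
df = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.5, np.nan],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job"],
})

# Fill missing BMI values with the mean of the observed BMIs
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# One-hot encode the categorical column into indicator columns
df = pd.get_dummies(df, columns=["work_type"])
print(df.columns.tolist())
```

With the toy values above, the two missing BMIs are replaced by the mean of 22.0 and 30.5, and `work_type` expands into one indicator column per category.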
- Machine_Learning_Models.ipynb
- SMOTE
- 4 Machine learning models implemented
- Artificial Neural Network
- Logistic Regression
- Random Forest
- XGBoost
- Comparison of models by F1 score and accuracy
- Findings: Random Forest achieved the highest prediction accuracy (94.09%) and F1 score (96.95%)
- os
- numpy
- pandas
- scipy
- seaborn
- matplotlib
- keras
- xgboost
- sklearn
- imblearn
- tensorflow
- Open Exploratory_Data_Analysis_&_Data_Preprocessing.ipynb in Google Colab and run the whole notebook
- At the end, the preprocessed data will be saved as stroke_cat.csv
- Download stroke_cat.csv and upload it to Google Drive before running Machine_Learning_Models.ipynb
- Additional methods of data visualisation:
- 2-dimensional KDE plots: effects of two different predictors on the probability of the response variable
- Stacked bar charts with a y-axis offset: comparison between the two categories of the response variable on a highly imbalanced dataset
- Concepts to handle highly imbalanced data (SMOTE)
- Concepts and implementation of 4 new machine learning models with their respective libraries required
- Artificial Neural Network
- Logistic Regression
- Random Forest
- XGBoost
- Concepts of model comparison (F1 score, recall, area under the ROC curve). For imbalanced datasets, the F1 score is a better indicator of predictive power than accuracy.
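The point about F1 score versus accuracy can be demonstrated with a hypothetical classifier that always predicts the majority class on a 95/5 imbalanced label distribution:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95% negatives, similar to the stroke label distribution
y_true = np.array([0] * 95 + [1] * 5)
# A useless classifier that always predicts "no stroke"
y_pred = np.zeros(100, dtype=int)

# Accuracy looks great, but F1 on the positive class exposes the model
print(accuracy_score(y_true, y_pred))                    # 0.95
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

Accuracy rewards the trivial majority-class predictor, while the F1 score on the minority class drops to zero because no strokes are ever detected.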
- Si Ming Zhou
- EDA
- Logistic Regression
- Slides and Presentation
- Jeremy Lim
- Random Forest
- Slides and Presentation
- Siah Wee Hung
- EDA
- ANN and XGBoost Models
Metric | Logistic Regression | Random Forest | ANN | XGBoost |
---|---|---|---|---|
Accuracy | 0.93 | 0.94 | 0.81 | 0.93 |
F1 Score | 0.96 | 0.97 | 0.89 | 0.96 |