Stroke Prediction Using Data Mining is a machine learning project that builds a classifier to predict whether an individual is likely to suffer a stroke, based on their health and demographic data.
This project involved:
- Comprehensive data preprocessing and EDA
- Implementation of 5 ML models (Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, SVM)
- Use of SMOTE (for class imbalance) and Recursive Feature Elimination (RFE, for feature selection)
- Creation of an advanced Stacked Ensemble Model for improved predictive accuracy
- Source: Kaggle - Stroke Prediction Dataset
- Records: 5,110 instances
- Features:
  - Categorical: Gender, Ever Married, Work Type, Residence Type, Smoking Status
  - Numerical: Age, Avg Glucose Level, BMI, plus Hypertension and Heart Disease (binary 0/1 flags)
Note: Missing BMI values were filled with the column mean (mean imputation).
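
As a minimal sketch of the mean imputation mentioned in the note, assuming the data is loaded into a pandas DataFrame: the toy rows and the column name `bmi` below are illustrative assumptions, not taken from the project notebook.

```python
import pandas as pd

# Toy rows standing in for the real dataset; the column names are assumptions.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, None, 32.5, 34.4],
})

# Series.mean() skips missing values by default, so this is the mean of the
# observed BMI values only.
bmi_mean = df["bmi"].mean()

# Replace every missing BMI with that mean.
df["bmi"] = df["bmi"].fillna(bmi_mean)
```
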
Technique | Purpose |
---|---|
Label Encoding | Convert categorical variables |
SMOTE | Handle class imbalance |
Standard Scaler | Standardize numerical features (zero mean, unit variance) |
Recursive Feature Elimination (RFE) | Feature selection |
Ensemble Learning | Improve accuracy by combining multiple models |
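
To illustrate the idea behind SMOTE from the table above, here is a from-scratch sketch in plain Python. It is not the project's code, which presumably relied on a library implementation. The function name `smote_sketch`, its parameters, and the toy points are all hypothetical. Each synthetic sample is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # gap in [0, 1) picks a position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class points with two already-scaled features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)
```
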
Model | Accuracy | F1 Score | ROC AUC |
---|---|---|---|
Logistic Regression | 0.79 | 0.80 | 0.85 |
Decision Tree | 0.91 | 0.91 | 0.91 |
Random Forest | 0.96 | 0.96 | 0.99 |
K-Nearest Neighbors (KNN) | 0.90 | 0.90 | 0.95 |
Support Vector Machines (SVM) | 0.84 | 0.85 | 0.91 |
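
For readers unfamiliar with the three columns, the sketch below computes each metric by hand on a tiny made-up example. The labels, scores, and the 0.5 threshold are illustrative assumptions. The table does not say which F1 averaging was used, so this shows the plain binary F1 for the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, not results from the project.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.9, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
f1 = f1_binary(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy and F1 depend on the chosen threshold, while ROC AUC is computed from the raw scores and does not.
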
🚀 Best individual model: Random Forest
🚀 Even better: the Stacked Ensemble reached 96.66% accuracy!
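
A stacked ensemble of this kind could be sketched with scikit-learn's `StackingClassifier` roughly as follows. This is only an illustration: the choice of base models, the logistic-regression meta-learner, the hyperparameters, and the synthetic data from `make_classification` are all assumptions, not the configuration used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the stroke dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Base models whose cross-validated predictions feed the meta-learner.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The final estimator learns how to weigh the base models' outputs.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

Because the meta-learner is trained on out-of-fold predictions (`cv=5`), it does not see predictions the base models made on their own training rows.
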
- Hyperparameter tuning improved the SVM's accuracy and ROC AUC.
- The Stacked Ensemble combined the base models and gave the best generalization.
- Data preprocessing and feature engineering significantly affect performance.
- SMOTE improved detection of the minority (stroke) class.
- Ensemble models (Random Forest, Stacked Ensemble) outperformed the individual models in this comparison.
- Feature selection with RFE simplifies models at the cost of a minor performance trade-off.
- 📂 Dataset: Stroke Prediction Dataset - Kaggle
- 📑 Project Notebook: Google Colab Notebook
🎯 Developed as part of CPS 844 - Data Mining, under Prof. Cherie Ding at Toronto Metropolitan University.