
Machine Learning project for predicting stroke risk using healthcare data. Includes EDA, preprocessing, SMOTE, feature selection (RFE), evaluation of Logistic Regression, Decision Tree, Random Forest, KNN, SVM, and Stacked Ensemble models.


hetuvpatel/brain-stroke-prediction


🧠 Stroke Prediction Using Data Mining

Python | Machine Learning | Colab | License: MIT


📖 Project Overview

Stroke Prediction Using Data Mining is a machine learning project that aims to build a predictive model to classify whether an individual is likely to suffer a stroke based on their healthcare and demographic data.

This project involved:

  • Comprehensive data preprocessing and EDA
  • Implementation of 5 ML models (Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, SVM)
  • Usage of techniques like SMOTE (for class imbalance) and Recursive Feature Elimination (RFE)
  • Creation of an advanced Stacked Ensemble Model for improved predictive accuracy

📊 Dataset

  • Source: Kaggle - Stroke Prediction Dataset
  • Records: 5,110 instances
  • Features:
    • Categorical: Gender, Ever Married, Work Type, Residence Type, Smoking Status
    • Numerical: Age, Avg Glucose Level, BMI
    • Binary (0/1): Hypertension, Heart Disease

Note: BMI missing values were handled via mean imputation.
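As a minimal sketch of the mean imputation mentioned in the note, the snippet below fills missing BMI values with the mean of the observed ones. The toy DataFrame and the column name `bmi` are assumptions for illustration; the project's actual notebook code is not shown in this README.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle CSV; the column name "bmi" is an assumption.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Mean of the observed BMI values (pandas skips NaN by default).
bmi_mean = df["bmi"].mean()

# Replace every missing BMI with that mean.
df["bmi"] = df["bmi"].fillna(bmi_mean)
```

When a train/test split is used, computing the mean on the training portion only and reusing it for the test portion avoids leaking test information into preprocessing.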


🛠️ Techniques Used

| Technique | Purpose |
|---|---|
| Label Encoding | Convert categorical variables to integer codes |
| SMOTE | Handle class imbalance |
| Standard Scaler | Standardize numerical features (zero mean, unit variance) |
| Recursive Feature Elimination (RFE) | Feature selection |
| Ensemble Learning | Boost accuracy with multiple models |
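The encoding, scaling, and feature-selection techniques listed here could be chained roughly as follows. This is a sketch on synthetic data using scikit-learn; the column names, the toy target, the choice of Logistic Regression as the RFE estimator, and `n_features_to_select=3` are all assumptions, not the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the stroke data; column names are assumptions.
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "smoking_status": rng.choice(["never smoked", "smokes", "formerly smoked"], size=n),
    "age": rng.uniform(20, 90, size=n),
    "avg_glucose_level": rng.uniform(60, 250, size=n),
    "bmi": rng.uniform(15, 45, size=n),
})
# Toy target loosely tied to age so that RFE has some signal to find.
y = (df["age"] + rng.normal(0, 10, size=n) > 65).astype(int)

# 1) Label Encoding: map each category string to an integer code.
for col in ["gender", "smoking_status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 2) Standard Scaler: give each numerical column zero mean and unit variance.
num_cols = ["age", "avg_glucose_level", "bmi"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 3) RFE: repeatedly drop the weakest feature until three remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(df, y)
selected = list(df.columns[selector.support_])
print("Selected features:", selected)
```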

📈 Models Evaluated

| Model | Accuracy | F1 Score | ROC AUC |
|---|---|---|---|
| Logistic Regression | 0.79 | 0.80 | 0.85 |
| Decision Tree | 0.91 | 0.91 | 0.91 |
| Random Forest | 0.96 | 0.96 | 0.99 |
| K-Nearest Neighbors (KNN) | 0.90 | 0.90 | 0.95 |
| Support Vector Machines (SVM) | 0.84 | 0.85 | 0.91 |

🚀 Best Individual Model: Random Forest
🚀 Best Overall: Stacked Ensemble, with 96.66% accuracy!
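For readers who want to reproduce this kind of comparison, the sketch below computes accuracy, F1 score, and ROC AUC for the same five model families with scikit-learn. It runs on synthetic data from `make_classification` with default hyperparameters, both of which are assumptions, so its numbers will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the stroke dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC AUC needs a continuous score: use class probabilities when the
    # model offers them, otherwise fall back to the decision function.
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, m in results.items():
    print(f"{name:<20} acc={m['accuracy']:.2f} f1={m['f1']:.2f} auc={m['roc_auc']:.2f}")
```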


📊 Additional Observations

  • Hyperparameter optimization improved the SVM's accuracy and ROC AUC.
  • The Stacked Ensemble combined the base models and gave the best generalization.
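A stacked ensemble of this kind can be sketched with scikit-learn's `StackingClassifier`. The README does not say which base models or meta-learner the project used, so the three base models, the Logistic Regression meta-learner, `cv=5`, and the synthetic data below are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed stroke dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base models whose cross-validated predictions become the
# input features of the meta-learner.
base_models = [
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=5,  # folds used to generate out-of-fold base predictions
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked ensemble accuracy: {stack_accuracy:.2f}")
```

Training the meta-learner on out-of-fold predictions, as `cv` does here, keeps it from simply memorizing base models that have already seen the same rows.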


🔥 Key Takeaways

  • Data Preprocessing and Feature Engineering significantly affect performance.
  • SMOTE improved detection of the minority (stroke) class.
  • Ensemble models (Random Forest, Stacked) outperformed the individual models.
  • Feature Selection (RFE) simplifies models with minor performance trade-offs.
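To illustrate the idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. It is a simplified NumPy illustration, not the implementation the project used; the function name `smote_sketch`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a chosen sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # starting sample per new point
    pick = rng.integers(0, k, size=n_new)            # which neighbor to move toward
    nbr = neighbors[base, pick]
    gap = rng.random((n_new, 1))                     # position along the segment
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Toy imbalanced data: 95 majority rows and 5 minority rows.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(95, 2))
X_minority = rng.normal(3.0, 1.0, size=(5, 2))

# Generate enough synthetic minority rows to balance the two classes.
X_synth = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority), k=3)
X_balanced = np.vstack([X_majority, X_minority, X_synth])
y_balanced = np.array([0] * len(X_majority) + [1] * (len(X_minority) + len(X_synth)))
```

Oversampling is usually applied to the training split only, so that evaluation still reflects the original class distribution.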

📬 Contact

  • 👤 Hetu Virajkumar Patel | GitHub | LinkedIn
  • 👤 Nilay Thakorbhai Patel

🎯 Developed as part of CPS 844 - Data Mining, under Prof. Cherie Ding at Toronto Metropolitan University.

