# Stroke Risk Prediction System
A machine learning-powered system that predicts an individual's likelihood of stroke based on key clinical and lifestyle indicators. This project applies ensemble algorithms to healthcare data to identify high-risk patients and support early medical intervention.
## Overview
Early prediction of stroke risk can significantly reduce mortality and improve preventive care. This project leverages patient health metrics such as age, BMI, glucose levels, and smoking habits to develop a data-driven stroke risk classifier.
Using an ensemble comparison framework, multiple algorithms were trained, evaluated, and benchmarked to identify the best-performing model.
**Result:** The Random Forest classifier achieved a ROC-AUC of 0.77, outperforming the other ensemble models (XGBoost, LightGBM, CatBoost, Gradient Boosting).
## Project Structure

```
stroke-risk-ensemble-comparison/
├── data/
│   ├── raw/                  # Original dataset
│   └── processed/            # Cleaned and transformed data
├── src/
│   ├── preprocessing.py      # Handles cleaning, encoding, and SMOTE balancing
│   └── train.py              # Trains and compares all ensemble models
├── deployment/
│   ├── deploy.py             # Streamlit web interface for live predictions
│   └── requirements.txt
├── models/
│   └── best_model.pkl        # Saved Random Forest model
├── run_all.py                # Automated end-to-end pipeline
└── requirements.txt          # Dependency list
```
## Getting Started

### 1. Clone the Repository

```bash
git clone https://github.com/PerceptronCipher/stroke-risk-ensemble-comparison.git
cd stroke-risk-ensemble-comparison
```
### 2. Install Dependencies

```bash
pip install -r requirements.txt
```
### 3. Add the Dataset

Download the dataset and place it in `data/raw/`.
### 4. Run the Complete Pipeline

```bash
python run_all.py
```

This cleans the dataset, balances it with SMOTE, trains all models, and stores the best performer.
### 5. Launch the Web App

```bash
streamlit run deployment/deploy.py
```
## Methodology
**Data Preparation**

- Handled missing BMI values and outliers
- Encoded categorical variables
- Applied SMOTE to oversample the minority (stroke) class
**Model Training & Evaluation**

Compared five ensemble models, each evaluated on ROC-AUC, precision, recall, and F1-score to assess robustness and clinical reliability:

- Random Forest (best, ROC-AUC = 0.77)
- XGBoost
- LightGBM
- Gradient Boosting
- CatBoost
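The comparison loop boils down to fitting each model and ranking by ROC-AUC. A minimal sketch, using synthetic imbalanced data and only the two scikit-learn models to stay self-contained; XGBoost, LightGBM, and CatBoost expose the same `fit`/`predict_proba` interface and slot in identically.

```python
# Sketch of the model-comparison loop: train each classifier, score on a
# held-out split, and keep the best by ROC-AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the stroke dataset
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    scores[name] = roc_auc_score(y_te, proba)

best = max(scores, key=scores.get)
print(f"Best model: {best} (ROC-AUC = {scores[best]:.3f})")
```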
**Deployment**

Built a Streamlit web application that enables:

- Input of patient parameters (age, glucose level, BMI, etc.)
- Instant stroke risk prediction with a probability output
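At its core, the app loads the persisted model and returns a probability for one patient. A hedged sketch of that core (the Streamlit widgets are omitted): the feature names, their order, and the training data here are placeholders, and the real `deploy.py` must feed features in exactly the order used during training.

```python
# Core of a prediction endpoint: load a persisted model and return the
# stroke probability for a single patient.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for models/best_model.pkl: train and persist a tiny model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
joblib.dump(RandomForestClassifier(random_state=0).fit(X, y), "best_model.pkl")

def predict_stroke_risk(age, glucose, bmi, smokes):
    """Return P(stroke) for one patient as a float in [0, 1]."""
    model = joblib.load("best_model.pkl")
    features = np.array([[age, glucose, bmi, smokes]], dtype=float)
    return float(model.predict_proba(features)[0, 1])

risk = predict_stroke_risk(age=1.2, glucose=0.8, bmi=-0.3, smokes=1.0)
print(f"Predicted stroke probability: {risk:.2f}")
```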
## Key Insights
- Addressing class imbalance is crucial for medical data; SMOTE improved recall on the minority class.
- Random Forest remains a strong baseline for tabular healthcare data.
- In risk prediction, recall is often more important than overall accuracy: it is better to flag a potential risk than to miss one.
- Feature scaling and careful preprocessing significantly impact medical model validity.
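The accuracy-versus-recall point is easy to see with toy numbers (invented for illustration): a classifier that misses most stroke cases can still post a high accuracy because the negative class dominates.

```python
# Why recall matters on imbalanced data: 100 patients, 10 true stroke
# cases, and a model that catches only 2 of them.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 10 + [0] * 90          # 10 actual stroke cases
y_pred = [1] * 2 + [0] * 8 + [0] * 90  # only 2 of the 10 are flagged

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.92 -- looks strong
print("recall:  ", recall_score(y_true, y_pred))    # 0.20 -- misses 8 of 10
```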
## Future Improvements

- Hyperparameter optimization using Optuna
- Feature engineering with domain-specific health variables
- Ensemble stacking for performance gains
- Integration with cloud-based MLOps for scalable deployment
## Tech Stack

| Category | Tools / Frameworks |
| --- | --- |
| Data Handling | pandas, numpy |
| ML Algorithms | scikit-learn, XGBoost, LightGBM, CatBoost |
| Imbalance Handling | imbalanced-learn (SMOTE) |
| Visualization / UI | Streamlit |
| Environment | Python 3.10+ |
## Disclaimer

This project is intended for educational and research purposes only. It should not be used for clinical decision-making or as a substitute for professional medical evaluation.
## License
MIT License
## Dataset Source
Kaggle – Healthcare Stroke Dataset
## Author
**Boluwatife**
AI Engineer | Machine Learning Researcher | Data Scientist
Email: adeyemiboluwatife.olayinka@gmail.com
GitHub Profile