A machine learning project that uses logistic regression to predict heart disease from clinical and demographic features.
- Project Overview
- Dataset Information
- Features
- Technologies Used
- Model Performance
- Project Structure
- Installation & Setup
- Usage
- Model Training Process
- Results & Analysis
- Contributing
- License
- Author
Heart disease is one of the leading causes of death worldwide, making early detection and prediction crucial for preventive healthcare. This project implements a machine learning solution using Logistic Regression to predict the likelihood of heart disease in patients based on various clinical and demographic parameters.
- Predictive Healthcare: Early detection system for heart disease risk assessment
- High Accuracy: Achieves 85%+ accuracy on test data
- Real-world Application: Based on authentic medical dataset with 303 patient records
- Interpretable Model: Logistic regression provides clear feature importance and probability scores
- Complete Pipeline: From data preprocessing to model deployment
- Develop a reliable machine learning model for heart disease prediction
- Analyze the relationship between various medical factors and heart disease
- Create an interpretable model that medical professionals can understand
- Provide probability scores for risk assessment rather than just binary classification
- Demonstrate the complete ML workflow from data exploration to model evaluation
Attribute | Value |
---|---|
Total Records | 303 patients |
Features | 13 clinical/demographic variables |
Target Classes | 2 (Heart Disease: Yes/No) |
Missing Values | None (Complete dataset) |
File Format | CSV |
Data Quality | High-quality medical data |
- Heart Disease Present (1): 165 patients (54.5%)
- Heart Disease Absent (0): 138 patients (45.5%)
- Balance: Well-balanced dataset suitable for binary classification
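The class counts above also give a reference point for the accuracy figures: a model that always predicts the majority class would already be right a little over half the time. A minimal sketch of that arithmetic, using only the counts listed above (the variable names are illustrative, not taken from the notebook):

```python
# Class counts taken from the distribution listed above.
class_counts = {1: 165, 0: 138}  # 1 = heart disease present, 0 = absent

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_label]

print(f"Total records: {total}")
for label, share in shares.items():
    print(f"Class {label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any reported test accuracy is best read as an improvement over this baseline rather than over 50%.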
Feature | Type | Description | Clinical Significance |
---|---|---|---|
age | Numerical | Patient age (29-77 years) | Risk increases with age |
sex | Binary | Gender (1=Male, 0=Female) | Different risk patterns by gender |
cp | Categorical | Chest pain type (0-3) | Key symptom indicator |
trestbps | Numerical | Resting blood pressure (mm Hg) | Major cardiovascular risk factor |
chol | Numerical | Serum cholesterol (mg/dl) | High levels increase risk |
fbs | Binary | Fasting blood sugar >120 mg/dl | Diabetes indicator |
restecg | Categorical | Resting ECG results (0-2) | Heart electrical activity |
thalach | Numerical | Maximum heart rate achieved | Exercise capacity indicator |
exang | Binary | Exercise induced angina | Stress test result |
oldpeak | Numerical | ST depression induced by exercise | ECG stress test measure |
slope | Categorical | Slope of peak exercise ST segment | Exercise test parameter |
ca | Numerical | Number of major vessels colored (0-3) | Coronary angiography result |
thal | Categorical | Thalassemia type (1-3) | Blood disorder indicator |
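The coded features in the table can be checked before a record is passed to the model. The sketch below assumes a record is a plain dict keyed by feature name; `validate_record` and `ALLOWED_CODES` are hypothetical helpers, and the allowed codes are copied from the table, so the CSV itself may contain values outside them.

```python
# Feature order as listed in the table above.
FEATURE_ORDER = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# Allowed codes per the table. Binary features are assumed to be coded 0/1.
# Features without a listed code range (e.g. slope) are only checked for presence.
ALLOWED_CODES = {
    "sex": {0, 1},
    "cp": {0, 1, 2, 3},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "ca": {0, 1, 2, 3},
    "thal": {1, 2, 3},
}

def validate_record(record):
    """Return a list of problems found in one patient record (a dict)."""
    problems = []
    for name in FEATURE_ORDER:
        if name not in record:
            problems.append(f"missing feature: {name}")
    for name, allowed in ALLOWED_CODES.items():
        if name in record and record[name] not in allowed:
            problems.append(f"{name}={record[name]} is not one of {sorted(allowed)}")
    return problems

example = dict(zip(FEATURE_ORDER, (62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2)))
print(validate_record(example))
print(validate_record({**example, "cp": 7}))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a record at once.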
- Data Exploration: Comprehensive analysis of heart disease dataset
- Data Preprocessing: Handling of categorical variables and feature scaling
- Model Training: Logistic regression implementation with scikit-learn
- Model Evaluation: Multiple metrics including accuracy, precision, recall
- Prediction System: Real-time heart disease risk assessment
- Visualization: Data distribution and model performance plots
- Binary Classification: Predicts presence/absence of heart disease
- Probability Estimation: Provides confidence scores for predictions
- Feature Importance: Identifies most significant risk factors
- Cross-validation: Robust model validation techniques
- Interpretability: Clear understanding of model decisions
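Probability estimation in logistic regression comes from passing a weighted sum of the features through the sigmoid function, then applying a threshold to get the class. The sketch below illustrates that mechanism in plain Python; the weights, intercept, and two-feature input are made up for illustration and are not the coefficients learned in the notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class for one feature vector."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for two made-up, already-scaled features.
# They are NOT the coefficients learned in the notebook.
weights = [0.8, -1.2]
intercept = 0.1
features = [1.5, 0.5]

probability = predict_probability(features, weights, intercept)
label = 1 if probability >= 0.5 else 0  # 0.5 is the usual default threshold
print(f"P(disease) = {probability:.3f}, predicted class = {label}")
```

Because the sigmoid is monotonic, a positive weight always pushes the probability up as its feature grows, which is what makes the model's decisions easy to read.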
Technology | Purpose | Version |
---|---|---|
Python | Primary programming language | 3.7+ |
Pandas | Data manipulation and analysis | 1.2.0+ |
NumPy | Numerical computing | 1.19.0+ |
Scikit-learn | Machine learning library | 0.24.0+ |
Matplotlib | Data visualization | 3.3.0+ |
Seaborn | Statistical visualization | 0.11.0+ |
Jupyter Notebook | Interactive development | 1.0.0+ |
numpy>=1.19.0
pandas>=1.2.0
scikit-learn>=0.24.0
matplotlib>=3.3.0
seaborn>=0.11.0
jupyter>=1.0.0
Metric | Training Data | Testing Data |
---|---|---|
Accuracy | ~85-90% | ~85%+ |
Precision | High | Consistent |
Recall | High | Consistent |
F1-Score | Balanced | Stable |
- High Accuracy: Consistent 85%+ accuracy across train/test splits
- Balanced Performance: Good precision-recall balance
- Generalization: Minimal overfitting, stable test performance
- Reproducibility: Consistent results with random state control
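The metrics in the table are all derived from the four confusion-matrix counts. As a rough sketch of how they relate, the function below computes them from those counts; the counts in the example are hypothetical round numbers chosen for illustration, not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only (not measured on this dataset).
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening setting, recall is often the number to watch, since a false negative means a patient at risk goes undetected. With scikit-learn, `sklearn.metrics.classification_report` reports the same quantities per class.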
Logistic-Heart-Disease-Prediction/
├── heart_disease_data.csv                     # Dataset file
├── Logistic-Heart-Disease-Prediction.ipynb    # Main notebook
├── README.md                                  # This file
├── LICENSE                                    # License information
└── docs/                                      # Documentation
    ├── dataset.md                             # Dataset documentation
    └── logistic-regression-model.md           # Model documentation
- heart_disease_data.csv: Complete dataset with 303 patient records
- Logistic-Heart-Disease-Prediction.ipynb: Main Jupyter notebook with complete analysis
- docs/dataset.md: Detailed dataset documentation and feature descriptions
- docs/logistic-regression-model.md: Comprehensive model documentation
- README.md: Project overview and instructions
- LICENSE: License terms and conditions
- Python 3.7+ installed on your system
- Git for version control
- Jupyter Notebook or JupyterLab
- Code Editor (VS Code, PyCharm, etc.)
- Clone the Repository

      git clone https://github.com/NhanPhamThanh-IT/Logistic-Heart-Disease-Prediction.git
      cd Logistic-Heart-Disease-Prediction

- Create Virtual Environment (Recommended)

      # Windows
      python -m venv venv
      venv\Scripts\activate

      # macOS/Linux
      python3 -m venv venv
      source venv/bin/activate

- Install Dependencies

      pip install numpy pandas scikit-learn matplotlib seaborn jupyter

- Launch Jupyter Notebook

      jupyter notebook

- Open the Main Notebook
  - Navigate to Logistic-Heart-Disease-Prediction.ipynb
  - Run all cells to see the complete analysis
- Open the Jupyter Notebook

      jupyter notebook Logistic-Heart-Disease-Prediction.ipynb

- Run the Complete Analysis
  - Execute all cells in sequence
  - Observe data exploration, model training, and evaluation

- Make Predictions

      # Example prediction for a new patient
      input_data = (62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2)
      prediction = model.predict([input_data])

      if prediction[0] == 0:
          print('No Heart Disease Detected')
      else:
          print('Heart Disease Risk Detected')
To predict heart disease for a new patient, provide the following 13 features in order:
# Feature order: age, sex, cp, trestbps, chol, fbs, restecg,
# thalach, exang, oldpeak, slope, ca, thal
# Example: 62-year-old female with specific medical parameters
patient_data = (62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2)
risk_prediction = model.predict([patient_data])
confidence_score = model.predict_proba([patient_data])
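The value returned by `predict_proba` is one row per patient with one probability per class, ordered as in `model.classes_`. A small sketch of turning such a row into a readable message follows; `describe_risk` and its `threshold` parameter are hypothetical helpers, and the example row is a made-up stand-in for real model output.

```python
def describe_risk(proba_row, classes=(0, 1), threshold=0.5):
    """Turn one row of predict_proba output into a readable summary.

    proba_row holds one probability per class, in the order given by
    `classes` (scikit-learn exposes that order as `model.classes_`).
    """
    p_disease = proba_row[list(classes).index(1)]
    verdict = ("Heart Disease Risk Detected" if p_disease >= threshold
               else "No Heart Disease Detected")
    return f"{verdict} (estimated probability {p_disease:.0%})"

# Stand-in for model.predict_proba([patient_data])[0]; the numbers are
# made up for illustration, not produced by the trained model.
example_row = [0.27, 0.73]
print(describe_risk(example_row))
```

Looking up the positive class by label rather than by position keeps the helper correct even if the class order ever differs from `(0, 1)`.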
- Data Loading
  - Load heart disease dataset from CSV
  - Initial data inspection and validation
- Exploratory Data Analysis
  - Statistical summary of features
  - Target variable distribution analysis
  - Missing value assessment
- Data Preprocessing
  - Feature-target separation
  - Train-test split (80-20 ratio)
  - Stratified sampling for balanced splits
- Model Training
  - Logistic Regression initialization
  - Model fitting on training data
  - Parameter optimization
- Model Evaluation
  - Training accuracy assessment
  - Testing accuracy evaluation
  - Performance metric calculation
- Prediction System
  - Individual patient prediction
  - Probability score generation
  - Risk assessment interpretation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Model Parameters
model = LogisticRegression(
    random_state=42,     # Reproducibility
    max_iter=1000,       # Generous iteration cap to help convergence
    solver='liblinear'   # Well suited to small datasets
)

# Train-Test Split Configuration
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,       # 20% for testing
    stratify=Y,          # Maintain class balance
    random_state=2       # Reproducible splits
)
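Stratifying the split means each class is divided separately, so the training and test sets keep roughly the same class shares as the full dataset. The sketch below shows the idea in plain Python; `stratified_split` is a simplified, hypothetical stand-in, and scikit-learn's own shuffling and per-class rounding differ in detail.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with class shares preserved in both parts."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Shuffle and split each class on its own, then merge the pieces.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Labels with the same class counts as the dataset (165 present, 138 absent).
labels = [1] * 165 + [0] * 138
train_idx, test_idx = stratified_split(labels)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), f"{test_positive_share:.1%}")
```

Without stratification, a small test set could by chance over- or under-represent one class, which would make the test accuracy harder to compare with the training accuracy.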
- Training Accuracy: ~85-90% (Indicates good learning)
- Testing Accuracy: ~85%+ (Demonstrates generalization)
- Balanced Performance: No significant overfitting
- Consistent Results: Stable across different random states
- Age Factor: Strong correlation with heart disease risk
- Gender Differences: Distinct risk patterns between males and females
- Chest Pain: Critical symptom for disease prediction
- Clinical Markers: Blood pressure and cholesterol are significant predictors
- Exercise Metrics: Heart rate and exercise capacity provide valuable insights
The logistic regression model identifies the most significant factors:
- Chest pain type (cp)
- Maximum heart rate achieved (thalach)
- Exercise induced angina (exang)
- ST depression (oldpeak)
- Number of major vessels (ca)
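One common way to read feature significance from a logistic regression is to exponentiate each coefficient, which gives the multiplicative change in the odds of the positive class per one-unit increase in that feature. The sketch below assumes hypothetical coefficient values purely for illustration; the real ones would come from `model.coef_[0]` after fitting and will differ.

```python
import math

# Hypothetical coefficients for illustration only; the real values come
# from `model.coef_[0]` after fitting and will differ.
coefficients = {
    "cp": 0.80,
    "thalach": 0.02,
    "exang": -0.90,
    "oldpeak": -0.55,
    "ca": -0.75,
}

# exp(coefficient) is the multiplicative change in the odds of the
# positive class for a one-unit increase in that feature.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

# Rank by absolute coefficient; this is only a fair comparison when the
# features are on comparable scales.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
for name in ranked:
    print(f"{name:8s} coef={coefficients[name]:+.2f}  odds ratio={odds_ratios[name]:.2f}")
```

An odds ratio above 1 raises the odds of the positive class and one below 1 lowers them. Features measured in large units, such as `thalach` in beats per minute, tend to have small per-unit coefficients, so standardizing the inputs first makes the ranking more meaningful.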
We welcome contributions to improve this heart disease prediction project!
- Fork the Repository

  Use the Fork button on the repository's GitHub page, or the GitHub CLI:

      gh repo fork NhanPhamThanh-IT/Logistic-Heart-Disease-Prediction --clone

- Create Feature Branch

      git checkout -b feature/amazing-feature

- Commit Changes

      git commit -m "Add amazing feature"

- Push to Branch

      git push origin feature/amazing-feature

- Open Pull Request
- Model Improvements: Advanced algorithms, hyperparameter tuning
- Visualization: Enhanced plots and charts
- Documentation: Improve explanations and examples
- Testing: Add unit tests and validation
- Web Interface: Create user-friendly web application
- Mobile App: Develop mobile prediction interface
This project is licensed under the MIT License - see the LICENSE file for details.
- Commercial Use: Allowed
- Modification: Allowed
- Distribution: Allowed
- Private Use: Allowed
- Liability: Limited
- Warranty: None
Nhan Pham Thanh
- GitHub: @NhanPhamThanh-IT
- Email: Contact
- LinkedIn: Connect
- Portfolio: Visit