This repository was archived by the owner on Jul 25, 2025. It is now read-only.

🫀 A machine learning project using logistic regression to predict heart disease risk from clinical data. Built with Python, scikit-learn, and Jupyter notebooks. Achieves 85%+ accuracy on a 303-patient dataset with 13 medical features. Complete ML pipeline from data exploration to model evaluation.

NhanPhamThanh-IT/Logistic-Regression-Heart-Disease-Prediction

🫀 Logistic Heart Disease Prediction

A comprehensive machine learning project using logistic regression to predict heart disease based on clinical and demographic features

🎯 Project Overview

Heart disease is one of the leading causes of death worldwide, making early detection and prediction crucial for preventive healthcare. This project implements a machine learning solution using Logistic Regression to predict the likelihood of heart disease in patients based on various clinical and demographic parameters.

🌟 Key Highlights

  • Predictive Healthcare: Early detection system for heart disease risk assessment
  • High Accuracy: Achieves 85%+ accuracy on test data
  • Real-world Application: Based on authentic medical dataset with 303 patient records
  • Interpretable Model: Logistic regression provides clear feature importance and probability scores
  • Complete Pipeline: From data exploration to model evaluation

🎯 Objectives

  1. Develop a reliable machine learning model for heart disease prediction
  2. Analyze the relationship between various medical factors and heart disease
  3. Create an interpretable model that medical professionals can understand
  4. Provide probability scores for risk assessment rather than just binary classification
  5. Demonstrate the complete ML workflow from data exploration to model evaluation

🔬 Dataset Information

📊 Dataset Overview

| Attribute | Value |
|---|---|
| Total Records | 303 patients |
| Features | 13 clinical/demographic variables |
| Target Classes | 2 (Heart Disease: Yes/No) |
| Missing Values | None (Complete dataset) |
| File Format | CSV |
| Data Quality | High-quality medical data |

🏥 Target Distribution

  • Heart Disease Present (1): 165 patients (54.5%)
  • Heart Disease Absent (0): 138 patients (45.5%)
  • Balance: Well-balanced dataset suitable for binary classification
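
The balance claim can be checked with simple arithmetic. The sketch below uses only the two class counts quoted above (it does not read the CSV) and also derives the accuracy a trivial majority-class predictor would reach, which is a useful floor when reading the accuracy figures later in this README.

```python
# Class counts as reported in this README (not recomputed from the CSV here).
counts = {"present (1)": 165, "absent (0)": 138}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {counts[label]} patients ({share:.1%})")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any trained model should clearly beat the majority-class baseline before its accuracy is considered meaningful.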

📋 Feature Descriptions

| Feature | Type | Description | Clinical Significance |
|---|---|---|---|
| age | Numerical | Patient age (29-77 years) | Risk increases with age |
| sex | Binary | Gender (1=Male, 0=Female) | Different risk patterns by gender |
| cp | Categorical | Chest pain type (0-3) | Key symptom indicator |
| trestbps | Numerical | Resting blood pressure (mm Hg) | Major cardiovascular risk factor |
| chol | Numerical | Serum cholesterol (mg/dl) | High levels increase risk |
| fbs | Binary | Fasting blood sugar >120 mg/dl | Diabetes indicator |
| restecg | Categorical | Resting ECG results (0-2) | Heart electrical activity |
| thalach | Numerical | Maximum heart rate achieved | Exercise capacity indicator |
| exang | Binary | Exercise induced angina | Stress test result |
| oldpeak | Numerical | ST depression induced by exercise | ECG stress test measure |
| slope | Categorical | Slope of peak exercise ST segment | Exercise test parameter |
| ca | Numerical | Number of major vessels colored (0-3) | Coronary angiography result |
| thal | Categorical | Thalassemia type (1-3) | Blood disorder indicator |

🚀 Features

✨ Core Functionality

  • 🔍 Data Exploration: Comprehensive analysis of heart disease dataset
  • 🧹 Data Preprocessing: Handling of categorical variables and feature scaling
  • 🤖 Model Training: Logistic regression implementation with scikit-learn
  • 📊 Model Evaluation: Multiple metrics including accuracy, precision, recall
  • 🎯 Prediction System: Real-time heart disease risk assessment
  • 📈 Visualization: Data distribution and model performance plots
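
The preprocessing bullet mentions categorical handling and feature scaling without showing how. One possible way to do both with scikit-learn is sketched below; it is an illustration, not necessarily what the notebook does. The split of columns into numeric, categorical and binary groups follows the feature table in this README, and the three demo rows are made up.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column positions follow the feature order used in this README:
# age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal
NUMERIC = [0, 3, 4, 7, 9, 11]   # age, trestbps, chol, thalach, oldpeak, ca
CATEGORICAL = [2, 6, 10, 12]    # cp, restecg, slope, thal
# sex, fbs and exang (columns 1, 5, 8) are already 0/1 and pass through unchanged.

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Three made-up rows, for illustration only.
X_demo = np.array([
    [62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2],
    [45, 1, 2, 120, 210, 0, 1, 172, 0, 0.0, 2, 0, 2],
    [57, 1, 1, 130, 236, 1, 0, 150, 1, 1.2, 1, 1, 3],
])

X_ready = preprocess.fit_transform(X_demo)
print(X_ready.shape)  # scaled numeric columns, then one-hot columns, then the binary ones
```

In practice the transformer would be fitted on the training split only and then applied to the test split, so that no test-set statistics leak into the scaling.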

🛡️ Model Capabilities

  • Binary Classification: Predicts presence/absence of heart disease
  • Probability Estimation: Provides confidence scores for predictions
  • Feature Importance: Identifies most significant risk factors
  • Cross-validation: Robust model validation techniques
  • Interpretability: Clear understanding of model decisions
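
The probability and interpretability points both come from the form of the model: logistic regression passes a weighted sum of the features through the sigmoid function and thresholds the result. A minimal sketch of that computation follows; the weights, bias and feature values are hypothetical and are not taken from the trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """P(class 1) = sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical values for two standardized features (illustration only).
weights = [0.8, -1.1]
bias = 0.2
features = [1.0, 0.5]

p = predict_probability(features, weights, bias)  # z = 0.2 + 0.8 - 0.55 = 0.45
label = int(p >= 0.5)                             # default decision threshold
print(f"P(disease) = {p:.3f}, class = {label}")
```

Because each weight multiplies exactly one feature, its sign and size can be read directly, which is what makes the model easy to explain.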

🛠️ Technologies Used

🐍 Core Technologies

| Technology | Purpose | Version |
|---|---|---|
| Python | Primary programming language | 3.7+ |
| Pandas | Data manipulation and analysis | Latest |
| NumPy | Numerical computing | Latest |
| Scikit-learn | Machine learning library | Latest |
| Matplotlib | Data visualization | Latest |
| Seaborn | Statistical visualization | Latest |
| Jupyter Notebook | Interactive development | Latest |

📦 Dependencies

numpy>=1.19.0
pandas>=1.2.0
scikit-learn>=0.24.0
matplotlib>=3.3.0
seaborn>=0.11.0
jupyter>=1.0.0

📊 Model Performance

🎯 Performance Metrics

| Metric | Training Data | Testing Data |
|---|---|---|
| Accuracy | ~85-90% | ~85%+ |
| Precision | High | Consistent |
| Recall | High | Consistent |
| F1-Score | Balanced | Stable |

📈 Key Performance Indicators

  • ✅ High Accuracy: Consistent 85%+ accuracy across train/test splits
  • ⚖️ Balanced Performance: Good precision-recall balance
  • 🎯 Generalization: Minimal overfitting, stable test performance
  • 🔄 Reproducibility: Consistent results with random state control
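
For readers who want to see how accuracy, precision, recall and F1 relate to one another, the sketch below derives all four from confusion-matrix counts. The counts are round, made-up numbers chosen for easy arithmetic; they are not this project's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,  # of predicted positives, how many are real
        "recall": recall,        # of real positives, how many were found
        "f1": f1,                # harmonic mean of precision and recall
    }


# Illustrative counts only (100 hypothetical patients), NOT measured results.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening setting, recall usually deserves the most attention, because a false negative means a patient with heart disease is missed.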

🏗️ Project Structure

Logistic-Heart-Disease-Prediction/
├── 📊 heart_disease_data.csv           # Dataset file
├── 📓 Logistic-Heart-Disease-Prediction.ipynb  # Main notebook
├── 📖 README.md                        # This file
├── 📄 LICENSE                          # License information
└── 📁 docs/                           # Documentation
    ├── 📋 dataset.md                   # Dataset documentation
    └── 🤖 logistic-regression-model.md # Model documentation

📁 File Descriptions

  • heart_disease_data.csv: Complete dataset with 303 patient records
  • Logistic-Heart-Disease-Prediction.ipynb: Main Jupyter notebook with complete analysis
  • docs/dataset.md: Detailed dataset documentation and feature descriptions
  • docs/logistic-regression-model.md: Comprehensive model documentation
  • README.md: Project overview and instructions
  • LICENSE: License terms and conditions

⚙️ Installation & Setup

🔧 Prerequisites

  • Python 3.7+ installed on your system
  • Git for version control
  • Jupyter Notebook or JupyterLab
  • Code Editor (VS Code, PyCharm, etc.)

📥 Quick Setup

  1. Clone the Repository

    git clone https://github.com/NhanPhamThanh-IT/Logistic-Heart-Disease-Prediction.git
    cd Logistic-Heart-Disease-Prediction
  2. Create Virtual Environment (Recommended)

    # Windows
    python -m venv venv
    venv\Scripts\activate
    
    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install Dependencies

    pip install numpy pandas scikit-learn matplotlib seaborn jupyter
  4. Launch Jupyter Notebook

    jupyter notebook
  5. Open the Main Notebook

    • Navigate to Logistic-Heart-Disease-Prediction.ipynb
    • Run all cells to see the complete analysis

📖 Usage

🚀 Quick Start

  1. Open the Jupyter Notebook

    jupyter notebook Logistic-Heart-Disease-Prediction.ipynb
  2. Run the Complete Analysis

    • Execute all cells in sequence
    • Observe data exploration, model training, and evaluation
  3. Make Predictions

    # Example prediction for a new patient
    input_data = (62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2)
    prediction = model.predict([input_data])
    
    if prediction[0] == 0:
        print('No Heart Disease Detected')
    else:
        print('Heart Disease Risk Detected')

🎯 Making Custom Predictions

To predict heart disease for a new patient, provide the following 13 features in order:

# Feature order: age, sex, cp, trestbps, chol, fbs, restecg,
#                thalach, exang, oldpeak, slope, ca, thal

# Example: 62-year-old female with specific medical parameters
patient_data = (62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2)
risk_prediction = model.predict([patient_data])         # array holding the predicted class (0 or 1)
confidence_score = model.predict_proba([patient_data])  # shape (1, 2): [[P(class 0), P(class 1)]]
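
Passing 13 positional values is easy to get wrong. The sketch below wraps the call in a small helper that takes named features, enforces the order listed above, and returns the class-1 probability together with a thresholded flag. The helper `assess_patient` and the `ConstantModel` stand-in are hypothetical names introduced here so the example runs on its own; with the real notebook, the fitted `model` would be passed instead.

```python
FEATURE_ORDER = (
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
)


def assess_patient(model, patient: dict, threshold: float = 0.5) -> dict:
    """Return P(heart disease) and an at-risk flag for one patient."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise ValueError(f"missing features: {missing}")
    row = [patient[name] for name in FEATURE_ORDER]
    # predict_proba returns one row per sample; with classes 0 and 1,
    # index 1 is the probability of class 1 (disease present).
    p_disease = float(model.predict_proba([row])[0][1])
    return {"probability": p_disease, "at_risk": p_disease >= threshold}


class ConstantModel:
    """Stand-in for a fitted classifier, used only to make this sketch runnable."""

    def predict_proba(self, rows):
        return [[0.3, 0.7] for _ in rows]


patient = dict(zip(FEATURE_ORDER, (62, 0, 0, 140, 268, 0, 0, 160, 0, 3.6, 0, 2, 2)))
result = assess_patient(ConstantModel(), patient)
print(result)
```

Lowering `threshold` trades more false alarms for fewer missed cases, which may be preferable for screening.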

🧪 Model Training Process

📊 Step-by-Step Workflow

  1. 📥 Data Loading

    • Load heart disease dataset from CSV
    • Initial data inspection and validation
  2. 🔍 Exploratory Data Analysis

    • Statistical summary of features
    • Target variable distribution analysis
    • Missing value assessment
  3. ⚙️ Data Preprocessing

    • Feature-target separation
    • Train-test split (80-20 ratio)
    • Stratified sampling for balanced splits
  4. 🤖 Model Training

    • Logistic Regression initialization
    • Model fitting on training data
    • Parameter optimization
  5. 📈 Model Evaluation

    • Training accuracy assessment
    • Testing accuracy evaluation
    • Performance metric calculation
  6. 🎯 Prediction System

    • Individual patient prediction
    • Probability score generation
    • Risk assessment interpretation
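
The six steps above can be sketched end to end in a few lines. To stay self-contained, this example substitutes a synthetic dataset of the same shape (303 rows, 13 features) generated with `make_classification` for the real CSV, so the accuracies it prints say nothing about the actual project; only the sequence of calls is the point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1-2: synthetic stand-in for the dataset (assumption: same shape only).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Step 3: stratified 80-20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Step 4: fit the model on the training split only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: compare training and testing accuracy to spot overfitting.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Step 6: probability of class 1 for a single held-out sample.
proba = model.predict_proba(X_test[:1])[0, 1]
print(f"P(class 1) for first test sample: {proba:.3f}")
```

A large gap between the two accuracies would suggest overfitting; similar values are what the generalization claim in this README relies on.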

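🎛️ Model Configuration

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Model Parameters
model = LogisticRegression(
    random_state=42,        # Reproducibility
    max_iter=1000,         # Generous iteration cap so the solver can converge
    solver='liblinear'     # Well suited to small datasets
)

# Train-Test Split Configuration
# X holds the 13 feature columns and Y the binary target (see Data Preprocessing)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,         # 20% for testing
    stratify=Y,            # Maintain class balance
    random_state=2         # Reproducible splits
)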
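
A single train-test split on 303 records leaves only about 61 test samples, so the measured accuracy can shift noticeably with the split. Stratified k-fold cross-validation gives a steadier estimate. The sketch below reuses the model parameters shown above but runs on synthetic data of the same shape, so its scores are illustrative only; the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real feature matrix and target (assumption).
X, Y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

model = LogisticRegression(random_state=42, max_iter=1000, solver="liblinear")

# 5 folds, each preserving the class ratio; shuffling is seeded for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, Y, cv=cv, scoring="accuracy")

print("fold accuracies:", [round(float(s), 3) for s in scores])
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes it clear how much of any accuracy figure is due to the particular split.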

📈 Results & Analysis

🎯 Model Performance Summary

  • ✅ Training Accuracy: ~85-90% (Indicates good learning)
  • ✅ Testing Accuracy: ~85%+ (Demonstrates generalization)
  • ⚖️ Balanced Performance: No significant overfitting
  • 🎯 Consistent Results: Stable across different random states

📊 Key Insights

  1. 🫀 Age Factor: Strong correlation with heart disease risk
  2. 👥 Gender Differences: Distinct risk patterns between males and females
  3. 💓 Chest Pain: Critical symptom for disease prediction
  4. 🩺 Clinical Markers: Blood pressure and cholesterol are significant predictors
  5. ⚡ Exercise Metrics: Heart rate and exercise capacity provide valuable insights

🔬 Feature Importance

The logistic regression model identifies the most significant factors:

  • Chest pain type (cp)
  • Maximum heart rate achieved (thalach)
  • Exercise induced angina (exang)
  • ST depression (oldpeak)
  • Number of major vessels (ca)
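
In logistic regression, a ranking like this is usually read from the fitted coefficients: the absolute value gives the strength of a feature's effect and `exp(coefficient)` gives the odds ratio per unit increase. The sketch below shows that calculation. The coefficient values are placeholders invented for illustration, not the trained model's weights; with a fitted scikit-learn model they would come from `model.coef_[0]`. Comparing magnitudes is only meaningful when the features are on a common scale.

```python
import math


def rank_features(coefficients: dict) -> list:
    """Sort features by absolute coefficient and attach odds ratios."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        {"feature": name, "coefficient": coef, "odds_ratio": math.exp(coef)}
        for name, coef in ranked
    ]


# Placeholder coefficients (illustration only, NOT taken from the notebook).
placeholder_coefficients = {
    "cp": 0.9,
    "thalach": 0.4,
    "exang": -0.7,
    "oldpeak": -0.6,
    "ca": -0.8,
}

for row in rank_features(placeholder_coefficients):
    direction = "raises" if row["coefficient"] > 0 else "lowers"
    print(
        f"{row['feature']:>8}: coef={row['coefficient']:+.2f}, "
        f"odds ratio={row['odds_ratio']:.2f} ({direction} predicted risk)"
    )
```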

🀝 Contributing

We welcome contributions to improve this heart disease prediction project!

🚀 How to Contribute

  1. 🍴 Fork the Repository

    # git has no "fork" subcommand: fork on the GitHub website, or use the GitHub CLI (gh) if it is installed
    gh repo fork https://github.com/NhanPhamThanh-IT/Logistic-Heart-Disease-Prediction.git --clone
  2. 🌿 Create Feature Branch

    git checkout -b feature/amazing-feature
  3. 💾 Commit Changes

    git commit -m "Add amazing feature"
  4. 📤 Push to Branch

    git push origin feature/amazing-feature
  5. 🔄 Open Pull Request

🎯 Contribution Areas

  • 🔬 Model Improvements: Advanced algorithms, hyperparameter tuning
  • 📊 Visualization: Enhanced plots and charts
  • 📖 Documentation: Improve explanations and examples
  • 🧪 Testing: Add unit tests and validation
  • 🌐 Web Interface: Create user-friendly web application
  • 📱 Mobile App: Develop mobile prediction interface

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📋 License Summary

  • ✅ Commercial Use: Allowed
  • ✅ Modification: Allowed
  • ✅ Distribution: Allowed
  • ✅ Private Use: Allowed
  • ❌ Liability: Limited
  • ❌ Warranty: None

👨‍💻 Author

Nhan Pham Thanh


🌟 If you found this project helpful, please give it a star! ⭐

Made with ❤️ for advancing healthcare through machine learning

"Predicting today for a healthier tomorrow"
