This repository contains a heart disease prediction project. The dataset, collected from heart patients, includes health metrics such as age, blood pressure, and heart rate. The objective is to build a predictive model that identifies individuals at risk of heart disease, with priority given to high recall so that as few true cases as possible are missed.
In this project, we work with a dataset of health metrics from heart patients, including age, blood pressure, and heart rate. Our goal is to develop a predictive model that accurately identifies individuals with heart disease. Because a missed positive diagnosis has serious consequences, recall for the positive class is the primary metric: the model should flag as many true patients as possible.
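To make the metric concrete: recall for the positive class is TP / (TP + FN), the share of actual patients that the model flags. The sketch below computes it by hand; the `recall` helper and the label lists are illustrative assumptions, not code or data from the notebook.

```python
def recall(true_labels, predicted_labels, positive=1):
    """Fraction of actual positives that were predicted positive: TP / (TP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("true_labels contains no positive samples")
    return tp / (tp + fn)


# Made-up labels for illustration only: 4 actual patients, 1 of them missed.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

patient_recall = recall(y_true, y_pred)
print(patient_recall)
```

Note that the false positive in the last position does not affect recall; it lowers precision instead, which is why the two metrics are reported together.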
The objectives of the project are as follows:
- Data Understanding: Familiarize ourselves with the dataset and its features.
- Exploratory Data Analysis (EDA): Unveil patterns, trends, and relationships between different variables.
  - Univariate Analysis
  - Bivariate Analysis
- Data Preprocessing: Prepare the data for future machine learning tasks.
  - Remove irrelevant features
  - Address missing values
  - Treat outliers
  - Encode categorical variables
  - Transform skewed features to achieve normal-like distributions
- Model Building: Develop and refine the prediction models.
  - Establish pipelines for models that require scaling
  - Implement and tune classification models including KNN, SVM, Decision Tree, and Random Forest
  - Emphasize achieving high recall for class 1, ensuring comprehensive identification of heart patients
- Evaluate and Compare Model Performance: Utilize precision, recall, and F1-score to gauge models' effectiveness.
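
The model-building and evaluation steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn, uses synthetic data from `make_classification` as a stand-in for `heart.csv`, and the parameter grid is a hypothetical example. It shows a scaling pipeline for an SVM, tuning with recall for class 1 as the selection score, and a precision/recall/F1 report on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption: 13 numeric features, binary target).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so it is fit on the training folds only.
svm_pipeline = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Hypothetical, deliberately small grid; the notebook's grid may differ.
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}

# scoring="recall" selects the configuration with the best recall for class 1.
search = GridSearchCV(
    svm_pipeline,
    param_grid,
    scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
test_recall = recall_score(y_test, y_pred, pos_label=1)

print(search.best_params_)
print(classification_report(y_test, y_pred, digits=3))
```

Selecting on recall alone can favor models that over-predict the positive class, so precision and F1-score in the report remain worth checking before comparing models.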
The dataset comprises various metrics related to heart health. The features of the dataset are described in the table below:
Variable Name | Description |
---|---|
age | Age of the patient in years |
sex | Gender of the patient (0 = male, 1 = female) |
cp | Chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic) |
trestbps | Resting blood pressure in mm Hg |
chol | Serum cholesterol in mg/dl |
fbs | Fasting blood sugar level, categorized as above 120 mg/dl (1 = true, 0 = false) |
restecg | Resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy) |
thalach | Maximum heart rate achieved during a stress test |
exang | Exercise-induced angina (1 = yes, 0 = no) |
oldpeak | ST depression induced by exercise relative to rest |
slope | Slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping) |
ca | Number of major vessels (0-4) colored by fluoroscopy |
thal | Thallium stress test result (0 = normal, 1 = fixed defect, 2 = reversible defect, 3 = not described) |
target | Heart disease status (0 = no disease, 1 = presence of disease) |
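
The coded columns in the table can be checked for unexpected values before preprocessing. The sketch below is one possible approach, assuming pandas; `ALLOWED_CODES` and `find_unexpected_codes` are hypothetical names introduced for illustration, and the small frame is made-up data rather than rows from `heart.csv`.

```python
import pandas as pd

# Allowed codes as documented in the feature table.
ALLOWED_CODES = {
    "sex": {0, 1},
    "cp": {0, 1, 2, 3},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {0, 1, 2},
    "ca": {0, 1, 2, 3, 4},
    "thal": {0, 1, 2, 3},
    "target": {0, 1},
}


def find_unexpected_codes(df):
    """Return {column: sorted unexpected values} for coded columns present in df."""
    problems = {}
    for column, allowed in ALLOWED_CODES.items():
        if column not in df.columns:
            continue
        observed = {int(value) for value in df[column].dropna().unique()}
        unexpected = observed - allowed
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems


# Tiny made-up frame for illustration; with the real data, load it via
# pd.read_csv("heart.csv") instead.
sample = pd.DataFrame({"sex": [0, 1], "cp": [3, 7], "target": [1, 0]})
issues = find_unexpected_codes(sample)
print(issues)
```

An empty result means every coded column present in the frame stays within its documented range.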
You can find the dataset here.
- `Heart Disease Prediction.ipynb`: Jupyter notebook containing all the data exploration, visualization, modeling, and evaluation code.
- `heart.csv`: CSV file containing the heart disease data.
- `README.md`: This file, providing an overview of the project.
- Clone this repository.
- Open the `Heart Disease Prediction.ipynb` notebook in Jupyter.
- Run all cells in the notebook.
For those interested in exploring this notebook in a Kaggle environment, you can access it here.