This repository contains a data analysis and machine learning project that predicts diabetes in Pima Indian women from medical attributes. The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases and is available on Kaggle.
The project involves the following steps:
- Data Exploration: Statistical analysis and visualization of the dataset to understand the distribution of features and identify any necessary preprocessing steps.
- Correlation Analysis: Computing Pearson Correlation Coefficients (PCC) and generating scatter plots to analyze relationships between features and the target variable.
- Model Training: Training various classifiers, including Multinomial Logistic Regression, Support Vector Machines, and Random Forest, while tuning hyperparameters to improve performance.
- Model Evaluation: Evaluating each model on the training, validation, and test sets, and comparing accuracy, precision, recall, and F1 score.
- Ensemble Method: Combining different classifiers into an ensemble to enhance performance on the validation set, and testing the ensemble on the test set.
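As a rough illustration of the evaluation and ensemble steps above, the sketch below computes the four metrics from confusion counts and combines three classifiers by hard majority vote. It is a library-free sketch, not the repository's actual code: the function names `classification_metrics` and `majority_vote`, the variable names, and the six labels and predictions are all made up for the example, and the ensemble in this project may be combined differently (for example, by soft voting).

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def majority_vote(*model_predictions):
    """Hard voting: for each sample, return the label most models predicted.

    Use an odd number of models so that binary votes cannot tie.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


# Hypothetical validation labels and per-model predictions (not real results).
y_val = [1, 0, 1, 1, 0, 0]
pred_logreg = [1, 0, 0, 1, 0, 1]
pred_svm = [1, 0, 1, 0, 0, 0]
pred_forest = [0, 0, 1, 1, 1, 0]

ensemble_pred = majority_vote(pred_logreg, pred_svm, pred_forest)
ensemble_scores = classification_metrics(y_val, ensemble_pred)
logreg_scores = classification_metrics(y_val, pred_logreg)

print("ensemble:", ensemble_scores)
print("logistic regression alone:", logreg_scores)
```

In this toy case each model makes errors on different samples, so the vote corrects them. On real data the gain depends on how correlated the models' errors are.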
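The dataset includes eight medical predictor variables: number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, serum insulin, BMI, diabetes pedigree function, and age. The target variable, `Outcome`, is binary and indicates whether a patient has diabetes (1) or not (0).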
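To make the correlation analysis concrete, here is a minimal sketch of the Pearson correlation coefficient between one predictor and the target. The helper name `pearson` is hypothetical, the `glucose` and `outcome` values are invented for illustration (they are not rows from the dataset), and the project itself may rely on a library routine instead of a hand-written function.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator, proportional to covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented example values: a glucose-like feature and a binary outcome.
glucose = [85, 90, 120, 150, 170, 95]
outcome = [0, 0, 1, 1, 1, 0]

r = pearson(glucose, outcome)
print(f"PCC between glucose and outcome: {r:.3f}")
```

A coefficient near +1 or -1 suggests a strong linear relationship with the target, while a value near 0 suggests little linear relationship. With a binary target the result is equivalent to the point-biserial correlation.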