This project analyzes the T1DiabetesGranada dataset, a collection of medical and biochemical data from patients with Type 1 Diabetes Mellitus (T1DM).
The dataset includes continuous glucose measurements, biochemical test results, patient demographics, and diagnostic codes for complications.
The goal of the study is to understand the relationships between glucose control, biochemical parameters, and diabetes-related complications, using both statistical and machine learning techniques.
- Explore and assess the quality and structure of the data.
- Analyze the relationship between glucose levels and diabetic complications.
- Study correlations between biochemical parameters and glycemic control.
- Apply clustering (K-Means, Gaussian Mixture Models) to identify patient subgroups.
- Implement classification models (Random Forest) to predict the presence of complications.
- Validate findings through statistical tests and cross-validation techniques.
Four main datasets were analyzed:
- Patient_info.csv – demographic and summary information about each patient
- Glucose_measurements.csv – continuous glucose monitoring data (one reading every 15 minutes)
- Biochemical_parameters.csv – laboratory test results for 17 biochemical parameters
- Diagnostics.csv – ICD-9-CM diagnostic codes describing patient complications
- Computation of TIR (Time In Range), TAR (Time Above Range), and TBR (Time Below Range) indicators.
- Statistical comparison between patients with and without complications using the Mann–Whitney U test.
- Correlation analysis between glucose levels and biochemical parameters.
- Visualization of distributions by gender and age group.
- Construction of a derived dataset for unsupervised learning.
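
The TIR/TAR/TBR computation listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the 70–180 mg/dL target range is an assumed convention (the thresholds used in the notebooks are not stated here), and `time_in_ranges` and the sample readings are hypothetical. Because readings are equally spaced (one every 15 minutes), the share of readings in a band approximates the share of time spent in it.

```python
import numpy as np

# Assumed thresholds in mg/dL; the project may use different ones.
LOW_MG_DL = 70
HIGH_MG_DL = 180


def time_in_ranges(glucose, low=LOW_MG_DL, high=HIGH_MG_DL):
    """Return TIR, TAR and TBR as percentages of the valid readings."""
    values = np.asarray(glucose, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing readings
    if values.size == 0:
        raise ValueError("no valid glucose readings")
    tbr = np.mean(values < low) * 100   # share of readings below range
    tar = np.mean(values > high) * 100  # share of readings above range
    tir = 100.0 - tbr - tar             # remainder is within [low, high]
    return {"TIR": tir, "TAR": tar, "TBR": tbr}


# Made-up readings for one hypothetical patient (not from the dataset).
readings = [65, 90, 110, 150, 185, 200, 120, 95, 60, 130]
result = time_in_ranges(readings)
print(result)
```

In practice the function would be applied per patient, for example through a pandas `groupby` on the patient identifier column, to build one row of indicators per patient.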
- Application of K-Means and Gaussian Mixture Models (GMM) to group patients based on glycemic and biochemical profiles.
- Evaluation of clustering results using ANOVA and visualization of cluster-specific complication patterns.
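
The clustering step above might look roughly like the sketch below. It runs on synthetic data standing in for per-patient features; the three columns (loosely TIR, TBR, HbA1c), the choice of two clusters, and the feature standardization are all assumptions made for illustration, not the settings used in the project.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for per-patient features; NOT real patient data.
group_a = rng.normal(loc=[70, 3, 6.5], scale=[5, 1.0, 0.3], size=(50, 3))
group_b = rng.normal(loc=[45, 8, 8.5], scale=[5, 1.5, 0.4], size=(50, 3))
X = np.vstack([group_a, group_b])

# Standardize so that no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Hard assignments from K-Means.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Gaussian Mixture Model: hard labels plus per-cluster probabilities.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_scaled)
gmm_labels = gmm.predict(X_scaled)
gmm_probs = gmm.predict_proba(X_scaled)

# One-way ANOVA on the first feature across the K-Means clusters.
groups = [X[kmeans_labels == k, 0] for k in np.unique(kmeans_labels)]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA on feature 0: F={f_stat:.1f}, p={p_value:.3g}")
```

Note that an ANOVA run on the same features used to form the clusters tends to look significant by construction, so it describes how clusters differ rather than proving that they are real.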
- Implementation of Random Forest classifiers to predict the presence or type of complications.
- Experiments with balanced and imbalanced datasets.
- Validation through K-Fold Cross Validation and TrainFold/TestFixed setups.
- Analysis of feature importance and feature selection (RFE).
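
As a rough illustration of the classification workflow above, the sketch below trains a Random Forest on a synthetic imbalanced dataset, scores it with stratified K-fold cross-validation, and runs RFE. Everything here is assumed for the example: the data comes from `make_classification`, and the class weighting, fold count, scoring metric, and number of selected features are placeholders. The TrainFold/TestFixed setup is not covered.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 80% negatives); NOT real patient data.
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=4,
    weights=[0.8, 0.2],
    random_state=0,
)

# class_weight="balanced" is one way to handle imbalance without resampling.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print("F1 per fold:", scores.round(3))

# Recursive Feature Elimination: drop the weakest feature at each step.
selector = RFE(clf, n_features_to_select=5).fit(X, y)
selected = selector.support_
print("Selected feature mask:", selected)

# Impurity-based importances from a model fitted on all features.
clf.fit(X, y)
importances = clf.feature_importances_
print("Importances:", importances.round(3))
```

For an unbiased performance estimate, feature selection would normally be placed inside a scikit-learn `Pipeline` so that RFE is refit within each training fold rather than on the full dataset.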
- Among glycemic indicators, TBR (Time Below Range) showed a statistically significant difference between patients with and without complications.
- Hypoglycemia (low glucose levels) appears more strongly associated with the presence of complications than the other glycemic indicators.
- Hemoglobin A1c (HbA1c) correlates with average glucose levels, confirming known medical relationships.
- The clustering (unsupervised) and classification (supervised) analyses highlighted consistent patterns.
- Python
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- Jupyter Notebook
Matteo Avella
Gabriele Gaudiosi
University of Salerno – Department of Computer Science
Course: Foundations of Computer Vision and Biometrics
Academic Year: 2024/2025
This project is intended for educational and research purposes only.
All rights reserved © 2025 Matteo Avella & Gabriele Gaudiosi.