Clusty the Cluster is a comprehensive Data Science project dedicated to Patient Segmentation. By leveraging unsupervised machine learning techniques, this project aims to identify distinct groups of patients based on their characteristics and medical history. These insights can be instrumental in personalizing healthcare plans, optimizing resource allocation, and improving patient outcomes.
The project follows a structured data science workflow, from Exploratory Data Analysis (EDA) to detailed Cluster Profiling.
The analysis is divided into four sequential stages, each documented in a dedicated Jupyter Notebook:
| Stage | Notebook | Description |
|---|---|---|
| 1. Exploration | 01_eda.ipynb | Initial data inspection, distribution analysis, and correlation checks to understand the dataset's structure. |
| 2. Preprocessing | 02_preprocessing.ipynb | Data cleaning, handling missing values, feature engineering, and normalization/standardization. A preprocessor.pkl is generated here. |
| 3. Modeling | 03_clustering.ipynb | Application of clustering algorithms (e.g., K-Means, Hierarchical Clustering). Includes hyperparameter tuning and model selection. |
| 4. Analysis | 04_cluster_profiling.ipynb | Interpreting the resulting clusters. Analyzing the centroids and distribution of features within each segment to derive actionable insights. |

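
As a rough illustration of stages 2 and 3, the following minimal sketch imputes and scales numeric features, one-hot encodes a categorical one, and then compares K-Means solutions by silhouette score. It is not taken from the notebooks: the column names (`age`, `bmi`, `smoker`), the synthetic data, the candidate range of `k`, and the choice of silhouette score as the selection criterion are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the patient data (hypothetical columns).
rng = np.random.default_rng(0)
group = 50
df = pd.DataFrame({
    "age": np.concatenate([rng.normal(m, 4, group) for m in (30, 55, 75)]),
    "bmi": np.concatenate([rng.normal(m, 1.5, group) for m in (22, 27, 31)]),
    "smoker": rng.choice(["yes", "no"], 3 * group),
})
df.loc[::25, "bmi"] = np.nan  # inject a few missing values

numeric = ["age", "bmi"]
categorical = ["smoker"]

# Stage 2: impute and standardize numeric columns, one-hot encode categoricals.
preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X = preprocessor.fit_transform(df)

# Stage 3: fit K-Means for several k and keep the best silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"silhouette by k: {scores}")
print(f"selected k: {best_k}")
```

Silhouette score is only one possible criterion; the elbow method on inertia or domain knowledge about the patient population may lead to a different choice of `k`.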
The project utilizes the patient_segmentation_dataset.csv located in the data/ directory.
- Input: Raw patient data.
- Output: clustered_data.csv, containing the original data enriched with cluster assignments.
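
The input/output contract above could be sketched roughly as follows. This is an illustration only: the helper `attach_clusters`, the `cluster` column name, and the toy data are hypothetical, and the example writes to a temporary directory rather than the project's data/ folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def attach_clusters(raw: pd.DataFrame, labels) -> pd.DataFrame:
    """Return a copy of `raw` with one cluster label per row (hypothetical helper)."""
    if len(labels) != len(raw):
        raise ValueError("expected exactly one label per row")
    out = raw.copy()
    out["cluster"] = list(labels)
    return out


# Toy stand-in for the raw patient data and for labels from the clustering stage.
raw = pd.DataFrame({"patient_id": [1, 2, 3, 4], "age": [34, 71, 29, 68]})
labels = [0, 1, 0, 1]

clustered = attach_clusters(raw, labels)

# Write the enriched table and read it back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clustered_data.csv"
    clustered.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Copying the frame before adding the label column leaves the raw data untouched, which keeps earlier notebook cells reproducible.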
- Language: Python
- Data Manipulation: Pandas, NumPy
- Machine Learning: Scikit-learn
- Visualization: Matplotlib, Seaborn
- Environment Management: venv (Python's built-in virtual environment module)
Ensure you have Python installed. It is recommended to use a virtual environment.
- Clone the repository (if applicable) or navigate to the project directory.
- Create and activate a virtual environment:

      python -m venv .venv
      source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

Launch the Jupyter Notebook server to interact with the analysis files:

    jupyter notebook

Run the notebooks in order (01 to 04) to replicate the full pipeline.
The analysis culminates in 04_cluster_profiling.ipynb, which provides a detailed breakdown of the identified patient profiles. These profiles help in understanding the "archetypes" present in the patient population.
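
A minimal sketch of what such a profile can look like, assuming the clustered data has numeric feature columns and a `cluster` column (the names and values below are made up for illustration and are not results from the project): per-cluster sizes, per-cluster feature means, and each mean expressed relative to the overall population.

```python
import pandas as pd

# Toy clustered data; in practice this would come from clustered_data.csv.
clustered = pd.DataFrame({
    "age": [30, 34, 70, 74, 50, 52],
    "bmi": [22.0, 24.0, 30.0, 28.0, 26.0, 26.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})
features = ["age", "bmi"]

# How many patients fall into each segment.
sizes = clustered["cluster"].value_counts().sort_index()

# Mean of each feature within each segment.
profile = clustered.groupby("cluster")[features].mean()

# Distance of each segment mean from the overall mean, in units of the
# overall standard deviation, to highlight what makes a segment distinctive.
relative = (profile - clustered[features].mean()) / clustered[features].std()

print(sizes)
print(profile)
print(relative.round(2))
```

Large positive or negative values in `relative` point to the features that most distinguish a segment from the population as a whole.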
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open-source and available under the MIT License.
