An enterprise-grade Customer Segment Intelligence (CSI) framework engineered to discover and profile latent customer behavioral archetypes for optimized housing product cross-selling strategies.
This repository deploys a production-ready, 11-stage analytical pipeline that bridges the gap between high-dimensional transactional data and C-level commercial execution—utilizing advanced outlier isolation, PCA variance optimization, multi-metric cluster validation (
- The Analytical Pipeline: Engineers a robust 11-step computational workflow, transitioning from raw high-dimensional transactional inputs to a standardized, normalized dimensional space for model training.
- Advanced Governance: Implements an automated `DataPipeline` class to ensure strict adherence to transformation logic (log-scaling, `StandardScaler`, and feature engineering), eliminating training-serving skew when generalizing across unseen regional datasets.
- Unsupervised Intelligence: Employs $K$-Means clustering to identify five distinct behavioral archetypes, validated through a rigorous Silhouette & Elbow methodology to maximize intra-cluster cohesion and inter-cluster separation.
- Commercial Application: Bridges technical analytics with business strategy by mapping cluster-specific conversion rates, enabling data-driven identification of high-propensity segments for optimized housing insurance cross-selling.
- Clustering Diagnostics: Elbow & Silhouette validation for optimal k-selection.
- Behavioral Heatmap: Correlation matrix highlighting multi-regional feature relationships.
- Language: Python 3.x
- Core Libraries: `pandas`, `numpy`, `scikit-learn`
- Visualization: `seaborn`, `matplotlib`
- Architecture: Modularized pipeline via custom `DataPipeline` class for production-ready inference.
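The Elbow and Silhouette diagnostics mentioned above can be illustrated with a short sketch. This is not the repository's code: the data is a synthetic stand-in generated with `make_blobs`, and the helper name `k_selection_diagnostics` and the candidate range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer feature matrix (assumption,
# not the project's data).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)


def k_selection_diagnostics(X, k_values):
    """Return {k: (inertia, silhouette)} for each candidate k (hypothetical helper)."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        # Inertia feeds the Elbow plot; the silhouette score measures
        # cohesion within clusters relative to separation between them.
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


diagnostics = k_selection_diagnostics(X, range(2, 9))
best_k = max(diagnostics, key=lambda k: diagnostics[k][1])

for k, (inertia, sil) in diagnostics.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
print("k with the highest silhouette score:", best_k)
```

In practice the two metrics are read together: the Elbow plot shows where adding clusters stops reducing inertia substantially, and the silhouette score checks that the chosen k still yields well-separated groups.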
The project follows a modular structure designed for maintainability and scalability in data science workflows:
nordik_customer_segmentation/
├── LICENSE # License information
├── README.md # Project documentation
├── requirements.txt # Essential Python dependencies
├── data/ # Raw datasets and intermediate processed files
├── notebooks/ # Exploratory data analysis and model experimentation
└── src/ # Core modular processing logic and ML pipeline
    ├── __init__.py # Package initialization
    └── nordik_seguros_pipeline.py # DataPipeline class (cleaning, feature engineering, and model inference)
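The `DataPipeline` class itself is not reproduced in this README. As a rough sketch of the idea it describes (fit the transformations once, then reuse the same fitted parameters on unseen regional data), something like the following could work. The class name `DataPipelineSketch`, the column names, the 95% PCA variance threshold, and the synthetic data, including the conversion flag, are all assumptions for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class DataPipelineSketch:
    """Hypothetical stand-in for DataPipeline: fit once, reuse on new data."""

    def __init__(self, skewed_cols, n_clusters=5, variance=0.95, random_state=42):
        self.skewed_cols = list(skewed_cols)
        self.scaler = StandardScaler()
        # A float n_components keeps enough components to explain that
        # share of the variance.
        self.pca = PCA(n_components=variance, svd_solver="full")
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=random_state)

    def _prepare(self, df):
        out = df.copy()
        # Log-scale right-skewed, non-negative columns.
        out[self.skewed_cols] = np.log1p(out[self.skewed_cols])
        return out

    def fit(self, df):
        self.columns_ = list(df.columns)
        X = self.scaler.fit_transform(self._prepare(df))
        self.kmeans.fit(self.pca.fit_transform(X))
        return self

    def predict(self, df):
        # Only transform() is called here, so the scaling and PCA parameters
        # learned during fit() are reused unchanged on new data.
        X = self.scaler.transform(self._prepare(df[self.columns_]))
        return self.kmeans.predict(self.pca.transform(X))


rng = np.random.default_rng(0)


def make_region(n):
    """Generate synthetic customers; the column names are hypothetical."""
    return pd.DataFrame({
        "annual_spend": rng.lognormal(mean=7.0, sigma=1.0, size=n),
        "n_policies": rng.integers(1, 6, size=n).astype(float),
        "tenure_years": rng.uniform(0, 20, size=n),
    })


train_df, new_region_df = make_region(400), make_region(50)
pipeline = DataPipelineSketch(skewed_cols=["annual_spend"]).fit(train_df)
segments = pipeline.predict(new_region_df)

# Synthetic conversion flag, used only to show a per-cluster conversion rate.
converted = rng.random(len(new_region_df)) < 0.2
rates = pd.Series(converted).groupby(segments).mean()
print(rates)
```

Keeping the fitted scaler, PCA, and K-Means objects inside one class means inference never refits on new data, which is one common way to avoid training-serving skew.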
To adhere to strict industry compliance standards and protect corporate confidentiality, the datasets and metadata within this repository have been subjected to a rigorous anonymization process:
- Customer Identity Masking: Personally Identifiable Information (PII), such as specific client names, contact details, and sensitive demographics, has been replaced with randomized synthetic placeholders to ensure full compliance with global data protection standards.
- Geospatial & Account Protection: Original regional identifiers and specific insurance account numbers have been abstracted to prevent the inference of individual policyholder behavior or proprietary market distribution strategies.
- Product & Revenue Normalization: Transactional values and product-specific labels have been normalized to protect the firm's competitive commercial intelligence and internal financial reporting structures.
The pipeline logic, feature engineering, and unsupervised clustering methodology remain faithful to the original business intelligence requirements, so the analysis stays reproducible and instructive while the privacy of the original stakeholders is safeguarded.
