A comprehensive data science project for customer analytics, segmentation, and lookalike modeling. Built as an assignment for the Zeotap Data Science internship position (January 2025).
- 200 customers analyzed
- 1,000 transactions processed
- 5 customer segments identified (K-Means clustering)
- Davies-Bouldin Index: 1.05 (lower is better)
- 3 analysis modules (EDA, Clustering, Lookalike)
- Lookalike recommendations generated for 20 customers
- 4 key features (Total Spend, Avg Spend, Transaction Count, Avg Quantity)
This project demonstrates end-to-end customer analytics capabilities including exploratory data analysis, customer segmentation using machine learning, and lookalike modeling for targeted marketing. Developed as part of the Zeotap Data Scientist internship assessment.
- Company: Zeotap (Customer Data Platform)
- Position: Data Science Internship
- Date: January 2025
- Objective: Demonstrate data science skills in customer analytics and segmentation
- Exploratory Data Analysis (EDA): Analyze customer, product, and transaction data
- Customer Segmentation: Group customers using K-Means clustering
- Lookalike Modeling: Identify similar customers for targeted campaigns
- K-Means clustering for customer segmentation
- Davies-Bouldin Index for cluster quality evaluation
- PCA (Principal Component Analysis) for dimensionality reduction
- Cosine similarity for lookalike modeling
- Exploratory Data Analysis (EDA)
- Feature engineering and aggregation
- Data merging and transformation
- Statistical analysis and profiling
- StandardScaler for feature normalization
- One-hot encoding for categorical variables
- DateTime parsing and manipulation
- Missing value handling
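As a minimal sketch of the feature engineering and scaling steps listed above, the snippet below aggregates a toy transactions table into the four per-customer features and standardizes them. The column names (`CustomerID`, `TransactionID`, `Quantity`, `TotalValue`) and the sample rows are assumptions made for illustration; the actual schema of `Transactions.csv` may differ.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy transactions; column names and values are assumptions, not the real schema.
transactions = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "TransactionID": ["T1", "T2", "T3", "T4", "T5", "T6"],
    "Quantity": [2, 1, 4, 1, 3, 2],
    "TotalValue": [40.0, 15.0, 120.0, 10.0, 45.0, 35.0],
})

# One row per customer with the four features named in this README.
features = transactions.groupby("CustomerID").agg(
    total_spend=("TotalValue", "sum"),
    avg_spend=("TotalValue", "mean"),
    transaction_count=("TransactionID", "count"),
    avg_quantity=("Quantity", "mean"),
)

# Standardize each feature to zero mean and unit variance so that
# large-valued features such as total spend do not dominate distances.
scaled = StandardScaler().fit_transform(features)
print(features)
```

Scaling matters here because K-Means and cosine similarity both operate on raw feature magnitudes.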
DS_Assignment_Zeotap/
├── data/
│   ├── Customers.csv            # Customer profiles (200 records)
│   ├── Products.csv             # Product catalog
│   └── Transactions.csv         # Transaction history (1,000 records)
├── src/
│   ├── eda.py                   # Exploratory Data Analysis
│   ├── clustering.py            # Customer Segmentation (K-Means)
│   └── lookalike.py             # Lookalike Modeling
├── reports/
│   ├── eda_report.pdf           # EDA insights and visualizations
│   └── clustering_report.pdf    # Segmentation analysis and recommendations
├── output/
│   └── lookalike_results.csv    # Lookalike recommendations
├── .gitignore
├── LICENSE
└── README.md
- Python 3.8 or higher
- pip package manager
Clone the repository:
git clone https://github.com/patelritiq/DS_Assignment_Zeotap.git
cd DS_Assignment_Zeotap
Install required libraries:
pip install pandas numpy matplotlib seaborn scikit-learn
Analyze datasets and visualize insights:
cd src
python eda.py
Output:
- Data summaries and statistics
- Most purchased products
- Customer behavior patterns
Report: See reports/eda_report.pdf for detailed insights and business recommendations.
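One way the "most purchased products" summary could be computed is sketched below: merge transactions with the product catalog, then total the quantity per product. The column names (`ProductID`, `ProductName`, `Quantity`) and the sample rows are hypothetical and are not taken from the project's data files.

```python
import pandas as pd

# Hypothetical product catalog and transactions for illustration only.
products = pd.DataFrame({
    "ProductID": ["P1", "P2", "P3"],
    "ProductName": ["Notebook", "Headphones", "Backpack"],
})
transactions = pd.DataFrame({
    "ProductID": ["P1", "P2", "P1", "P3", "P2", "P1"],
    "Quantity": [2, 1, 3, 1, 4, 1],
})

# Attach product names to each transaction row.
merged = transactions.merge(products, on="ProductID", how="left")

# Total units sold per product, most purchased first.
top_products = (
    merged.groupby("ProductName")["Quantity"]
    .sum()
    .sort_values(ascending=False)
)
print(top_products.head())
```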
Perform K-Means clustering to segment customers:
cd src
python clustering.py
Output:
- Optimal number of clusters: 5
- Davies-Bouldin Index: 1.05
- PCA visualization of customer segments
Report: See reports/clustering_report.pdf for detailed segmentation analysis.
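The PCA visualization step can be illustrated roughly as follows. This sketch projects four standardized features onto two principal components; the random input is a stand-in for the real customer features, and nothing here is taken from `clustering.py` itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for 200 customers with four features (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Standardize first, since PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the two directions of greatest variance for a 2D plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(coords.shape)
print(pca.explained_variance_ratio_.sum())
```

The two columns of `coords` can then be passed to `matplotlib.pyplot.scatter`, colored by cluster label. The explained variance ratio indicates how much of the original spread the 2D view retains.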
Generate similar customer recommendations:
cd src
python lookalike.py
Output:
- Lookalike recommendations for 20 customers
- Results saved to output/lookalike_results.csv
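A cosine-similarity lookalike lookup might look like the sketch below. The customer profiles are made-up vectors, and `top_lookalikes` and its `k` parameter are hypothetical names introduced for illustration; the number of lookalikes returned per customer in the actual project is not stated in this README.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up, already-scaled customer profiles (illustrative values only).
profiles = pd.DataFrame(
    [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [-1.0, -0.8, 0.0], [0.1, 0.0, 1.0]],
    index=["C1", "C2", "C3", "C4"],
)

# Pairwise cosine similarity between every pair of customers.
sim = pd.DataFrame(
    cosine_similarity(profiles),
    index=profiles.index,
    columns=profiles.index,
)

def top_lookalikes(customer_id, k=2):
    """Return the k most similar customers, excluding the customer itself."""
    scores = sim.loc[customer_id].drop(customer_id)
    return scores.nlargest(k)

print(top_lookalikes("C1"))
```

Cosine similarity compares the direction of two profiles rather than their magnitude, so customers with proportionally similar behavior score high even if their absolute spend differs.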
| Cluster | Description | Characteristics | Recommendation |
|---|---|---|---|
| Cluster 0 | High-Frequency Buyers | Medium spending, active engagement | Loyalty programs, personalized offers |
| Cluster 1 | Occasional Buyers | Low frequency, low spending | Basic incentives, re-engagement campaigns |
| Cluster 2 | High-Value Customers | Significant spending, moderate engagement | Premium services, upselling strategies |
| Cluster 3 | Inactive Customers | Very little activity | Reactivation campaigns, personalized offers |
| Cluster 4 | Moderate Spenders | Frequent purchases, consistent engagement | Increase basket size, bundled products |
- Algorithm: K-Means (chosen for efficiency and interpretability)
- Features: Total Spend, Avg Spend, Transaction Count, Avg Quantity
- Optimization: Davies-Bouldin Index (lower = better)
- Visualization: PCA for 2D representation
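The selection of the cluster count by Davies-Bouldin Index can be sketched as below. The synthetic blobs stand in for the real customer features, and the search range of 2 to 10 clusters is an assumption; this README does not say which range the project actually evaluated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 200 customers with four features (an assumption).
X, _ = make_blobs(n_samples=200, n_features=4, centers=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate k and record its Davies-Bouldin Index.
scores = {}
for k in range(2, 11):  # candidate range is assumed, not from the project
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = davies_bouldin_score(X_scaled, labels)

# Lower Davies-Bouldin Index means more compact, better separated clusters.
best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Fixing `random_state` keeps the comparison across values of `k` reproducible, since K-Means depends on its initial centroids.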
- Customers: 200 unique customers
- Transactions: 1,000 transaction records
- Products: Multiple product categories
- Regions: Multiple geographic regions
- Optimal Clusters: 5 segments
- Davies-Bouldin Index: 1.05 (acceptable quality)
- Clear Separation: PCA visualization shows distinct clusters
- Enables targeted marketing campaigns
- Identifies high-value customer segments
- Supports personalized customer engagement
- Facilitates lookalike audience targeting
- Hierarchical clustering for comparison
- DBSCAN for density-based segmentation
- Time-series analysis for customer lifetime value
- Churn prediction modeling
- Real-time recommendation system
- Interactive dashboard (Streamlit/Dash)
- A/B testing framework
- Advanced feature engineering (RFM analysis)
Detailed analysis reports are available in the reports/ folder:
- EDA Report (eda_report.pdf): Comprehensive exploratory analysis with visualizations and business insights
- Clustering Report (clustering_report.pdf): Customer segmentation analysis with cluster descriptions and recommendations
- Python: Core programming language
- Pandas: Data manipulation and analysis
- NumPy: Numerical computations
- Scikit-learn: Machine learning algorithms
- Matplotlib: Data visualization
- Seaborn: Statistical visualizations
This project is licensed under the MIT License - see the LICENSE file for details.
Ritik Pratap Singh Patel
- Data Science & Machine Learning Enthusiast
- Email: patelritiq@gmail.com
This project was developed as an assignment for Zeotap's Data Science internship position. It demonstrates the practical application of data science techniques to customer analytics and segmentation.
Zeotap: https://github.com/zeotap
# Clone repository
git clone https://github.com/patelritiq/DS_Assignment_Zeotap.git
cd DS_Assignment_Zeotap
# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn
# Run EDA
cd src
python eda.py
# Run Clustering
python clustering.py
# Run Lookalike Modeling
python lookalike.py
Transform customer data into actionable insights!