Predict customer subscription to bank term deposits using Machine Learning — An end-to-end data science project that transforms raw marketing campaign data into actionable business intelligence.
- Overview
- Business Problem
- Dataset
- Project Workflow
- Key Findings
- Model Performance
- Feature Importance
- Installation
- Usage
- Results & Business Recommendations
- Technologies Used
- Contributing
- License
Bank Conversion Intelligence is a machine learning project designed to predict whether a bank customer will subscribe to a term deposit based on direct marketing campaign data. This project demonstrates the complete ML pipeline from data cleaning to model deployment insights.
- ✅ Predicts customer subscription likelihood with 85% accuracy
- ✅ Identifies the top 10 drivers of customer conversion
- ✅ Provides actionable business recommendations for marketing optimization
- ✅ Compares multiple ML algorithms (Logistic Regression, Decision Tree, Random Forest)
Banks conduct direct marketing campaigns (phone calls) to sell term deposits to customers. However:
- Only ~11% of customers actually subscribe
- Cold-calling is expensive and time-consuming
- Sales teams waste resources on uninterested leads
Build a predictive model that:
- Identifies high-potential customers before the call
- Reduces wasted calls by filtering unlikely converters
- Maximizes ROI by focusing on warm leads
| Attribute | Description |
|---|---|
| Source | UCI Machine Learning Repository - Bank Marketing Dataset |
| Records | 41,188 customer records |
| Features | 20 attributes (demographic, campaign, economic) |
| Target | y - Has the client subscribed to a term deposit? (yes/no) |
| Category | Features |
|---|---|
| Demographics | age, job, marital status, education |
| Financial | housing loan, personal loan |
| Campaign Data | contact type, month, day, duration, campaign count |
| Previous Campaign | pdays, previous contacts, outcome |
| Economic Indicators | employment rate, consumer price index, euribor rate |
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Data Loading │───▶│ Data Cleaning │───▶│ EDA │
│ & Inspection │ │ & Preprocessing│ │ (Exploration) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Business │◀───│ Model │◀───│ Feature │
│ Insights │ │ Comparison │ │ Engineering │
└─────────────────┘ └─────────────────┘ └─────────────────┘
-
Data Cleaning
- Removed 12 duplicate records
- Dropped
defaultcolumn (99% unknown values) - Dropped
durationcolumn (prevents data leakage) - Imputed
educationunknowns using job-group mode
-
Feature Engineering
- Created
previously_contactedbinary feature frompdays - Winsorized
campaignoutliers at 99th percentile - One-hot encoded all categorical variables
- Created
-
Data Preprocessing
- Train/Test split: 80/20 with stratification
- StandardScaler normalization for numerical features
-
Model Training & Evaluation
- Tested 3 algorithms with
class_weight='balanced' - Evaluated using Precision, Recall, F1-Score, and Accuracy
- Tested 3 algorithms with
Target Distribution:
├── No (Did not subscribe): 88.7%
└── Yes (Subscribed): 11.3%
| Column | Unknown % | Action Taken |
|---|---|---|
default |
20.9% | Dropped column |
education |
4.2% | Imputed by job group |
job |
0.8% | Dropped rows |
marital |
0.2% | Dropped rows |
housing & loan |
2.4% | Kept as category ("declined to answer") |
| Model | Accuracy | Precision (Yes) | Recall (Yes) | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 77% | 36% | 63% | 46% |
| Decision Tree | 79% | 30% | 33% | 31% |
| Random Forest 🏆 | 85% | 40% | 61% | 48% |
| Tuned Random Forest | 85% | 40% | 61% | 48% |
- Strategy: Linear model with balanced class weights
- Result: High Recall (63%) but Low Precision (36%)
- Business Interpretation: "Aggressive Sales Manager" — catches most buyers but wastes resources on many non-buyers
- Strategy: Non-linear model to capture complex patterns
- Result: Failed with only 33% Recall and 30% Precision
- Business Interpretation: Overfitting to training noise — "fired" this model
- Strategy: Ensemble of 100 trees with max_depth=10
- Result: Best trade-off with 61% Recall and 40% Precision
- Business Interpretation: "Efficient Sales Director" — maintains high buyer capture while reducing wasted calls by 4%
- Strategy: GridSearchCV tested 81 parameter combinations across 5-fold CV
- Result: Optimal configuration maintains 85% accuracy with balanced precision-recall
- Conclusion: Baseline Random Forest was already well-optimized
The ~40% precision ceiling exists due to:
- Data Limitations: No real-time income, debt levels, or immediate financial needs
- Human Variance: Emotional/external factors invisible to the model
To break 50% precision: Integrate external data like credit scores or spending habits.
| Rank | Feature | Category | Importance |
|---|---|---|---|
| 1️⃣ | euribor3m |
Economic | ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ |
| 2️⃣ | nr.employed |
Economic | ⬛⬛⬛⬛⬛⬛⬛⬛⬛ |
| 3️⃣ | age |
Demographic | ⬛⬛⬛⬛⬛⬛⬛ |
| 4️⃣ | pdays |
Campaign | ⬛⬛⬛⬛⬛⬛ |
| 5️⃣ | emp.var.rate |
Economic | ⬛⬛⬛⬛⬛ |
| 6️⃣ | cons.price.idx |
Economic | ⬛⬛⬛⬛ |
| 7️⃣ | campaign |
Campaign | ⬛⬛⬛ |
| 8️⃣ | cons.conf.idx |
Economic | ⬛⬛⬛ |
| 9️⃣ | previous |
Campaign | ⬛⬛ |
| 🔟 | previously_contacted |
Engineered | ⬛⬛ |
📊 Interest rates and employment levels predict conversions better than customer age or job.
| Strategy | Action | Expected Impact |
|---|---|---|
| Dynamic Scheduling | Call aggressively when interest rates are favorable | 📈 +15-20% conversion |
| Retarget Warm Leads | Prioritize customers contacted in last 30 days | 📈 +10% efficiency |
| Reduce Cold Calls | Filter out low-probability segments | 💰 -25% call costs |
| Economic Monitoring | Track euribor3m and employment rates weekly | ⏰ Timing optimization |
| Lead Prioritization | Implement scoring matrix based on top features | 🎯 3x better targeting |
| Priority Level | Criteria | Expected Conversion |
|---|---|---|
| High | Age 50+, contacted within 30 days, favorable economic indicators | 35-45% |
| Medium | Age 30-50, contacted 30-180 days ago | 20-30% |
| Low | Age <30, never contacted, unfavorable economy | 5-10% |
Daily Call Volume: 1,000 calls
Conversion Rate: 11% (baseline)
Successful Conversions: 110 customers
Cost per Call: $5
Daily Call Cost: $5,000
Cost per Acquisition: $45.45
Daily Call Volume: 400 calls (filtered by model)
Conversion Rate: 27.5% (from 40% precision)
Successful Conversions: 110 customers (maintained)
Cost per Call: $5
Daily Call Cost: $2,000
Cost per Acquisition: $18.18
- Cost Reduction: 60% fewer calls ($3,000 saved per day)
- Annual Savings: ~$1,095,000 (assuming 365 days)
- Efficiency Gain: Same revenue with 60% less effort
- ROI Improvement: Cost per customer drops from $45.45 to $18.18 (2.5x better)
If reinvesting the saved $3,000/day into high-priority leads:
- Option A: Double the contact volume on high-priority leads → Potential 50% revenue increase
- Option B: Invest in data enrichment (credit scores, spending patterns) → Push precision from 40% to 55%
- Python 3.8 or higher
- Jupyter Notebook or Google Colab
# Clone the repository
git clone https://github.com/zyna-b/Bank-Conversion-Intelligence.git
cd Bank-Conversion-Intelligence
# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txtpandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0jupyter notebook Bank_Conversion_Intelligence.ipynbimport pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
# Load your data
df = pd.read_csv('bank-additional-full.csv', sep=';')
# Preprocess (see notebook for full pipeline)
# ...
# Train the champion model
rf_model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
class_weight='balanced',
random_state=42
)
rf_model.fit(X_train_scaled, y_train)
# Predict
predictions = rf_model.predict(X_test_scaled)| Technology | Purpose |
|---|---|
| Core programming language | |
| Data manipulation & analysis | |
| Numerical computing | |
| Data visualization | |
| Statistical visualization | |
| Machine learning models | |
| Interactive development |
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
- Add XGBoost/LightGBM models
- Implement SHAP explanations
- Create a Flask/Streamlit web app
- Add cross-validation analysis
- Integrate hyperparameter tuning
This project is licensed under the MIT License - see the LICENSE file for details.
Zyna B - GitHub
Project Link: https://github.com/zyna-b/Bank-Conversion-Intelligence
⭐ If this project helped you, please give it a star! ⭐
- UCI ML Repository - Bank Marketing Dataset
- [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
For detailed technical implementation and comprehensive analysis:
- Feature Importance Analysis: See notebook cells after visualization
- Hyperparameter Tuning Results: GridSearchCV optimization details
- Model Limitations & Future Improvements: Version 2.0 roadmap
- Complete Project Summary: Final conclusion section in notebook
Technical Skills Demonstrated:
- End-to-end ML pipeline (cleaning → EDA → modeling → evaluation)
- Handling class imbalance with
class_weight='balanced' - Feature engineering (
previously_contactedcreation) - Model comparison (Logistic Regression vs Decision Tree vs Random Forest)
- Hyperparameter tuning with GridSearchCV
- Data leakage prevention (dropping
durationcolumn)
Business Acumen:
- Translating model metrics (precision/recall) into business terms (cost/revenue)
- Balancing False Positives vs False Negatives based on business context
- ROI calculation and projection
- Strategic recommendations from data insights
Data Science Best Practices:
- Train/test split with stratification
- Feature scaling (StandardScaler)
- Feature importance analysis for interpretability
- Strategic handling of "unknown" values