🏦 Bank Conversion Intelligence

Predict customer subscription to bank term deposits using Machine Learning — An end-to-end data science project that transforms raw marketing campaign data into actionable business intelligence.

📋 Table of Contents

Overview
Business Problem
Dataset
Project Workflow
Key Findings
Model Performance
Feature Importance
Installation
Usage
Results & Business Recommendations
Technologies Used
Contributing
License

🎯 Overview

Bank Conversion Intelligence is a machine learning project designed to predict whether a bank customer will subscribe to a term deposit based on direct marketing campaign data. This project demonstrates the complete ML pipeline from data cleaning to model deployment insights.

What This Project Does:

✅ Predicts customer subscription likelihood with 85% accuracy
✅ Identifies the top 10 drivers of customer conversion
✅ Provides actionable business recommendations for marketing optimization
✅ Compares multiple ML algorithms (Logistic Regression, Decision Tree, Random Forest)

💼 Business Problem

The Challenge

Banks conduct direct marketing campaigns (phone calls) to sell term deposits to customers. However:

Only ~11% of customers actually subscribe
Cold-calling is expensive and time-consuming
Sales teams waste resources on uninterested leads

The Solution

Build a predictive model that:

Identifies high-potential customers before the call
Reduces wasted calls by filtering unlikely converters
Maximizes ROI by focusing on warm leads

📊 Dataset

Attribute	Description
Source	UCI Machine Learning Repository - Bank Marketing Dataset
Records	41,188 customer records
Features	20 attributes (demographic, campaign, economic)
Target	`y` - Has the client subscribed to a term deposit? (yes/no)

Feature Categories

Category	Features
Demographics	age, job, marital status, education
Financial	housing loan, personal loan
Campaign Data	contact type, month, day, duration, campaign count
Previous Campaign	pdays, previous contacts, outcome
Economic Indicators	employment rate, consumer price index, euribor rate

🔄 Project Workflow

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Loading   │───▶│  Data Cleaning  │───▶│      EDA        │
│   & Inspection  │    │  & Preprocessing│    │  (Exploration)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                      │
                                                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Business     │◀───│     Model       │◀───│    Feature      │
│    Insights     │    │   Comparison    │    │   Engineering   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Step-by-Step Process

Data Cleaning
- Removed 12 duplicate records
- Dropped default column (99% unknown values)
- Dropped duration column (prevents data leakage)
- Imputed education unknowns using job-group mode
Feature Engineering
- Created previously_contacted binary feature from pdays
- Winsorized campaign outliers at 99th percentile
- One-hot encoded all categorical variables
Data Preprocessing
- Train/Test split: 80/20 with stratification
- StandardScaler normalization for numerical features
Model Training & Evaluation
- Tested 3 algorithms with class_weight='balanced'
- Evaluated using Precision, Recall, F1-Score, and Accuracy

🔍 Key Findings

Class Imbalance Discovery

Target Distribution:
├── No (Did not subscribe):  88.7%
└── Yes (Subscribed):        11.3%

Unknown Value Analysis

Column	Unknown %	Action Taken
`default`	20.9%	Dropped column
`education`	4.2%	Imputed by job group
`job`	0.8%	Dropped rows
`marital`	0.2%	Dropped rows
`housing` & `loan`	2.4%	Kept as category ("declined to answer")

📈 Model Performance

Comparison Summary

Model	Accuracy	Precision (Yes)	Recall (Yes)	F1-Score
Logistic Regression	77%	36%	63%	46%
Decision Tree	79%	30%	33%	31%
Random Forest 🏆	85%	40%	61%	48%
Tuned Random Forest	85%	40%	61%	48%

Model Analysis

🥉 Logistic Regression (Baseline)

Strategy: Linear model with balanced class weights
Result: High Recall (63%) but Low Precision (36%)
Business Interpretation: "Aggressive Sales Manager" — catches most buyers but wastes resources on many non-buyers

🥈 Decision Tree (Challenger)

Strategy: Non-linear model to capture complex patterns
Result: Failed with only 33% Recall and 30% Precision
Business Interpretation: Overfitting to training noise — "fired" this model

🥇 Random Forest (Champion) ✅

Strategy: Ensemble of 100 trees with max_depth=10
Result: Best trade-off with 61% Recall and 40% Precision
Business Interpretation: "Efficient Sales Director" — maintains high buyer capture while reducing wasted calls by 4%

🔧 Hyperparameter Tuned Random Forest

Strategy: GridSearchCV tested 81 parameter combinations across 5-fold CV
Result: Optimal configuration maintains 85% accuracy with balanced precision-recall
Conclusion: Baseline Random Forest was already well-optimized

Why Not 90% Precision?

The ~40% precision ceiling exists due to:

Data Limitations: No real-time income, debt levels, or immediate financial needs
Human Variance: Emotional/external factors invisible to the model

To break 50% precision: Integrate external data like credit scores or spending habits.

🎯 Feature Importance

Top 10 Drivers of Customer Conversion

Rank	Feature	Category	Importance
1️⃣	`euribor3m`	Economic	⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛
2️⃣	`nr.employed`	Economic	⬛⬛⬛⬛⬛⬛⬛⬛⬛
3️⃣	`age`	Demographic	⬛⬛⬛⬛⬛⬛⬛
4️⃣	`pdays`	Campaign	⬛⬛⬛⬛⬛⬛
5️⃣	`emp.var.rate`	Economic	⬛⬛⬛⬛⬛
6️⃣	`cons.price.idx`	Economic	⬛⬛⬛⬛
7️⃣	`campaign`	Campaign	⬛⬛⬛
8️⃣	`cons.conf.idx`	Economic	⬛⬛⬛
9️⃣	`previous`	Campaign	⬛⬛
🔟	`previously_contacted`	Engineered	⬛⬛

Key Insight: Macroeconomics > Demographics

📊 Interest rates and employment levels predict conversions better than customer age or job.

💡 Results & Business Recommendations

Strategic Recommendations

Strategy	Action	Expected Impact
Dynamic Scheduling	Call aggressively when interest rates are favorable	📈 +15-20% conversion
Retarget Warm Leads	Prioritize customers contacted in last 30 days	📈 +10% efficiency
Reduce Cold Calls	Filter out low-probability segments	💰 -25% call costs
Economic Monitoring	Track euribor3m and employment rates weekly	⏰ Timing optimization
Lead Prioritization	Implement scoring matrix based on top features	🎯 3x better targeting

🎯 Lead Prioritization Matrix

Priority Level	Criteria	Expected Conversion
High	Age 50+, contacted within 30 days, favorable economic indicators	35-45%
Medium	Age 30-50, contacted 30-180 days ago	20-30%
Low	Age <30, never contacted, unfavorable economy	5-10%

ROI Projection & Business Impact

Current State (Without ML Model)

Daily Call Volume:       1,000 calls
Conversion Rate:         11% (baseline)
Successful Conversions:  110 customers
Cost per Call:           $5
Daily Call Cost:         $5,000
Cost per Acquisition:    $45.45

Optimized State (With ML Model - Top 40% Probability Leads)

Daily Call Volume:       400 calls (filtered by model)
Conversion Rate:         27.5% (from 40% precision)
Successful Conversions:  110 customers (maintained)
Cost per Call:           $5
Daily Call Cost:         $2,000
Cost per Acquisition:    $18.18

📈 Financial Benefits

Cost Reduction: 60% fewer calls ($3,000 saved per day)
Annual Savings: ~$1,095,000 (assuming 365 days)
Efficiency Gain: Same revenue with 60% less effort
ROI Improvement: Cost per customer drops from $45.45 to $18.18 (2.5x better)

🚀 Scalability Potential

If reinvesting the saved $3,000/day into high-priority leads:

Option A: Double the contact volume on high-priority leads → Potential 50% revenue increase
Option B: Invest in data enrichment (credit scores, spending patterns) → Push precision from 40% to 55%

🛠️ Installation

Prerequisites

Python 3.8 or higher
Jupyter Notebook or Google Colab

Setup

# Clone the repository
git clone https://github.com/zyna-b/Bank-Conversion-Intelligence.git
cd Bank-Conversion-Intelligence

# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Required Libraries

pandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0

🚀 Usage

Run the Notebook

jupyter notebook Bank_Conversion_Intelligence.ipynb

Quick Start (Python)

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Load your data
df = pd.read_csv('bank-additional-full.csv', sep=';')

# Preprocess (see notebook for full pipeline)
# ...

# Train the champion model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    class_weight='balanced',
    random_state=42
)
rf_model.fit(X_train_scaled, y_train)

# Predict
predictions = rf_model.predict(X_test_scaled)

🔧 Technologies Used

Technology	Purpose
	Core programming language
	Data manipulation & analysis
	Numerical computing
	Data visualization
	Statistical visualization
	Machine learning models
	Interactive development

🤝 Contributing

Contributions are welcome! Here's how you can help:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Ideas for Contribution

Add XGBoost/LightGBM models
Implement SHAP explanations
Create a Flask/Streamlit web app
Add cross-validation analysis
Integrate hyperparameter tuning

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📬 Contact

Zyna B - GitHub

Project Link: https://github.com/zyna-b/Bank-Conversion-Intelligence

⭐ If this project helped you, please give it a star! ⭐

📚 References

UCI ML Repository - Bank Marketing Dataset
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

📊 Additional Resources

For detailed technical implementation and comprehensive analysis:

Feature Importance Analysis: See notebook cells after visualization
Hyperparameter Tuning Results: GridSearchCV optimization details
Model Limitations & Future Improvements: Version 2.0 roadmap
Complete Project Summary: Final conclusion section in notebook

🎓 Key Learnings

Technical Skills Demonstrated:

End-to-end ML pipeline (cleaning → EDA → modeling → evaluation)
Handling class imbalance with class_weight='balanced'
Feature engineering (previously_contacted creation)
Model comparison (Logistic Regression vs Decision Tree vs Random Forest)
Hyperparameter tuning with GridSearchCV
Data leakage prevention (dropping duration column)

Business Acumen:

Translating model metrics (precision/recall) into business terms (cost/revenue)
Balancing False Positives vs False Negatives based on business context
ROI calculation and projection
Strategic recommendations from data insights

Data Science Best Practices:

Train/test split with stratification
Feature scaling (StandardScaler)
Feature importance analysis for interpretability
Strategic handling of "unknown" values

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitignore		.gitignore
Bank_Conversion_Intelligence.ipynb		Bank_Conversion_Intelligence.ipynb
LICENSE		LICENSE
README.md		README.md
clean_notebook.py		clean_notebook.py
requirements.txt		requirements.txt

License

zyna-b/Bank-Conversion-Intelligence

Folders and files

Latest commit

History

Repository files navigation