🏦 Bank Conversion Intelligence


Predict customer subscription to bank term deposits using machine learning — an end-to-end data science project that transforms raw marketing campaign data into actionable business intelligence.


🎯 Overview

Bank Conversion Intelligence is a machine learning project designed to predict whether a bank customer will subscribe to a term deposit based on direct marketing campaign data. This project demonstrates the complete ML pipeline from data cleaning to model deployment insights.

What This Project Does:

  • ✅ Predicts customer subscription likelihood with 85% accuracy
  • ✅ Identifies the top 10 drivers of customer conversion
  • ✅ Provides actionable business recommendations for marketing optimization
  • ✅ Compares multiple ML algorithms (Logistic Regression, Decision Tree, Random Forest)

💼 Business Problem

The Challenge

Banks conduct direct marketing campaigns (phone calls) to sell term deposits to customers. However:

  • Only ~11% of customers actually subscribe
  • Cold-calling is expensive and time-consuming
  • Sales teams waste resources on uninterested leads

The Solution

Build a predictive model that:

  1. Identifies high-potential customers before the call
  2. Reduces wasted calls by filtering unlikely converters
  3. Maximizes ROI by focusing on warm leads

📊 Dataset

| Attribute | Description |
|-----------|-------------|
| Source | UCI Machine Learning Repository - Bank Marketing Dataset |
| Records | 41,188 customer records |
| Features | 20 attributes (demographic, campaign, economic) |
| Target | y - has the client subscribed to a term deposit? (yes/no) |

Feature Categories

| Category | Features |
|----------|----------|
| Demographics | age, job, marital status, education |
| Financial | housing loan, personal loan |
| Campaign Data | contact type, month, day, duration, campaign count |
| Previous Campaign | pdays, previous contacts, outcome |
| Economic Indicators | employment rate, consumer price index, euribor rate |

🔄 Project Workflow

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Loading   │───▶│  Data Cleaning  │───▶│      EDA        │
│   & Inspection  │    │  & Preprocessing│    │  (Exploration)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                      │
                                                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Business     │◀───│     Model       │◀───│    Feature      │
│    Insights     │    │   Comparison    │    │   Engineering   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Step-by-Step Process

  1. Data Cleaning

    • Removed 12 duplicate records
    • Dropped default column (20.9% unknown values; see Unknown Value Analysis below)
    • Dropped duration column (prevents data leakage)
    • Imputed education unknowns using job-group mode
  2. Feature Engineering

    • Created previously_contacted binary feature from pdays
    • Winsorized campaign outliers at 99th percentile
    • One-hot encoded all categorical variables
  3. Data Preprocessing

    • Train/Test split: 80/20 with stratification
    • StandardScaler normalization for numerical features (steps 1-3 are sketched in code after this list)
  4. Model Training & Evaluation

    • Tested 3 algorithms with class_weight='balanced'
    • Evaluated using Precision, Recall, F1-Score, and Accuracy
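
A condensed sketch of steps 1-3, assuming the raw UCI column names and the dataset's pdays=999 sentinel for "never contacted"; the notebook holds the full pipeline, and for brevity this sketch scales every column rather than only the numeric ones:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('bank-additional-full.csv', sep=';')

# Step 1: cleaning
df = df.drop_duplicates()
df = df.drop(columns=['default', 'duration'])  # mostly-unknown column; leakage column
df = df[(df['job'] != 'unknown') & (df['marital'] != 'unknown')]  # drop sparse unknowns
job_mode = df[df['education'] != 'unknown'].groupby('job')['education'].agg(lambda s: s.mode()[0])
mask = df['education'] == 'unknown'
df.loc[mask, 'education'] = df.loc[mask, 'job'].map(job_mode)  # impute by job-group mode

# Step 2: feature engineering
df['previously_contacted'] = (df['pdays'] != 999).astype(int)  # 999 = never contacted
df['campaign'] = df['campaign'].clip(upper=df['campaign'].quantile(0.99))  # winsorize
X = pd.get_dummies(df.drop(columns=['y']), drop_first=True)
y = (df['y'] == 'yes').astype(int)

# Step 3: stratified 80/20 split, then scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)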

🔍 Key Findings

Class Imbalance Discovery

Target Distribution:
├── No (Did not subscribe):  88.7%
└── Yes (Subscribed):        11.3%
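
This split can be verified in one line on the raw target column (assuming the UCI column name y):

print(df['y'].value_counts(normalize=True).round(3))  # no ~0.887, yes ~0.113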

Unknown Value Analysis

| Column | Unknown % | Action Taken |
|--------|-----------|--------------|
| default | 20.9% | Dropped column |
| education | 4.2% | Imputed by job group |
| job | 0.8% | Dropped rows |
| marital | 0.2% | Dropped rows |
| housing & loan | 2.4% | Kept as category ("declined to answer") |
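
The audit itself is short, since the dataset encodes missing values as the literal string "unknown":

# Percentage of "unknown" entries per column
unknown_pct = (df == 'unknown').mean().mul(100).round(1)
print(unknown_pct[unknown_pct > 0].sort_values(ascending=False))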

📈 Model Performance

Comparison Summary

| Model | Accuracy | Precision (Yes) | Recall (Yes) | F1-Score |
|-------|----------|-----------------|--------------|----------|
| Logistic Regression | 77% | 36% | 63% | 46% |
| Decision Tree | 79% | 30% | 33% | 31% |
| Random Forest 🏆 | 85% | 40% | 61% | 48% |
| Tuned Random Forest | 85% | 40% | 61% | 48% |
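
A sketch of the comparison loop, assuming the scaled splits from the workflow section; hyperparameters beyond those named in this README are defaults, and max_iter is raised only to ensure convergence:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

models = {
    'Logistic Regression': LogisticRegression(class_weight='balanced', max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(class_weight='balanced', random_state=42),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, max_depth=10, class_weight='balanced', random_state=42
    ),
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test_scaled)))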

Model Analysis

🥉 Logistic Regression (Baseline)

  • Strategy: Linear model with balanced class weights
  • Result: High Recall (63%) but Low Precision (36%)
  • Business Interpretation: "Aggressive Sales Manager" — catches most buyers but wastes resources on many non-buyers

🥈 Decision Tree (Challenger)

  • Strategy: Non-linear model to capture complex patterns
  • Result: Failed with only 33% Recall and 30% Precision
  • Business Interpretation: Overfitting to training noise — "fired" this model

🥇 Random Forest (Champion) ✅

  • Strategy: Ensemble of 100 trees with max_depth=10
  • Result: Best trade-off with 61% Recall and 40% Precision
  • Business Interpretation: "Efficient Sales Director" — maintains high buyer capture while improving precision by 4 percentage points over the baseline, cutting wasted calls

🔧 Hyperparameter Tuned Random Forest

  • Strategy: GridSearchCV tested 81 parameter combinations across 5-fold CV
  • Result: Optimal configuration maintains 85% accuracy with balanced precision-recall
  • Conclusion: Baseline Random Forest was already well-optimized
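
The exact grid lives in the notebook; a representative 81-combination setup (three values for each of four hyperparameters, 3 x 3 x 3 x 3 = 81) might look like the following, where the grid values and scoring='f1' are assumptions:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}  # 3^4 = 81 combinations, each evaluated with 5-fold CV
search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    param_grid, cv=5, scoring='f1', n_jobs=-1,
)
search.fit(X_train_scaled, y_train)
print(search.best_params_, search.best_score_)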

Why Not 90% Precision?

The ~40% precision ceiling exists due to:

  1. Data Limitations: No real-time income, debt levels, or immediate financial needs
  2. Human Variance: Emotional/external factors invisible to the model

To break 50% precision: Integrate external data like credit scores or spending habits.


🎯 Feature Importance

Top 10 Drivers of Customer Conversion

| Rank | Feature | Category | Importance |
|------|---------|----------|------------|
| 1️⃣ | euribor3m | Economic | ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ |
| 2️⃣ | nr.employed | Economic | ⬛⬛⬛⬛⬛⬛⬛⬛⬛ |
| 3️⃣ | age | Demographic | ⬛⬛⬛⬛⬛⬛⬛ |
| 4️⃣ | pdays | Campaign | ⬛⬛⬛⬛⬛⬛ |
| 5️⃣ | emp.var.rate | Economic | ⬛⬛⬛⬛⬛ |
| 6️⃣ | cons.price.idx | Economic | ⬛⬛⬛⬛ |
| 7️⃣ | campaign | Campaign | ⬛⬛⬛ |
| 8️⃣ | cons.conf.idx | Economic | ⬛⬛⬛ |
| 9️⃣ | previous | Campaign | ⬛⬛ |
| 🔟 | previously_contacted | Engineered | ⬛⬛ |
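
The ranking comes straight from the fitted model's feature_importances_ attribute, given the trained Random Forest (rf_model in the Quick Start below) and the training columns:

import pandas as pd

# Pair each column with its importance and show the top 10 drivers
importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))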

Key Insight: Macroeconomics > Demographics

📊 Interest rates and employment levels predict conversions better than customer age or job.


💡 Results & Business Recommendations

Strategic Recommendations

| Strategy | Action | Expected Impact |
|----------|--------|-----------------|
| Dynamic Scheduling | Call aggressively when interest rates are favorable | 📈 +15-20% conversion |
| Retarget Warm Leads | Prioritize customers contacted in last 30 days | 📈 +10% efficiency |
| Reduce Cold Calls | Filter out low-probability segments | 💰 -25% call costs |
| Economic Monitoring | Track euribor3m and employment rates weekly | ⏰ Timing optimization |
| Lead Prioritization | Implement scoring matrix based on top features | 🎯 3x better targeting |

🎯 Lead Prioritization Matrix

| Priority Level | Criteria | Expected Conversion |
|----------------|----------|---------------------|
| High | Age 50+, contacted within 30 days, favorable economic indicators | 35-45% |
| Medium | Age 30-50, contacted 30-180 days ago | 20-30% |
| Low | Age <30, never contacted, unfavorable economy | 5-10% |
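
A hypothetical helper that mirrors this matrix; the function name and thresholds are illustrative, not taken from the notebook:

def lead_priority(age, days_since_contact, economy_favorable):
    """Illustrative scorer mirroring the matrix above."""
    if age >= 50 and days_since_contact is not None \
            and days_since_contact <= 30 and economy_favorable:
        return 'High'    # expected ~35-45% conversion
    if 30 <= age <= 50 and days_since_contact is not None \
            and days_since_contact <= 180:
        return 'Medium'  # expected ~20-30% conversion
    return 'Low'         # expected ~5-10% conversion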

ROI Projection & Business Impact

Current State (Without ML Model)

Daily Call Volume:       1,000 calls
Conversion Rate:         11% (baseline)
Successful Conversions:  110 customers
Cost per Call:           $5
Daily Call Cost:         $5,000
Cost per Acquisition:    $45.45

Optimized State (With ML Model - Top 40% Probability Leads)

Daily Call Volume:       400 calls (filtered by model)
Conversion Rate:         27.5% (from 40% precision)
Successful Conversions:  110 customers (maintained)
Cost per Call:           $5
Daily Call Cost:         $2,000
Cost per Acquisition:    $18.18
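
The cost-per-acquisition figures follow directly from the stated assumptions:

# Sanity-check the optimized-state economics
calls, conversion_rate, cost_per_call = 400, 0.275, 5.00
conversions = calls * conversion_rate        # 110 customers
daily_cost = calls * cost_per_call           # $2,000
print(round(daily_cost / conversions, 2))    # $18.18 per acquisition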

📈 Financial Benefits

  • Cost Reduction: 60% fewer calls ($3,000 saved per day)
  • Annual Savings: ~$1,095,000 (assuming 365 days)
  • Efficiency Gain: Same revenue with 60% less effort
  • ROI Improvement: Cost per customer drops from $45.45 to $18.18 (2.5x better)

🚀 Scalability Potential

If reinvesting the saved $3,000/day into high-priority leads:

  • Option A: Double the contact volume on high-priority leads → Potential 50% revenue increase
  • Option B: Invest in data enrichment (credit scores, spending patterns) → Push precision from 40% to 55%

🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • Jupyter Notebook or Google Colab

Setup

# Clone the repository
git clone https://github.com/zyna-b/Bank-Conversion-Intelligence.git
cd Bank-Conversion-Intelligence

# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Required Libraries

pandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=0.24.0

🚀 Usage

Run the Notebook

jupyter notebook Bank_Conversion_Intelligence.ipynb

Quick Start (Python)

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Load your data
df = pd.read_csv('bank-additional-full.csv', sep=';')

# Preprocess (see notebook for full pipeline)
# ...

# Train the champion model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    class_weight='balanced',
    random_state=42
)
rf_model.fit(X_train_scaled, y_train)

# Predict
predictions = rf_model.predict(X_test_scaled)
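
To reproduce the "top 40% probability leads" filter from the ROI projection, one plausible approach is to rank leads by predict_proba; the 40% cut comes from the projection, while the code itself is a sketch:

import numpy as np

# Score leads and keep the top 40% by predicted subscription probability
proba = rf_model.predict_proba(X_test_scaled)[:, 1]
cutoff = np.quantile(proba, 0.60)   # 60th percentile -> top 40% remain
priority_leads = X_test[proba >= cutoff]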

🔧 Technologies Used

| Technology | Purpose |
|------------|---------|
| Python | Core programming language |
| Pandas | Data manipulation & analysis |
| NumPy | Numerical computing |
| Matplotlib | Data visualization |
| Seaborn | Statistical visualization |
| scikit-learn | Machine learning models |
| Jupyter | Interactive development |

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Ideas for Contribution

  • Add XGBoost/LightGBM models
  • Implement SHAP explanations
  • Create a Flask/Streamlit web app
  • Add cross-validation analysis
  • Integrate hyperparameter tuning

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


📬 Contact

Zyna B - GitHub

Project Link: https://github.com/zyna-b/Bank-Conversion-Intelligence


⭐ If this project helped you, please give it a star! ⭐


📚 References

  • UCI ML Repository - Bank Marketing Dataset
  • [Moro et al., 2014] S. Moro, P. Cortez, and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

📊 Additional Resources

For detailed technical implementation and comprehensive analysis:

  • Feature Importance Analysis: See notebook cells after visualization
  • Hyperparameter Tuning Results: GridSearchCV optimization details
  • Model Limitations & Future Improvements: Version 2.0 roadmap
  • Complete Project Summary: Final conclusion section in notebook

🎓 Key Learnings

Technical Skills Demonstrated:

  • End-to-end ML pipeline (cleaning → EDA → modeling → evaluation)
  • Handling class imbalance with class_weight='balanced'
  • Feature engineering (previously_contacted creation)
  • Model comparison (Logistic Regression vs Decision Tree vs Random Forest)
  • Hyperparameter tuning with GridSearchCV
  • Data leakage prevention (dropping duration column)

Business Acumen:

  • Translating model metrics (precision/recall) into business terms (cost/revenue)
  • Balancing False Positives vs False Negatives based on business context
  • ROI calculation and projection
  • Strategic recommendations from data insights

Data Science Best Practices:

  • Train/test split with stratification
  • Feature scaling (StandardScaler)
  • Feature importance analysis for interpretability
  • Strategic handling of "unknown" values
