
Loan Default Prediction with Machine Learning

A comprehensive machine learning project for predicting loan defaults using LendingTree data. This project demonstrates advanced data preprocessing, feature engineering, and model optimization techniques on a large-scale dataset of 2.2 million records.

Overview

The project works with a dataset of 2.2 million records and 150 columns. Preprocessing reduced the feature set by roughly 65%, from 150 columns to about 53, substantially shrinking the data without sacrificing model performance.

🛠️ Technologies Used

Core Technologies

  • Python 3.12+ - Primary programming language
  • Jupyter Notebooks - Interactive development and analysis
  • Git - Version control and project management

Data Science & Machine Learning

  • pandas - Data manipulation and analysis
  • NumPy - Numerical computing and array operations
  • scikit-learn - Machine learning algorithms and preprocessing
  • XGBoost - Gradient boosting framework for classification
  • scipy - Statistical analysis and scientific computing

Data Visualization & Analysis

  • matplotlib - Basic plotting and visualization
  • seaborn - Statistical data visualization
  • Point-biserial correlation - Feature-target relationship analysis

Testing & Quality Assurance

  • pytest - Testing framework and test automation
  • pytest-cov - Coverage reporting
  • Comprehensive test suite - 46 tests covering model performance, data validation, and edge cases

📊 Machine Learning Techniques

Data Preprocessing Pipeline

  • Feature Reduction - Reduced 150 features by ~65% through correlation analysis and feature selection
  • Missing Value Imputation - Strategic handling of missing data with domain-specific approaches
  • Outlier Treatment - IQR-based outlier capping to improve model robustness
  • One-Hot Encoding - Categorical variable transformation for model compatibility
  • Standard Scaling - Feature normalization for improved model performance (these steps are sketched below)
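
To make these steps concrete, here is a minimal sketch of the capping, encoding, and scaling logic. It uses a toy frame with hypothetical column names (loan_amnt, home_ownership), not the project's actual schema:

import pandas as pd
from sklearn.preprocessing import StandardScaler

def cap_outliers_iqr(df, cols, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - k * iqr, q3 + k * iqr)
    return out

# Toy data; the last loan_amnt value is an obvious outlier
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 250000],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
})
df = cap_outliers_iqr(df, ["loan_amnt"])                               # IQR-based capping
df = pd.get_dummies(df, columns=["home_ownership"], drop_first=True)   # one-hot encoding
df[["loan_amnt"]] = StandardScaler().fit_transform(df[["loan_amnt"]])  # standard scaling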

Feature Engineering

  • Derived Variables - Created debt-to-income ratios and payment-to-income ratios
  • Interest Rate Binning - Risk-based grouping using quantile discretization
  • Missing Value Indicators - Binary flags for systematic missing patterns
  • Correlation Analysis - Point-biserial correlation for feature-target relationships (illustrated in the sketch below)
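
All four techniques fit in a few lines of pandas/scipy. The sketch below uses invented column names (annual_inc, installment, int_rate, emp_length) as stand-ins for the real LendingTree fields:

import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.DataFrame({
    "annual_inc": [40000, 85000, 52000, 110000, 63000, 30000],
    "installment": [350, 600, 410, 900, 500, 280],
    "int_rate": [7.5, 11.2, 13.9, 9.8, 22.4, 17.1],
    "emp_length": [3.0, np.nan, 10.0, 5.0, np.nan, 1.0],
    "default": [0, 0, 1, 0, 1, 1],
})

# Derived variable: annualized payment-to-income ratio
df["payment_to_income"] = df["installment"] * 12 / df["annual_inc"]

# Risk-based interest-rate binning via quantile discretization
df["int_rate_bin"] = pd.qcut(df["int_rate"], q=3, labels=["low", "mid", "high"])

# Binary indicator for a systematically missing field
df["emp_length_missing"] = df["emp_length"].isna().astype(int)

# Point-biserial correlation between the binary target and a continuous feature
r, p = pointbiserialr(df["default"], df["int_rate"])
print(f"point-biserial r = {r:.3f} (p = {p:.3f})")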

Class Imbalance Handling

  • Scale Position Weight - XGBoost's native class-imbalance adjustment via scale_pos_weight (computed in the sketch below)
  • Stratified Sampling - Maintained class distribution in train/test splits
  • Performance Metrics - ROC AUC prioritized over accuracy for imbalanced data
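
Both ideas are one-liners in practice. A sketch on synthetic data (the real X and y come from the preprocessed dataset):

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the 87/13 class split described below
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.13).astype(int)

# stratify=y keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# XGBoost convention: scale_pos_weight = (# negatives) / (# positives)
spw = (y_train == 0).sum() / (y_train == 1).sum()
print(f"scale_pos_weight ~ {spw:.1f}")  # roughly 0.87 / 0.13, about 6.7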

Model Selection & Performance

  • LazyPredict - Attempted first, but it exhausted memory on the full dataset (several of its default models grow trees to unlimited depth)
  • Random Forest - Tested, but delivered negligible gains on this classification task
  • XGBoost (Extreme Gradient Boosting) - Selected as the final model, achieving a 72.78% ROC AUC (training sketch below)
    • Native class-imbalance handling eliminates the need for SMOTE
    • Efficient memory usage for large datasets
    • Superior performance on tabular data with mixed feature types
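
A minimal sketch of the final training setup on synthetic data; the hyperparameter values here are illustrative, not the project's tuned settings:

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic imbalanced data standing in for the prepared loan features
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.87], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = XGBClassifier(
    n_estimators=300,          # illustrative values only
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="auc",
)
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard labels
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))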

🔍 Model Performance

Final Results

  • ROC AUC Score: 72.78%
  • Accuracy: 64.15%
  • Dataset Size: 2.2M records processed
  • Feature Reduction: 65% reduction from 150 to ~53 features
  • Class Distribution: 87% good loans, 13% defaults (handled via scale_pos_weight)

Key Findings

The results aligned well with real-world predictors of default risk:

🎯 Strong Predictive Features

  • Payment Method - Borrowers paying via direct pay show lower default risk
  • Verification Status - Income verification strongly correlates with repayment
  • Account Activity - Number of accounts opened in last 24 months indicates credit behavior

📊 Moderate Predictive Features

  • Interest Rate - Reflects borrower risk but carries limited direct predictive power
  • Debt-to-Income Ratio - Ranked near the 50th percentile in feature importance, less impactful than anticipated (see the importance sketch below)
  • Credit History Length - Moderate influence on default probability
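
Rankings like "~50th percentile" come straight out of the fitted model's importances. A self-contained sketch with invented feature names:

import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=10, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders for real column names

model = XGBClassifier(n_estimators=100).fit(X, y)

importances = pd.Series(model.feature_importances_, index=names)
print(importances.sort_values(ascending=False))   # strongest predictors first
print(importances.rank(pct=True).round(2))        # percentile rank of each feature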

🚀 Future Enhancements

Model Improvements

  • SMOTE Implementation - Apply synthetic oversampling techniques for better class balance handling
  • Hyperparameter Tuning - Grid search optimization of XGBoost parameters (sketched below)
  • Ensemble Methods - Explore stacking and voting classifiers for performance gains
  • Feature Engineering - Create interaction terms and polynomial features
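
As a flavor of the planned tuning step, a minimal GridSearchCV sketch over a toy parameter grid (a real search space would be wider and tuned to the actual data):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.87], random_state=0)

param_grid = {                   # illustrative grid only
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))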

Technical Enhancements

  • Model Deployment - API development for real-time predictions
  • Feature Store - Centralized feature management and versioning
  • MLOps Pipeline - Automated model training, validation, and deployment
  • Explainability - SHAP values and LIME for model interpretability
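
For the explainability item, SHAP's tree explainer works directly with XGBoost; a minimal sketch on synthetic data:

import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1_000, n_features=8, random_state=0)
model = XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)   # tree-aware, fast for gradient-boosted models
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)       # global view of per-feature impact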

🧪 Testing

This project includes a comprehensive test suite to validate model performance, data quality, and edge cases.

Setup Environment

Create and activate a virtual environment:

# Create virtual environment
python3 -m venv venv

# Activate environment (Linux/Mac)
source venv/bin/activate

# Or use the helper script
source activate_env.sh

# Install dependencies
pip install -r requirements-test.txt

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test categories
python -m pytest tests/test_model_performance.py -v    # Model performance tests
python -m pytest tests/test_data_validation.py -v     # Data validation tests  
python -m pytest tests/test_edge_cases.py -v          # Edge case tests

# Run with coverage report
python -m pytest tests/ --cov=tests --cov-report=html

# Run tests and generate detailed output
python -m pytest tests/ -v --tb=long

Test Categories

Model Performance Tests (12 tests)

  • Accuracy, precision, recall, F1-score validation
  • ROC AUC score validation (≥0.5 threshold; example below)
  • Prediction probability ranges and consistency
  • Feature importance validation
  • Class imbalance handling
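
The actual fixtures live under tests/; purely to illustrate the threshold checks above, two hypothetical tests (fixture names model, X_test, y_test are assumptions, not the repo's) might look like:

from sklearn.metrics import roc_auc_score

def test_roc_auc_beats_chance(model, X_test, y_test):
    # ROC AUC must exceed the 0.5 chance-level threshold
    proba = model.predict_proba(X_test)[:, 1]
    assert roc_auc_score(y_test, proba) >= 0.5

def test_probabilities_in_unit_interval(model, X_test):
    # Predicted probabilities must stay within [0, 1]
    proba = model.predict_proba(X_test)[:, 1]
    assert ((proba >= 0.0) & (proba <= 1.0)).all()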

Data Validation Tests (18 tests)

  • Schema validation (required columns, data types)
  • Value range validation (positive amounts, valid rates)
  • Missing value thresholds (<50%)
  • Duplicate detection (<5%)
  • Outlier detection and correlation analysis
  • Class distribution validation

Edge Case Tests (16 tests)

  • Empty data handling
  • Extreme value inputs (zeros, ones, very large/small numbers)
  • Single row predictions and boundary conditions
  • High-dimensional data and class imbalance scenarios
  • Mixed data types and numerical precision edge cases

Test Results

All 46 tests pass, providing confidence in:

  • Model reliability and performance thresholds
  • Data quality and preprocessing pipeline
  • Robustness against edge cases and unexpected inputs

Deactivate Environment

deactivate

This project provided valuable insights into handling large datasets and reinforced the importance of method selection and preprocessing in machine learning pipelines.
