
Loan Default Prediction with Machine Learning

A comprehensive machine learning project for predicting loan defaults using LendingTree data. This project demonstrates advanced data preprocessing, feature engineering, and model optimization techniques on a large-scale dataset of 2.2 million records.

Overview

The project works with a dataset of 2.2 million records and 150 columns. Preprocessing reduced the feature set by roughly 65%, from 150 columns to about 53, substantially shrinking the data without sacrificing model performance.

🛠️ Technologies Used

Core Technologies

  • Python 3.12+ - Primary programming language
  • Jupyter Notebooks - Interactive development and analysis
  • Git - Version control and project management

Data Science & Machine Learning

  • pandas - Data manipulation and analysis
  • NumPy - Numerical computing and array operations
  • scikit-learn - Machine learning algorithms and preprocessing
  • XGBoost - Gradient boosting framework for classification
  • scipy - Statistical analysis and scientific computing

Data Visualization & Analysis

  • matplotlib - Basic plotting and visualization
  • seaborn - Statistical data visualization
  • Point-biserial correlation - Feature-target relationship analysis

Testing & Quality Assurance

  • pytest - Testing framework and test automation
  • pytest-cov - Coverage reporting
  • Comprehensive test suite - 46 tests covering model performance, data validation, and edge cases

📊 Machine Learning Techniques

Data Preprocessing Pipeline

  • Feature Reduction - Reduced 150 features by ~65% through correlation analysis and feature selection
  • Missing Value Imputation - Strategic handling of missing data with domain-specific approaches
  • Outlier Treatment - IQR-based outlier capping to improve model robustness
  • One-Hot Encoding - Categorical variable transformation for model compatibility
  • Standard Scaling - Feature normalization for improved model performance (these steps are sketched below)
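
To make these steps concrete, here is a minimal sketch of the capping, encoding, and scaling logic. It uses a toy frame with hypothetical column names (loan_amnt, home_ownership), not the project's actual schema:

import pandas as pd
from sklearn.preprocessing import StandardScaler

def cap_outliers_iqr(df, cols, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - k * iqr, q3 + k * iqr)
    return out

# Toy data; the last loan_amnt value is an obvious outlier
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 250000],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
})
df = cap_outliers_iqr(df, ["loan_amnt"])                               # IQR-based capping
df = pd.get_dummies(df, columns=["home_ownership"], drop_first=True)   # one-hot encoding
df[["loan_amnt"]] = StandardScaler().fit_transform(df[["loan_amnt"]])  # standard scaling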

Feature Engineering

  • Derived Variables - Created debt-to-income ratios and payment-to-income ratios
  • Interest Rate Binning - Risk-based grouping using quantile discretization
  • Missing Value Indicators - Binary flags for systematic missing patterns
  • Correlation Analysis - Point-biserial correlation for feature-target relationships (illustrated in the sketch below)
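
All four techniques fit in a few lines of pandas/scipy. The sketch below uses invented column names (annual_inc, installment, int_rate, emp_length) as stand-ins for the real LendingTree fields:

import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.DataFrame({
    "annual_inc": [40000, 85000, 52000, 110000, 63000, 30000],
    "installment": [350, 600, 410, 900, 500, 280],
    "int_rate": [7.5, 11.2, 13.9, 9.8, 22.4, 17.1],
    "emp_length": [3.0, np.nan, 10.0, 5.0, np.nan, 1.0],
    "default": [0, 0, 1, 0, 1, 1],
})

# Derived variable: annualized payment-to-income ratio
df["payment_to_income"] = df["installment"] * 12 / df["annual_inc"]

# Risk-based interest-rate binning via quantile discretization
df["int_rate_bin"] = pd.qcut(df["int_rate"], q=3, labels=["low", "mid", "high"])

# Binary indicator for a systematically missing field
df["emp_length_missing"] = df["emp_length"].isna().astype(int)

# Point-biserial correlation between the binary target and a continuous feature
r, p = pointbiserialr(df["default"], df["int_rate"])
print(f"point-biserial r = {r:.3f} (p = {p:.3f})")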

Class Imbalance Handling

  • Scale Position Weight - XGBoost's native class-imbalance adjustment via scale_pos_weight (computed in the sketch below)
  • Stratified Sampling - Maintained class distribution in train/test splits
  • Performance Metrics - ROC AUC prioritized over accuracy for imbalanced data
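
Both ideas are one-liners in practice. A sketch on synthetic data (the real X and y come from the preprocessed dataset):

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the 87/13 class split described below
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.13).astype(int)

# stratify=y keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# XGBoost convention: scale_pos_weight = (# negatives) / (# positives)
spw = (y_train == 0).sum() / (y_train == 1).sum()
print(f"scale_pos_weight ~ {spw:.1f}")  # roughly 0.87 / 0.13, about 6.7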

Model Selection & Performance

  • LazyPredict - Attempted first, but it exhausted memory on the full dataset (several of its default models grow trees to unlimited depth)
  • Random Forest - Tested, but delivered negligible gains on this classification task
  • XGBoost (Extreme Gradient Boosting) - Selected as the final model, achieving a 72.78% ROC AUC (training sketch below)
    • Native class-imbalance handling eliminates the need for SMOTE
    • Efficient memory usage for large datasets
    • Superior performance on tabular data with mixed feature types
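
A minimal sketch of the final training setup on synthetic data; the hyperparameter values here are illustrative, not the project's tuned settings:

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic imbalanced data standing in for the prepared loan features
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.87], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = XGBClassifier(
    n_estimators=300,          # illustrative values only
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="auc",
)
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard labels
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))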

🔍 Model Performance

Final Results

  • ROC AUC Score: 72.78%
  • Accuracy: 64.15%
  • Dataset Size: 2.2M records processed
  • Feature Reduction: 65% reduction from 150 to ~53 features
  • Class Distribution: 87% good loans, 13% defaults (handled via scale_pos_weight)

Key Findings

The results aligned well with real-world predictors of default risk:

🎯 Strong Predictive Features

  • Payment Method - Borrowers paying via direct pay show lower default risk
  • Verification Status - Income verification strongly correlates with repayment
  • Account Activity - Number of accounts opened in last 24 months indicates credit behavior

📊 Moderate Predictive Features

  • Interest Rate - Reflects borrower risk but carries limited direct predictive power
  • Debt-to-Income Ratio - Ranked near the 50th percentile in feature importance, less impactful than anticipated (see the importance sketch below)
  • Credit History Length - Moderate influence on default probability
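
Rankings like "~50th percentile" come straight out of the fitted model's importances. A self-contained sketch with invented feature names:

import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=10, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders for real column names

model = XGBClassifier(n_estimators=100).fit(X, y)

importances = pd.Series(model.feature_importances_, index=names)
print(importances.sort_values(ascending=False))   # strongest predictors first
print(importances.rank(pct=True).round(2))        # percentile rank of each feature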

🚀 Future Enhancements

Model Improvements

  • SMOTE Implementation - Apply synthetic oversampling techniques for better class balance handling
  • Hyperparameter Tuning - Grid search optimization of XGBoost parameters (sketched below)
  • Ensemble Methods - Explore stacking and voting classifiers for performance gains
  • Feature Engineering - Create interaction terms and polynomial features
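
As a flavor of the planned tuning step, a minimal GridSearchCV sketch over a toy parameter grid (a real search space would be wider and tuned to the actual data):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.87], random_state=0)

param_grid = {                   # illustrative grid only
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))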

Technical Enhancements

  • Model Deployment - API development for real-time predictions
  • Feature Store - Centralized feature management and versioning
  • MLOps Pipeline - Automated model training, validation, and deployment
  • Explainability - SHAP values and LIME for model interpretability
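
For the explainability item, SHAP's tree explainer works directly with XGBoost; a minimal sketch on synthetic data:

import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1_000, n_features=8, random_state=0)
model = XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)   # tree-aware, fast for gradient-boosted models
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)       # global view of per-feature impact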

🧪 Testing

This project includes a comprehensive test suite to validate model performance, data quality, and edge cases.

Setup Environment

Create and activate a virtual environment:

# Create virtual environment
python3 -m venv venv

# Activate environment (Linux/Mac)
source venv/bin/activate

# Or use the helper script
source activate_env.sh

# Install dependencies
pip install -r requirements-test.txt

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test categories
python -m pytest tests/test_model_performance.py -v    # Model performance tests
python -m pytest tests/test_data_validation.py -v     # Data validation tests  
python -m pytest tests/test_edge_cases.py -v          # Edge case tests

# Run with coverage report
python -m pytest tests/ --cov=tests --cov-report=html

# Run tests and generate detailed output
python -m pytest tests/ -v --tb=long

Test Categories

Model Performance Tests (12 tests)

  • Accuracy, precision, recall, F1-score validation
  • ROC AUC score validation (≥0.5 threshold; example below)
  • Prediction probability ranges and consistency
  • Feature importance validation
  • Class imbalance handling
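
The actual fixtures live under tests/; purely to illustrate the threshold checks above, two hypothetical tests (fixture names model, X_test, y_test are assumptions, not the repo's) might look like:

from sklearn.metrics import roc_auc_score

def test_roc_auc_beats_chance(model, X_test, y_test):
    # ROC AUC must exceed the 0.5 chance-level threshold
    proba = model.predict_proba(X_test)[:, 1]
    assert roc_auc_score(y_test, proba) >= 0.5

def test_probabilities_in_unit_interval(model, X_test):
    # Predicted probabilities must stay within [0, 1]
    proba = model.predict_proba(X_test)[:, 1]
    assert ((proba >= 0.0) & (proba <= 1.0)).all()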

Data Validation Tests (18 tests)

  • Schema validation (required columns, data types)
  • Value range validation (positive amounts, valid rates)
  • Missing value thresholds (<50%)
  • Duplicate detection (<5%)
  • Outlier detection and correlation analysis
  • Class distribution validation

Edge Case Tests (16 tests)

  • Empty data handling
  • Extreme value inputs (zeros, ones, very large/small numbers)
  • Single row predictions and boundary conditions
  • High-dimensional data and class imbalance scenarios
  • Mixed data types and numerical precision edge cases

Test Results

All 46 tests pass, providing confidence in:

  • Model reliability and performance thresholds
  • Data quality and preprocessing pipeline
  • Robustness against edge cases and unexpected inputs

Deactivate Environment

deactivate

This project provided valuable insights into handling large datasets and reinforced the importance of method selection and preprocessing in machine learning pipelines.
