chaboihenry/Credit-Risk

🏦 Strategic Credit Risk & Automated Underwriting Pipeline

A production-ready, two-stage machine learning pipeline designed to process massive financial datasets (28M+ records), automate loan triage at scale, and predict default probabilities.

🚀 System Architecture

The system uses a hierarchical, two-stage risk pipeline to handle extreme class imbalance (a 92% rejection rate):

  1. Stage 1: The Gatekeeper (Automated Triage): An XGBoost classifier trained on the full 28M+ record dataset to instantly filter structurally unviable or "spam" applications.
  2. Stage 2: Default Probability (Risk Assessment): A high-precision model applied only to the ~268k accepted applications, estimating the probability of a "Charged Off" vs. "Fully Paid" outcome.
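The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: `two_stage_score`, `gate_threshold`, and the two toy model functions are hypothetical names, and the toy scoring rules stand in for the trained XGBoost models.

```python
from typing import Callable, Mapping, Optional

Application = Mapping[str, float]


def two_stage_score(
    application: Application,
    gatekeeper: Callable[[Application], float],
    default_model: Callable[[Application], float],
    gate_threshold: float = 0.5,  # hypothetical cut-off, not taken from the repo
) -> Optional[float]:
    """Return the Stage 2 default probability, or None if Stage 1 rejects."""
    p_accept = gatekeeper(application)
    if p_accept < gate_threshold:
        return None  # filtered out by triage; Stage 2 never runs
    return default_model(application)


# Toy stand-ins for the two trained models (illustrative only).
def toy_gatekeeper(app: Application) -> float:
    return 0.9 if app["dti"] <= 40 else 0.1


def toy_default_model(app: Application) -> float:
    return min(1.0, app["dti"] / 100)


rejected = two_stage_score({"dti": 55.0}, toy_gatekeeper, toy_default_model)
scored = two_stage_score({"dti": 20.0}, toy_gatekeeper, toy_default_model)
```

Because Stage 2 only runs on applications that pass Stage 1, the more expensive default model never sees the bulk of rejected traffic.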

🛠️ Technical Highlights

  • Big Data & GPU Compute: Utilized NVIDIA RAPIDS (cuDF) to process 3.6GB+ of data, bypassing standard CPU memory limits for massive-scale feature engineering.
  • Imbalance Management: Implemented strategic sampling and weighted loss functions to prevent the model from collapsing into majority-class guessing.
  • Production-Ready Deployment: Models are serialized in lightweight .json formats with independent feature-list tracking in the Code/Models/ directory, ensuring environment-agnostic inference.
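One common way to weight the loss for an imbalanced binary target is the negative-to-positive count ratio, which is the value usually passed to XGBoost's `scale_pos_weight` parameter. The sketch below assumes accepted applications are labeled 1 and rejected ones 0; the repository may encode the classes differently or use another weighting scheme.

```python
from collections import Counter
from typing import Iterable


def scale_pos_weight(labels: Iterable[int]) -> float:
    """Ratio of negative to positive examples for a 0/1 target."""
    counts = Counter(labels)
    n_pos, n_neg = counts[1], counts[0]
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Toy target mirroring a 92% rejection rate (assumed: 0 = rejected, 1 = accepted).
labels = [0] * 92 + [1] * 8
weight = scale_pos_weight(labels)  # 92 / 8
```

The resulting value makes each minority-class example count proportionally more in the loss, which discourages the model from predicting the majority class for everything.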

📊 Business Logic & Model Insights

1. The Gatekeeper (Approval Logic)

The Gatekeeper learned the lender's underlying risk appetite from highly sparse application data, which allows it to automate initial triage.

[Figure: Target Imbalance]

[Figure: DTI Logic Visual]

Insight: The model's approvals are most densely concentrated at a 14.6% debt-to-income ratio (DTI), and the model acts as a hard filter for applicants with DTIs above 40%.
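A pattern like this can be checked by bucketing applications by DTI and computing the approval rate per bucket. The sketch below is illustrative only: `approval_rate_by_dti` is a hypothetical helper, and the sample records are made-up toy values, not rows from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def approval_rate_by_dti(
    records: Iterable[Tuple[float, bool]], bin_width: float = 10.0
) -> Dict[float, float]:
    """Map each DTI bucket's lower edge (in percent) to its approval rate."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for dti, ok in records:
        lower = int(dti // bin_width) * bin_width
        totals[lower] += 1
        approved[lower] += int(ok)
    return {b: approved[b] / totals[b] for b in sorted(totals)}


# Toy (dti_percent, approved) pairs, invented for illustration.
sample = [(12.0, True), (14.6, True), (18.0, False), (42.0, False), (47.5, False)]
rates = approval_rate_by_dti(sample)
```

A bucket whose rate drops to zero, such as the 40%+ bucket in this toy sample, is what a hard filter would look like in such a summary.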

2. Default Prediction (Profit Optimization)

Rather than optimizing for accuracy alone, the Default Probability model's approval threshold is tuned for business profitability.

[Figure: Predicted Probabilities]

By simulating expected net profit at each candidate approval threshold, the analysis identifies the threshold that maximizes net profit while limiting charged-off loans.

[Figure: Profit Optimization Curve]

| Approval Threshold | Net Profit ($M) | Defaults Avoided |
|---|---|---|
| 0.70 | 112.32 | 41,048 |
| 0.75 | 142.05 | 35,539 |
| 0.80 | 171.08 | 28,315 |
| 0.85 | 195.64 | 19,465 |
| 0.90 | 198.36 | 9,616 |
| 0.95 | 166.67 | 1,280 |
| 1.00 | 153.72 | 0 |

Insight: In this simulation, net profit peaks at $198.36 million at an approval threshold of 0.90, which avoids 9,616 high-risk defaults without over-restricting loan volume.
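A threshold sweep of this kind can be sketched as below. Several assumptions are made for illustration and are not confirmed by the repository: a loan is approved when its predicted default probability is at or below the threshold (consistent with a threshold of 1.00 avoiding zero defaults), and every loan has the same fixed gain if paid and the same fixed loss if charged off. `sweep_thresholds`, `best_threshold`, `gain_if_paid`, and `loss_if_default` are hypothetical names, and the inputs are toy values.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float, int]  # (threshold, net_profit, defaults_avoided)


def sweep_thresholds(
    p_default: Sequence[float],
    defaulted: Sequence[bool],
    thresholds: Sequence[float],
    gain_if_paid: float,
    loss_if_default: float,
) -> List[Row]:
    """Simulate net profit and defaults avoided at each approval threshold."""
    rows = []
    for t in thresholds:
        profit = 0.0
        avoided = 0
        for p, bad in zip(p_default, defaulted):
            if p <= t:  # approved: realize the loan's gain or loss
                profit += -loss_if_default if bad else gain_if_paid
            elif bad:  # rejected loan that would have defaulted
                avoided += 1
        rows.append((t, profit, avoided))
    return rows


def best_threshold(rows: Sequence[Row]) -> Row:
    """Pick the row with the highest simulated net profit."""
    return max(rows, key=lambda r: r[1])


# Toy inputs, invented for illustration.
rows = sweep_thresholds(
    p_default=[0.1, 0.3, 0.8, 0.95],
    defaulted=[False, False, False, True],
    thresholds=[0.5, 0.9, 1.0],
    gain_if_paid=1.0,
    loss_if_default=5.0,
)
best = best_threshold(rows)
```

Lowering the threshold avoids more defaults but also turns away loans that would have been repaid, which is why profit rises and then falls across the sweep.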

📁 Repository Structure

  • Gatekeeper_Model.ipynb: Massive-scale data cleaning, GPU-accelerated merging, and initial classification.
  • Default_Probability_Model.ipynb: Secondary risk modeling, feature engineering for "Accepted" loans, and profit-curve optimization.
  • Code/Models/: Serialized production-ready models and exact feature mappings.
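Storing the feature list next to the model lets inference code rebuild input rows in the column order used at training time. The sketch below shows that idea with a JSON round trip; the helper names and the feature names are illustrative assumptions, not the repository's actual files or schema.

```python
import json
from typing import Dict, List


def dump_feature_list(features: List[str]) -> str:
    """Serialize the training-time feature order as JSON text."""
    return json.dumps(features)


def align_features(row: Dict[str, float], features_json: str) -> List[float]:
    """Order a raw input row to match the saved feature list."""
    features = json.loads(features_json)
    missing = [f for f in features if f not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Extra keys in the row are ignored; order follows the saved list.
    return [row[f] for f in features]


# Hypothetical feature names and values for illustration.
saved = dump_feature_list(["dti", "loan_amnt", "annual_inc"])
raw_row = {"annual_inc": 60000.0, "dti": 14.6, "loan_amnt": 10000.0, "extra": 1.0}
vector = align_features(raw_row, saved)
```

Failing loudly on a missing feature is a deliberate choice here: silently filling a default could shift predictions without any visible error.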

About

Scalable 2-stage Credit Risk pipeline using XGBoost & NVIDIA RAPIDS. Analyzes 28M+ Lending Club loan records (2007-2018) to automate underwriting and predict default probability.
