A production-ready, two-stage machine learning pipeline designed to process massive financial datasets (28M+ records), automate loan triage at scale, and predict default probabilities.
This system utilizes a hierarchical risk pipeline to efficiently handle extreme class imbalances (92% rejection rate):
- Stage 1: The Gatekeeper (Automated Triage): An XGBoost classifier trained on the full 28M+ record dataset to instantly filter structurally unviable or "spam" applications.
- Stage 2: Default Probability (Risk Assessment): A high-precision model applied only to the ~268k accepted applications, predicting the likelihood of a "Charged Off" vs. "Fully Paid" outcome.
- Big Data & GPU Compute: Utilized NVIDIA RAPIDS (cuDF) to process 3.6GB+ of data, bypassing standard CPU memory limits for massive-scale feature engineering.
- Imbalance Management: Implemented strategic sampling and weighted loss functions to prevent the model from collapsing into majority-class guessing.
- Production-Ready Deployment: Models are serialized in lightweight `.json` formats with independent feature-list tracking in the `Code/Models/` directory, ensuring environment-agnostic inference.
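
As an illustration of the weighted-loss idea in the list above, here is a minimal sketch. It assumes binary labels where 1 marks the minority class, and the function name `positive_class_weight` is hypothetical rather than taken from the notebooks. The negative-to-positive ratio it returns is the kind of value commonly passed to XGBoost's `scale_pos_weight` parameter.

```python
def positive_class_weight(labels):
    """Return the negative-to-positive ratio for a list of 0/1 labels.

    Assumption: 1 marks the minority class. Up-weighting it by this
    ratio keeps the loss from being dominated by the majority class.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples; the weight is undefined")
    return n_neg / n_pos


# Toy data mirroring a 92% / 8% split (illustrative, not the real dataset).
toy_labels = [0] * 92 + [1] * 8
weight = positive_class_weight(toy_labels)
```

With a 92% majority class, each minority example counts roughly 11.5 times as much as a majority example in the loss.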
The Gatekeeper successfully learned the bank's underlying risk appetite, handling massive sparsity to automate initial triage.
Insight: The model autonomously identified the highest density of loan approvals at a 14.6% DTI, while acting as a hard filter for applicants with DTIs exceeding 40%.
Instead of relying on arbitrary accuracy metrics, the Default Probability model is calibrated strictly for business profitability.
By simulating expected net profit at each candidate approval threshold on the predicted probability of default, the analysis identifies the threshold that maximizes net profit while limiting charged-off loans.
| Approval Threshold | Net Profit ($M) | Defaults Avoided |
|---|---|---|
| 0.70 | $112.32M | 41,048 |
| 0.75 | $142.05M | 35,539 |
| 0.80 | $171.08M | 28,315 |
| 0.85 | $195.64M | 19,465 |
| 0.90 | $198.36M | 9,616 |
| 0.95 | $166.67M | 1,280 |
| 1.00 | $153.72M | 0 |
Insight: Among the thresholds tested, 0.90 yields the peak net profit of $198.36M, avoiding 9,616 high-risk defaults without over-restricting loan volume. Tightening the threshold further avoids more defaults but lowers profit, because profitable loans are rejected as well.
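
The threshold sweep behind a table like this can be sketched as follows. This is a simplified illustration, not the notebook's implementation. It assumes a loan is approved when its predicted default probability is at or below the threshold, which is consistent with a threshold of 1.00 avoiding zero defaults. The `Loan` fields, the per-loan gain and loss amounts, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    p_default: float        # predicted probability of default
    defaulted: bool         # observed outcome ("Charged Off" is True)
    gain_if_paid: float     # hypothetical profit when fully paid
    loss_if_default: float  # hypothetical loss when charged off


def sweep_thresholds(loans, thresholds):
    """Return (threshold, net_profit, defaults_avoided) for each threshold."""
    rows = []
    for t in thresholds:
        profit = 0.0
        avoided = 0
        for loan in loans:
            if loan.p_default <= t:
                # Approved: realize the gain or the loss.
                if loan.defaulted:
                    profit -= loan.loss_if_default
                else:
                    profit += loan.gain_if_paid
            elif loan.defaulted:
                # Rejected loan that would have defaulted.
                avoided += 1
        rows.append((t, profit, avoided))
    return rows


def best_threshold(rows):
    """Pick the row with the highest net profit."""
    return max(rows, key=lambda row: row[1])


toy_loans = [
    Loan(0.10, False, 100.0, 500.0),
    Loan(0.50, False, 100.0, 500.0),
    Loan(0.80, True, 100.0, 500.0),
    Loan(0.95, True, 100.0, 500.0),
]
rows = sweep_thresholds(toy_loans, [0.70, 0.90, 1.00])
best = best_threshold(rows)
```

On the toy data the lowest threshold wins because both high-risk loans default. On real data the optimum depends on how gains from repaid loans trade off against losses from defaults.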
- `Gatekeeper_Model.ipynb`: Massive-scale data cleaning, GPU-accelerated merging, and initial classification.
- `Default_Probability_Model.ipynb`: Secondary risk modeling, feature engineering for "Accepted" loans, and profit-curve optimization.
- `Code/Models/`: Serialized production-ready models and exact feature mappings.
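
To show why a feature list is tracked alongside each serialized model, here is a minimal sketch of storing the training-time column order as JSON and using it to align an incoming record at inference. The helper names, the JSON layout, and the feature names `dti` and `loan_amnt` are hypothetical. The actual files in the repository may be structured differently.

```python
import json


def dump_feature_list(features):
    """Serialize the ordered feature names to a JSON string."""
    return json.dumps({"features": list(features)}, indent=2)


def load_feature_list(text):
    """Recover the ordered feature names from the JSON string."""
    return json.loads(text)["features"]


def align_row(row, features):
    """Order a record's values to match the training-time feature order.

    Extra keys are ignored; a missing feature raises KeyError so that a
    schema mismatch fails loudly instead of silently shifting columns.
    """
    missing = [name for name in features if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [row[name] for name in features]


saved = dump_feature_list(["dti", "loan_amnt"])
features = load_feature_list(saved)
vector = align_row({"loan_amnt": 1000, "dti": 14.6, "extra": 1}, features)
```

Keeping the list separate from the model file lets any environment rebuild the input vector in the right order without depending on the original DataFrame.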



