A comprehensive machine learning project for predicting loan defaults using LendingTree data. The project demonstrates advanced data preprocessing, feature engineering, and model optimization on a large-scale dataset of 2.2 million records and 150 columns; through preprocessing, the feature set was reduced by approximately 65% without sacrificing model performance.
- Python 3.12+ - Primary programming language
- Jupyter Notebooks - Interactive development and analysis
- Git - Version control and project management
- pandas - Data manipulation and analysis
- NumPy - Numerical computing and array operations
- scikit-learn - Machine learning algorithms and preprocessing
- XGBoost - Gradient boosting framework for classification
- scipy - Statistical analysis and scientific computing
- matplotlib - Basic plotting and visualization
- seaborn - Statistical data visualization
- Point-biserial correlation - Feature-target relationship analysis
- pytest - Testing framework and test automation
- pytest-cov - Coverage reporting
- Comprehensive test suite - 44 tests covering model performance, data validation, and edge cases
- Feature Reduction - Reduced 150 features by ~65% through correlation analysis and feature selection
- Missing Value Imputation - Strategic handling of missing data with domain-specific approaches
- Outlier Treatment - IQR-based outlier capping to improve model robustness
- One-Hot Encoding - Categorical variable transformation for model compatibility
- Standard Scaling - Feature normalization for improved model performance
- Derived Variables - Created debt-to-income ratios and payment-to-income ratios
- Interest Rate Binning - Risk-based grouping using quantile discretization
- Missing Value Indicators - Binary flags for systematic missing patterns
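Sketched in pandas, the capping, ratio, binning, and indicator steps above might look like the following. The column names (`loan_amnt`, `annual_inc`, `int_rate`, `emp_length`) are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the LendingTree data (columns are assumptions)
df = pd.DataFrame({
    "loan_amnt": [10000, 25000, 5000, 40000, 15000, 8000],
    "annual_inc": [60000, 90000, 30000, 120000, 75000, 45000],
    "int_rate": [7.5, 13.2, 19.9, 6.1, 11.4, 24.0],
    "emp_length": [5.0, np.nan, 2.0, 10.0, np.nan, 1.0],
})

# Derived variable: a loan-to-income style ratio
df["loan_to_income"] = df["loan_amnt"] / df["annual_inc"]

# Interest rate binning: quantile discretization into risk tiers
df["rate_tier"] = pd.qcut(df["int_rate"], q=3, labels=["low", "mid", "high"])

# Missing value indicator, added before imputing
df["emp_length_missing"] = df["emp_length"].isna().astype(int)
df["emp_length"] = df["emp_length"].fillna(df["emp_length"].median())

# IQR-based outlier capping
q1, q3 = df["annual_inc"].quantile([0.25, 0.75])
iqr = q3 - q1
df["annual_inc"] = df["annual_inc"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

The indicator column is created before imputation so the model can still see which rows were originally missing.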
- Correlation Analysis - Point-biserial correlation for feature-target relationships
- Scale Position Weight - XGBoost native class imbalance adjustment
- Stratified Sampling - Maintained class distribution in train/test splits
- Performance Metrics - ROC AUC prioritized over accuracy for imbalanced data
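A minimal sketch of the correlation, class-weighting, and stratified-split steps on synthetic data (the real project applies these to the LendingTree features):

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~13% defaults and one weakly predictive feature
rng = np.random.default_rng(0)
n = 10_000
y = (rng.random(n) < 0.13).astype(int)   # 1 = default
x = rng.normal(size=n) + 0.5 * y         # feature shifted slightly for defaults

# Point-biserial correlation between a continuous feature and the binary target
r, p_value = pointbiserialr(y, x)

# XGBoost's native imbalance handling: scale_pos_weight = negatives / positives
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

# Stratified split keeps the ~87/13 class distribution in both halves
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.2, stratify=y, random_state=42
)
```

With an 87/13 split, `scale_pos_weight` comes out near 87/13 ≈ 6.7, which is the value passed to the classifier.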
- LazyPredict - Attempted first, but it exhausted memory on the 2.2M-row dataset because several of its candidate models grow trees to unlimited depth
- Random Forest - Tested, but delivered negligible performance gains on this classification task
- XGBoost (Extreme Gradient Boosting) - Selected as optimal model achieving 72.78% ROC AUC
- Native class imbalance handling eliminates need for SMOTE
- Efficient memory usage for large datasets
- Superior performance on tabular data with mixed feature types
- ROC AUC Score: 72.78%
- Accuracy: 64.15%
- Dataset Size: 2.2M records processed
- Feature Reduction: 65% reduction from 150 to ~53 features
- Class Distribution: 87% good loans, 13% defaults (handled via scale_pos_weight)
The results aligned well with real-world predictors of default risk:
🎯 Strong Predictive Features
- Payment Method - Direct pay demonstrates lower default risk
- Verification Status - Income verification strongly correlates with repayment
- Account Activity - Number of accounts opened in last 24 months indicates credit behavior
📊 Moderate Predictive Features
- Interest Rate - Reflects borrower risk but limited direct predictive power
- Debt-to-Income Ratio - Ranked ~50th percentile in feature importance, less impactful than anticipated
- Credit History Length - Moderate influence on default probability
- SMOTE Implementation - Apply synthetic oversampling techniques for better class balance handling
- Hyperparameter Tuning - Grid search optimization for XGBoost parameters
- Ensemble Methods - Explore stacking and voting classifiers for performance gains
- Feature Engineering - Create interaction terms and polynomial features
- Model Deployment - API development for real-time predictions
- Feature Store - Centralized feature management and versioning
- MLOps Pipeline - Automated model training, validation, and deployment
- Explainability - SHAP values and LIME for model interpretability
This project includes a comprehensive test suite to validate model performance, data quality, and edge cases.
Create and activate a virtual environment, then install the test dependencies:

```bash
# Create virtual environment
python3 -m venv venv

# Activate environment (Linux/Mac)
source venv/bin/activate

# Or use the helper script
source activate_env.sh

# Install dependencies
pip install -r requirements-test.txt
```

Run the tests:

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test categories
python -m pytest tests/test_model_performance.py -v  # Model performance tests
python -m pytest tests/test_data_validation.py -v    # Data validation tests
python -m pytest tests/test_edge_cases.py -v         # Edge case tests

# Run with coverage report
python -m pytest tests/ --cov=tests --cov-report=html

# Run tests and generate detailed output
python -m pytest tests/ -v --tb=long
```

Model Performance Tests (12 tests)
- Accuracy, precision, recall, F1-score validation
- ROC AUC score validation (≥0.5 threshold)
- Prediction probability ranges and consistency
- Feature importance validation
- Class imbalance handling
Data Validation Tests (18 tests)
- Schema validation (required columns, data types)
- Value range validation (positive amounts, valid rates)
- Missing value thresholds (<50%)
- Duplicate detection (<5%)
- Outlier detection and correlation analysis
- Class distribution validation
Edge Case Tests (16 tests)
- Empty data handling
- Extreme value inputs (zeros, ones, very large/small numbers)
- Single row predictions and boundary conditions
- High-dimensional data and class imbalance scenarios
- Mixed data types and numerical precision edge cases
All 44 tests pass, providing confidence in:
- Model reliability and performance thresholds
- Data quality and preprocessing pipeline
- Robustness against edge cases and unexpected inputs
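A representative (hypothetical) test in the style of this suite, checking that predicted probabilities are well-formed; the tiny logistic-regression stand-in is an assumption, since the real suite exercises the trained XGBoost model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def _toy_model():
    # Tiny stand-in model; the real suite loads the trained XGBoost model
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (rng.random(200) < 0.13).astype(int)
    return LogisticRegression().fit(X, y), X

def test_prediction_probabilities_in_range():
    model, X = _toy_model()
    proba = model.predict_proba(X)
    # Each class probability lies in [0, 1]
    assert np.all((proba >= 0.0) & (proba <= 1.0))
    # Probabilities across classes sum to 1 for every row
    assert np.allclose(proba.sum(axis=1), 1.0)
```

Because the function follows pytest's `test_` naming convention, it is picked up automatically by `python -m pytest tests/ -v`.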
```bash
deactivate
```

This project provided valuable insights into handling large datasets and reinforced the importance of method selection and preprocessing in machine learning pipelines.