
This report explores the application of machine learning models in loan risk analysis, highlighting their potential to transform the lending industry.


lamthienphuc/Loan-Risk-Analysis-and-Prediction-Using-Machine-Learning-Models


Loan-Risk-Analysis-and-Prediction-Using-Machine-Learning-Models

Data Collection and Preprocessing

The dataset was taken from Kaggle, a platform for exploring, analyzing, and sharing data. Dataset name: “Loan_data”, publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile shows a high probability of paying you back. We use lending data from 2007-2010 to classify and predict whether or not the borrower paid back their loan in full.

Number of Entries: 9578

Number of Features: 14

Feature Details

Data Cleaning

There are no missing values in the dataset.

Inspect the 'purpose' Column

['debt_consolidation' 'credit_card' 'all_other' 'home_improvement' 'small_business' 'major_purchase' 'educational']

Printing the unique values in the 'purpose' column confirms that it contains only valid values and shows which categories are present.

Clean the 'purpose' Column

If a 'purpose' value contains several space-separated words (e.g., "debt consolidation"), the cleaning step splits the value on spaces and keeps only the first part. This is done using the apply function with a lambda function.

One-Hot Encoding for the 'purpose' Column

Each category in 'purpose' becomes its own indicator column.

Concatenate Encoded Features with the Original Dataset

The encoded columns are then concatenated with the original dataset.
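The cleaning and encoding steps above can be sketched with pandas as follows. This is a minimal illustration on a four-row stand-in DataFrame, not the report's actual code; the `fico` column and the `purpose_` prefix are assumptions made for the example.

```python
import pandas as pd

# Tiny stand-in for the loan data (the real dataset has 9578 rows).
df = pd.DataFrame({
    "purpose": ["debt_consolidation", "credit_card", "all_other", "credit_card"],
    "fico": [737, 707, 682, 712],  # assumed column, for illustration only
})

# Inspect the categories present in 'purpose'.
print(df["purpose"].unique())

# Clean: split on spaces and keep only the first part of each value.
df["purpose"] = df["purpose"].apply(lambda s: s.split(" ")[0])

# One-hot encode: one 0/1 indicator column per category.
purpose_encoded = pd.get_dummies(df["purpose"], prefix="purpose", dtype=int)

# Concatenate the encoded columns with the rest of the dataset,
# dropping the original text column.
df_encoded = pd.concat([df.drop(columns="purpose"), purpose_encoded], axis=1)
print(df_encoded.columns.tolist())
```

Because the category names here contain underscores rather than spaces, the cleaning step leaves them unchanged.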

Feature Engineering

Income to Debt Ratio

Formula: Income-to-Debt Ratio (ID Ratio) = log(Annual Income) / (Debt-to-Income Ratio (DTI) + 1)

This creates a new feature called income_to_debt_ratio by dividing the logarithm of the annual income by the debt-to-income ratio, plus 1 to avoid division by zero. The ratio helps in understanding the borrower's ability to manage debt relative to their income, which is crucial for risk assessment.
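As a rough sketch, the formula can be applied column-wise with pandas. The column names `log.annual.inc` and `dti` are assumptions about how the dataset stores the log of annual income and the debt-to-income ratio; the two rows are made-up values.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumed, not confirmed by the report.
df = pd.DataFrame({
    "log.annual.inc": [np.log(85000), np.log(65000)],
    "dti": [19.48, 0.0],
})

# log(Annual Income) / (DTI + 1); the +1 keeps the denominator
# non-zero even when DTI is exactly 0.
df["income_to_debt_ratio"] = df["log.annual.inc"] / (df["dti"] + 1)
print(df)
```

For the second row, DTI is 0, so the ratio equals the log income itself rather than being undefined.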

FICO Score Binning

This bins the FICO scores into categorical ranges: 'Poor', 'Fair', 'Good', and 'Excellent'. Binning simplifies the continuous FICO score into discrete categories, making it easier to interpret and analyze the risk associated with different credit score ranges.
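One way to do this kind of binning is `pandas.cut`. The report does not state its cut points, so the bin edges below are hypothetical and chosen only to make the example concrete.

```python
import pandas as pd

fico = pd.Series([612, 660, 700, 750, 827], name="fico")

# Hypothetical bin edges: the report does not say which ones it used.
bins = [300, 650, 700, 750, 850]
labels = ["Poor", "Fair", "Good", "Excellent"]

# right=False makes each bin include its lower edge: [300, 650), [650, 700), ...
fico_category = pd.cut(fico, bins=bins, labels=labels, right=False)
print(pd.concat([fico, fico_category.rename("fico_category")], axis=1))
```

The result is an ordered categorical column, which can be counted directly or one-hot encoded like 'purpose'.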

Exploratory Data Analysis (EDA)

The descriptive statistics summarize the dataset, highlighting the central tendency and variation of the key numerical features.

Data Visualizations

Insights

  • Debt Consolidation Dominance: The high count for debt consolidation loans suggests that many borrowers are looking to manage their existing debts more effectively.
  • Credit Card Debt: The significant number of loans for credit card debt indicates that credit card debt is a common financial issue among borrowers.
  • Diverse Purposes: The substantial count in the "all_other" category shows that borrowers take loans for various other reasons that are not specifically categorized.

  • Common Interest Rates: The most frequent interest rates are around 12%, which could be considered the typical interest rate for loans in this dataset.
  • Loan Affordability: The concentration of loans with interest rates between 10% and 14% suggests that these rates are common and possibly more affordable for borrowers.
  • High-Interest Loans: There are fewer loans with very high interest rates (above 16%), indicating that such loans are less common or less preferred by borrowers.

Understanding this distribution is crucial for financial institutions to set competitive interest rates and for borrowers to understand the typical rates they might encounter.

This bar chart provides a clear overview of the distribution of FICO scores among borrowers. It highlights that most borrowers have FICO scores in the "Good" and "Fair" categories, indicating relatively strong credit histories.


Positive Correlations: Features with a positive correlation to "not.fully.paid" might indicate factors associated with an increased risk of loan default. For example, a high credit utilization ratio (revol.util) could be positively correlated with "not.fully.paid," suggesting borrowers with higher credit card balances are more likely to default.

Negative Correlations: Features with a negative correlation to "not.fully.paid" might indicate factors that mitigate default risk. For instance, a high FICO score (fico) could be negatively correlated with "not.fully.paid," implying borrowers with strong creditworthiness are less likely to default.
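A correlation ranking of this kind can be computed with `DataFrame.corr`. The sketch below uses synthetic data in which lower FICO scores are deliberately made more likely to default, so the sign of the correlation is built in by construction; it illustrates the mechanics only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features (arbitrary ranges, for illustration only).
fico = rng.integers(612, 828, size=n)
revol_util = rng.uniform(0, 100, size=n)

# Synthetic target: default probability falls as FICO rises.
p_default = 1 / (1 + np.exp((fico - 680) / 30))
not_fully_paid = (rng.random(n) < p_default).astype(int)

df = pd.DataFrame({
    "fico": fico,
    "revol.util": revol_util,
    "not.fully.paid": not_fully_paid,
})

# Pearson correlation of every feature with the target, most negative first.
corr_with_target = (
    df.corr()["not.fully.paid"].drop("not.fully.paid").sort_values()
)
print(corr_with_target)
```

Pearson correlation only captures linear association with a 0/1 target, so a value near zero does not rule out a nonlinear relationship.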

Model Development

Machine Learning Algorithms

  • Random Forest: Concept: This is an ensemble learning method that combines multiple decision trees to create a more robust and accurate predictor. How it Works: Random forests train individual decision trees on random subsets of data and features. Each tree predicts an outcome, and the final prediction is based on the majority vote (classification) or average (regression) of all the individual trees.
  • K-Nearest Neighbors (KNN): Concept: This is a non-parametric, lazy learning algorithm. It classifies new data points based on the labels of their k nearest neighbors in the training data. How it Works: KNN identifies the k data points in the training set that are closest to the new data point based on a chosen distance metric (e.g., Euclidean distance). The new data point is assigned the majority class label (classification) or average value (regression) of its k neighbors.
  • Gradient Boosting: Concept: This is an ensemble learning method that combines multiple weak learners (typically decision trees) in a sequential way. Each subsequent learner focuses on correcting the errors made by the previous ones. How it Works: Gradient boosting trains a sequence of models, where each model learns from the residuals (errors) of the previous model. This sequential approach allows the model to progressively improve its predictions.
  • Logistic Regression: Concept: This is a statistical method used for binary classification problems. It estimates the probability of a data point belonging to a specific class (e.g., loan default or not default) by fitting a sigmoid function to the data. How it Works: Logistic regression models the relationship between independent variables (features) and the dependent variable (binary outcome) using a logistic function. The model outputs a probability value between 0 (not likely) and 1 (highly likely) for the data point belonging to the positive class.
  • Naive Bayes: Concept: This is a probabilistic classification method based on Bayes' theorem. It assumes that the features are independent of one another given the class and predicts the class with the highest probability under this assumption. How it Works: Naive Bayes calculates the probability of each class given the features of a new data point, using Bayes' theorem, and selects the most probable class. Strengths: Simple to implement and computationally efficient; works well for certain types of problems with categorical features.
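The five algorithms above all have scikit-learn implementations with the same `fit` / `predict` interface. The sketch below fits each on synthetic data with default or near-default settings; the hyperparameters and the use of `GaussianNB` as the Naive Bayes variant are assumptions, since the report does not list its configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the loan features and the binary target.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumed settings; the report's actual hyperparameters are not given.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Every estimator exposes the same fit/predict API.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
for name, model in fitted.items():
    print(name, model.predict(X_test[:5]))
```

KNN and logistic regression are sensitive to feature scale, so in practice they are often wrapped in a pipeline with a scaler; that step is omitted here for brevity.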

Model Selection

Model Rankings based on ROC-AUC:

  1. Random Forest: 0.9978935799981683
  2. K-Nearest Neighbors: 0.9732413428173112
  3. Gradient Boosting: 0.816063742100925
  4. Logistic Regression: 0.6874703625688149
  5. Naive Bayes: 0.6323147215353461

Model Rankings based on accuracy:

  1. Random Forest: 0.9796450939457203
  2. Gradient Boosting: 0.8538622129436325
  3. K-Nearest Neighbors: 0.7693110647181628
  4. Logistic Regression: 0.6450939457202505
  5. Naive Bayes: 0.5715031315240083

Based on this comparison, the top three models (Random Forest, Gradient Boosting, and K-Nearest Neighbors) are selected for further analysis and forecasting.
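A ranking like the one above can be produced by scoring each model on a held-out test split. The sketch below is one possible way to do it, on synthetic data and with a hypothetical helper named `rank_models`; it does not reproduce the report's numbers or its exact evaluation protocol.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def rank_models(models, X_train, y_train, X_test, y_test):
    """Fit each model; return (name, roc_auc, accuracy) sorted by ROC-AUC."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        # ROC-AUC needs scores for the positive class, not hard labels.
        proba = model.predict_proba(X_test)[:, 1]
        rows.append((
            name,
            roc_auc_score(y_test, proba),
            accuracy_score(y_test, model.predict(X_test)),
        ))
    return sorted(rows, key=lambda row: row[1], reverse=True)


X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

ranking = rank_models(models, X_train, y_train, X_test, y_test)
for name, auc, acc in ranking:
    print(f"{name}: ROC-AUC={auc:.3f}, accuracy={acc:.3f}")
```

Scoring on data the model was trained on would inflate both metrics, which is why the helper takes separate train and test sets.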

Handling class imbalance


Class Weight Adjustment

Class weight adjustment is a technique used to handle class imbalance by assigning different weights to classes during the training of a machine learning model. This method helps the model pay more attention to the minority class by penalizing misclassifications of minority class instances more heavily than those of the majority class.
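In scikit-learn, class weighting can be sketched as below. The 80/20 class split is made up for the example. `RandomForestClassifier` accepts `class_weight` directly; `GradientBoostingClassifier` does not, so per-sample weights are passed to `fit` instead. `KNeighborsClassifier` has no class-weight option, so how the report handled imbalance for KNN is not something this sketch can show.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Made-up imbalanced data: 80 majority (0) and 20 minority (1) samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)

# "balanced" weight = n_samples / (n_classes * class_count).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)  # {0: 0.625, 1: 2.5}

# Random Forest takes the weighting as a constructor argument.
rf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Gradient Boosting takes equivalent per-sample weights at fit time.
sample_weight = compute_sample_weight(class_weight="balanced", y=y)
gb = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=sample_weight)
```

With these weights, each minority-class error costs four times as much as a majority-class error (2.5 / 0.625).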

Hyperparameter tuning

What are Hyperparameters?

Hyperparameters are the parameters of a machine learning model that are not learned from the data but are set before the training process begins. They control the behavior of the training algorithm and the structure of the model.

Why Tune Hyperparameters? Hyperparameter tuning is crucial because the performance of a machine learning model can be significantly affected by the choice of hyperparameters.

Proper tuning can lead to:

  • Improved Accuracy: Better generalization to unseen data.
  • Reduced Overfitting: Prevents the model from learning noise in the training data.
  • Optimized Training Time: Efficient use of computational resources.
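A common way to tune hyperparameters is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the parameter grid is hypothetical, since the report does not list the values it searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid; the report's actual search space is not stated.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select by cross-validated ROC-AUC
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Each of the four combinations is fitted three times (once per fold), so the cost grows quickly with the size of the grid.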

Model Evaluation

Confusion Matrix

A confusion matrix is a table used to describe the performance of a classification model. It shows the counts of true positive, true negative, false positive, and false negative predictions.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
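The four counts can be read off with scikit-learn's `confusion_matrix`. The labels below are a made-up six-sample example. Note that scikit-learn orders rows and columns by label value, so with labels `[0, 1]` the matrix is laid out as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in label order.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=2, FN=1, FP=1, TN=2
```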


ROC Curves


Key Components of the ROC Curve:

True Positive Rate (TPR):

  • Also known as Sensitivity or Recall.
  • It is plotted on the Y-axis.
  • TPR = TP / (TP + FN), where TP is True Positives and FN is False Negatives.
  • It represents the proportion of actual positives that are correctly identified by the model.

False Positive Rate (FPR):

  • It is plotted on the X-axis.
  • FPR = FP / (FP + TN), where FP is False Positives and TN is True Negatives.
  • It represents the proportion of actual negatives that are incorrectly identified as positives by the model.

Diagonal Line (Random Classifier):

  • The diagonal dashed line represents the performance of a random classifier.
  • It has an Area Under the Curve (AUC) of 0.5, indicating no discrimination ability between positive and negative classes.

Area Under the Curve (AUC):

The Area Under the ROC Curve (AUC) is a single scalar value that summarizes the performance of the classifier. AUC ranges from 0 to 1.

AUC = 0.5: The model has no discrimination ability (equivalent to random guessing).

AUC < 0.5: The model performs worse than random guessing.

AUC > 0.5: The model has some discrimination ability, with higher values indicating better performance.
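To make the TPR, FPR, and AUC definitions concrete, the sketch below computes them for four made-up scores, once by hand at a single threshold of 0.5 and once across all thresholds with scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One point on the ROC curve: classify as positive when score >= 0.5.
y_pred = (y_score >= 0.5).astype(int)
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

tpr_at_half = tp / (tp + fn)  # 1 / (1 + 1) = 0.5
fpr_at_half = fp / (fp + tn)  # 0 / (0 + 2) = 0.0

# The full curve sweeps the threshold; AUC summarizes it in one number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(tpr_at_half, fpr_at_half, auc)  # 0.5 0.0 0.75
```

The AUC of 0.75 matches its pairwise interpretation: of the four positive-negative pairs, three have the positive scored higher than the negative.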

Summary:

The ROC curve provides a visual comparison of the performance of different classifiers. The KNN model has the highest AUC (0.59), indicating it performs the best among the three models in distinguishing between positive and negative classes. The Gradient Boosting model has an AUC of 0.58, indicating it performs slightly worse than KNN but better than Random Forest. The Random Forest model has the lowest AUC (0.54), indicating it has the least discrimination ability among the three models.

Model Interpretation

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable. Understanding feature importance can help in:

  • Model Interpretation: Understanding which features are most influential in making predictions.
  • Feature Selection: Reducing the number of input features to improve model performance and reduce overfitting.
  • Insight Generation: Gaining insights into the underlying data and the relationships between features and the target variable.

Interpretation:

  • FICO Score: The FICO score is the most influential feature, suggesting that it is a strong predictor of the target variable. This makes sense as the FICO score is a widely used measure of creditworthiness.
  • Interest Rate and Installment: These financial metrics are also highly important, indicating that they play a significant role in the model's predictions.
  • Credit History and Income: Features related to credit history (e.g., days with a credit line) and income (e.g., log of annual income) are also important, reflecting their relevance in financial decision-making.
  • Loan Purpose: The purpose of the loan (e.g., debt consolidation, credit card) has varying levels of importance, with debt consolidation being more important than others like educational purposes.
  • Less Important Features: Features like the number of public records and delinquencies in the past two years have lower importance, suggesting they are less influential in the model's predictions.
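For tree ensembles in scikit-learn, impurity-based importances are available as `feature_importances_` after fitting. The sketch below ranks four placeholder features on synthetic data; the feature names are hypothetical, and whether the report used this attribute or another importance method is an assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features and 2 pure-noise features.
X_raw, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_raw, columns=["f0", "f1", "f2", "f3"])  # placeholder names

model = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.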

Summary of Predicted Results (Random Forest):

| Index | Actual | Predicted_RF | Predicted_RF_Prob |
|---|---|---|---|
| 8558 | 0 | 1 | 0.67 |
| 4629 | 0 | 0 | 0.17 |
| 1383 | 1 | 0 | 0.14 |
| 8142 | 0 | 0 | 0.18 |
| 1768 | 0 | 0 | 0.06 |

Summary of Predicted Results (K-Nearest Neighbors):

| Index | Actual | Predicted_KNN | Predicted_KNN_Prob |
|---|---|---|---|
| 8558 | 0 | 1 | 1.0 |
| 4629 | 0 | 0 | 0.0 |
| 1383 | 1 | 0 | 0.0 |
| 8142 | 0 | 0 | 0.0 |
| 1768 | 0 | 0 | 0.0 |

Summary of Predicted Results (Gradient Boosting):

| Index | Actual | Predicted_GB | Predicted_GB_Prob |
|---|---|---|---|
| 8558 | 0 | 1 | 0.687005 |
| 4629 | 0 | 0 | 0.455550 |
| 1383 | 1 | 0 | 0.276111 |
| 8142 | 0 | 0 | 0.441451 |
| 1768 | 0 | 0 | 0.250791 |

The Random Forest model has some misclassifications. For instance, it incorrectly predicts that instance 8558 will not fully pay the loan (predicted class 1 with a relatively high probability of 0.67, while the actual class is 0), and it misses instance 1383, which actually was not fully paid (probability 0.14). It correctly classifies the other three instances, assigning them low probabilities of not being fully paid.

The KNN model also has misclassifications. For instance, it incorrectly predicts that instance 8558 will not fully pay the loan with a probability of 1.0, and it assigns a probability of 0.0 to instance 1383, which actually was not fully paid. It gives only extreme probabilities (0.0 or 1.0) for these instances, which might indicate overconfidence in its predictions.

The Gradient Boosting model makes the same two misclassifications. For instance, it incorrectly predicts that instance 8558 will not fully pay the loan with a probability of 0.687005. It provides more graded probabilities than KNN, which suggests less extreme, and possibly better calibrated, predictions.
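One possible explanation for KNN's all-or-nothing probabilities is a very small k: with uniform weights, `predict_proba` in scikit-learn returns the fraction of the k neighbors in each class, so k = 1 can only produce 0.0 or 1.0. The report does not state its k, so this is an assumption; the sketch below just shows the effect on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, y_train, X_new = X[:250], y[:250], X[250:]

proba_by_k = {}
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Probability of class 1 = share of the k nearest neighbors labeled 1.
    proba_by_k[k] = knn.predict_proba(X_new)[:, 1]
    print(k, np.unique(proba_by_k[k]))
```

With k = 5 the probabilities can only take multiples of 0.2, which is still coarse compared with the continuous scores a boosted model produces.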

Predictions and Future work


FICO score and interest rate are the most important features, so the three predictive models (KNN, Gradient Boosting, and Random Forest) are compared on borrower profiles that vary these two values. The models give different outcomes for the same profile, showcasing their distinct decision-making mechanisms. Here is a breakdown of their predictions and some potential conclusions:

Predictions Summary

| FICO Score | Interest Rate | KNN | Gradient Boosting | Random Forest |
|---|---|---|---|---|
| 700 | 0.125 | Will fully pay the loan | Will not fully pay the loan | Will fully pay the loan |
| 600 | 0.2 | Will fully pay the loan | Will not fully pay the loan | Will not fully pay the loan |
| 800 | 0.3 | Will fully pay the loan | Will not fully pay the loan | Will fully pay the loan |
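Scoring hand-built borrower profiles like these can be sketched as follows. The model here is trained on synthetic data with only two features, so its outputs are not comparable to the report's; the column names `fico` and `int.rate`, the value ranges, and the label rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic training data (arbitrary ranges, assumed column names).
train = pd.DataFrame({
    "fico": rng.integers(612, 828, size=n),
    "int.rate": rng.uniform(0.06, 0.22, size=n),
})

# Synthetic label rule: 1 = not fully paid; risk rises with the
# interest rate and falls with the FICO score.
risk = 1 / (1 + np.exp((train["fico"] - 680) / 30 - (train["int.rate"] - 0.12) * 20))
y = (rng.random(n) < risk).astype(int)

model = RandomForestClassifier(random_state=0).fit(train, y)

# The three borrower profiles from the predictions summary.
features = ["fico", "int.rate"]
profiles = pd.DataFrame({"fico": [700, 600, 800], "int.rate": [0.125, 0.2, 0.3]})
profiles["predicted_not_fully_paid"] = model.predict(profiles[features])
profiles["probability"] = model.predict_proba(profiles[features])[:, 1]
print(profiles)
```

Profiles outside the range seen in training (for example an interest rate of 0.3 here) force the model to extrapolate, so such predictions deserve extra caution.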

Model Differences: KNN consistently predicts that borrowers will fully pay the loan across all profiles. Gradient Boosting consistently predicts that borrowers will not fully pay the loan across all profiles. Random Forest predictions are similar to KNN for profiles with higher FICO scores but align with Gradient Boosting for lower FICO scores and higher interest rates.

FICO Score Impact: Higher FICO scores generally indicate a better credit history and a higher likelihood of loan repayment. Gradient Boosting seems more conservative, predicting non-repayment even for higher FICO scores, possibly due to a higher sensitivity to interest rates or other factors.

Interest Rate Impact: Higher interest rates are generally associated with higher risk, which can affect the likelihood of loan repayment. KNN and Random Forest might weigh FICO scores more heavily than interest rates, leading to more optimistic predictions for high-FICO-score borrowers despite high interest rates. Gradient Boosting might consider the interest rate more heavily, leading to a pessimistic prediction even for high-FICO-score borrowers with high interest rates.

In summary, while KNN and Random Forest are more optimistic about loan repayment, Gradient Boosting is more conservative. The differences highlight the importance of model selection and the need for potentially using ensemble methods or further model refinement for consistent and reliable predictions.
