Create a risk assessment tool to help LendingClub, an online P2P loan provider, understand whether an applicant is likely to pay a loan back or fall into default.
This repository contains the underlying code and data sources I used to build a machine learning model (binary classifier).
Click here to access the presentation.
To process the data and create the model I worked with the Python programming language.
This is my final project for CodeClan's Data Analysis course.
- import libraries
- data import and exploration (summary statistics, visualizations) - see the sketch below
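A minimal sketch of the import-and-explore step. The file name is illustrative; `loan_status` is a column in the LendingClub data, but treat all names here as assumptions:

```python
import pandas as pd
import seaborn as sns

loans = pd.read_csv("loans.csv")  # illustrative file name

# summary statistics and a missingness overview
print(loans.shape)
print(loans.describe())
print(loans.isna().mean().sort_values(ascending=False).head(10))

# quick visualization, e.g. the class balance of the raw loan status
sns.countplot(x="loan_status", data=loans)
```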
- data cleaning (see the sketch after this list):
- drop features (columns) with more than 50% of values missing
- drop features that only get populated once a loan has been granted (the model only takes inputs available for a new applicant)
- drop features containing just one constant value
- drop high-cardinality categorical features (= too many levels) that are not needed for machine learning and not worth further manipulation (e.g. "binning")
- drop observations (rows) with missing values
- apply "regex" operations to Python object values (e.g. strip the '%' symbol)
- transform data types where needed (e.g. str -> float)
- drop observations related to current loans (the model only takes inputs available for a new applicant)
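A minimal pandas sketch of these cleaning steps; `int_rate` appears in the LendingClub data, but all names here are illustrative:

```python
import pandas as pd

loans = pd.read_csv("loans.csv")  # illustrative file name

# drop columns with more than 50% of values missing
loans = loans.loc[:, loans.isna().mean() <= 0.5]

# drop constant columns (a single distinct value)
loans = loans.loc[:, loans.nunique(dropna=False) > 1]

# "regex"/string cleaning: strip the '%' symbol, then cast str -> float
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype(float)

# the model scores new applicants, so drop current (ongoing) loans
loans = loans[loans["loan_status"] != "Current"]

# drop remaining rows with missing values
loans = loans.dropna()
```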
- feature engineering (see the sketch after this list):
- set the target variable "loan_status" as binary (0 = default, 1 = paid)
- "bin" categorical features - lower the cardinality by binning values into e.g. quartile intervals (this results in just 4 levels and data less affected by outliers)
- apply a Box-Cox transformation to address skewness (make the variables' distributions more normal)
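A sketch of these transformations, assuming `loans` is the cleaned DataFrame; the column names (`annual_inc`, `loan_amnt`) appear in the LendingClub data but are used here purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# binary target: 1 = fully paid, 0 = charged off / default
loans["loan_status"] = (loans["loan_status"] == "Fully Paid").astype(int)

# quartile binning: just 4 levels, less affected by outliers
loans["annual_inc_bin"] = pd.qcut(loans["annual_inc"], q=4,
                                  labels=["q1", "q2", "q3", "q4"])

# Box-Cox transform to reduce skewness (requires strictly positive values)
pt = PowerTransformer(method="box-cox")
loans["loan_amnt_bc"] = pt.fit_transform(loans[["loan_amnt"]]).ravel()
```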
- feature selection (see the sketch after this list):
- train/test split first, so that no information from the test set leaks into feature selection (overfitting!)
- then perform further feature reduction on the train and test datasets separately:
- based on correlation (between features, and between features and the target variable)
- based on variation
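A sketch under stated assumptions: `loans` holds the engineered features plus the binary target, and the thresholds (0.9 correlation, near-zero variance) are illustrative. One common approach, shown here, computes the selection criteria on the train set and applies the resulting column list to both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = loans.drop(columns="loan_status")
y = loans["loan_status"]

# split first, so the test data never influences feature selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# correlation-based reduction: drop one of each highly correlated pair
num_cols = X_train.select_dtypes("number").columns
corr = X_train[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [c for c in upper.columns if (upper[c] > 0.9).any()]

# variation-based reduction: drop near-constant numeric features
low_var = [c for c in num_cols if X_train[c].var() < 1e-8]

drop_cols = sorted(set(high_corr + low_var))
X_train = X_train.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)
```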
- feature dummying (see the sketch after this list)
- dummy the categorical variables (the models can only handle numerical features)
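A short sketch with `pd.get_dummies`; re-aligning the test columns to the train columns guards against category levels that appear in only one of the splits:

```python
import pandas as pd

# one-hot encode categorical features
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

# make sure both splits end up with the same dummy columns
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```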
- Logistic regression
- Gaussian Naive Bayes
- Random Forest
- target variable: loan_status (binary)
- feature matrix (explanatory variables): all features engineered and selected in the pre-modeling steps
- null model as a baseline for cross-validation
- build the models at the default classification threshold of 0.5 (Logistic regression, Gaussian Naive Bayes, Decision Tree, Random Forest)
- assess model performance (see the sketch after this list):
- calculate metrics: accuracy, precision, sensitivity (= recall), specificity, AUC
- visualize the ROC curve
- plot sensitivity and specificity against the classification threshold and find the optimal threshold
- compare the predictive power of the adjusted models (TP and TN counts on the confusion matrices) and choose the best model
- list the main features that have a direct and clear impact on whether an applicant is likely to pay the loan back or fall into default
- showcase the model on a random sample of loan applicants
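A sketch of the fit-and-evaluate loop for one of the four classifiers (the others follow the same `fit`/`predict_proba` pattern); all variable names are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predicted probability of class 1 (= paid)
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)  # default classification threshold

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("accuracy:   ", accuracy_score(y_test, pred))
print("precision:  ", precision_score(y_test, pred))
print("sensitivity:", recall_score(y_test, pred))  # = recall
print("specificity:", tn / (tn + fp))
print("AUC:        ", roc_auc_score(y_test, proba))

# ROC curve
fpr, tpr, _ = roc_curve(y_test, proba)
plt.plot(fpr, tpr)
plt.xlabel("false positive rate")
plt.ylabel("true positive rate (sensitivity)")
plt.show()

# sensitivity and specificity as the threshold moves
thresholds = np.linspace(0.01, 0.99, 99)
sens = [recall_score(y_test, (proba >= t).astype(int)) for t in thresholds]
spec = [recall_score(y_test, (proba >= t).astype(int), pos_label=0)
        for t in thresholds]
plt.plot(thresholds, sens, label="sensitivity")
plt.plot(thresholds, spec, label="specificity")
plt.xlabel("classification threshold")
plt.legend()
plt.show()
```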
Python libraries fundamental to the project - used to read in, clean, transform, analyse and visualise data, and to train the models and evaluate their performance:
numpy, pandas, matplotlib.pyplot, seaborn, sklearn
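For reference, the conventional import aliases these libraries are normally given (and that the sketches above assume):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn  # usually imported per-module, e.g. sklearn.metrics
```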