**Motivation for the Session**
- Solve a business problem
- Understand the end-to-end approach
- Build a data-driven Machine Learning application on the cloud
**Our approach** is to work through a single case-driven example that covers every step. We aim to go wide rather than deep, while keeping the approach both practical and scalable.
Let's start by understanding the overall approach.
    FRAME ——> ACQUIRE ——> REFINE ——>
                                     \
                                      TRANSFORM <——
                                          ↑        ↘
                                          |         EXPLORE
                                          ↓        ↗
                                        MODEL   <——
                                       /
    BUILD <—— DEPLOY <—— INSIGHT <——
- FRAME: Problem definition
- ACQUIRE: Data ingestion
- REFINE: Data wrangling
- TRANSFORM: Feature creation
- EXPLORE: Feature selection
- MODEL: Model creation
- INSIGHT: Model selection
- DEPLOY: Model deployment
- BUILD: Application building
"Doing data science requires quite a bit of thinking and we believe that when you’ve completed a good data science analysis, you’ve spent more time thinking than doing." - Roger Peng
A start-up has been providing consumer loans for the last few years. It now plans to adopt a data-driven lens on its loan portfolio. What types of questions can it ask?
- What is the trend of loan defaults?
- Do older customers have more loan defaults?
- Which customer is likely to have a loan default?
- Why do customers default on their loan?
Each of these questions is of a different type:
- Descriptive: Understand patterns, trends, deviations and outliers
- Inquisitive: Conduct hypothesis testing
- Predictive: Make a prediction
- Causal: Establish a causal link
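To make the first two question types concrete, here is a minimal sketch on a hypothetical toy table (the values and the 40-year cut-off are made up for illustration, not taken from `loan.csv`). The descriptive question becomes an overall default rate, and the inquisitive one becomes a comparison between age bands.

```python
import pandas as pd

# Hypothetical toy data; the workshop's real data lives in loan.csv
loans = pd.DataFrame({
    "age": [24, 28, 36, 24, 28, 52, 61, 45],
    "default": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Descriptive: what share of loans defaulted overall?
overall_rate = loans["default"].mean()

# Inquisitive: do older customers default more often?
# The 40-year threshold is an arbitrary assumption for this sketch.
loans["age_band"] = loans["age"].ge(40).map({True: "40+", False: "under 40"})
rate_by_band = loans.groupby("age_band")["default"].mean()

print(overall_rate)
print(rate_by_band)
```

A difference between the two bands in a sample this small proves nothing on its own; on real data you would follow it with a proper hypothesis test.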
Our Question: What is the probability of a loan default?
"Data is the new oil"
Ways to acquire data (typical data sources)
- Downloaded from an internal system
- Obtained from a client or other third party
- Extracted from a web-based API
- Scraped from a website
- Extracted from a PDF file
- Gathered manually and recorded
Data Formats
- Flat files (e.g. csv)
- Excel files
- Database (e.g. MySQL)
- JSON
- HDFS (Hadoop)
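Most of these formats end up in the same place: a pandas DataFrame. The sketch below reads the same two hypothetical records from CSV text and from JSON text. In-memory strings stand in for real files here, so the snippet runs without any data on disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample records, mirroring a few columns of the loan data
csv_text = "default,amount,grade\n0,1000,B\n1,6500,A\n"
json_text = (
    '[{"default": 0, "amount": 1000, "grade": "B"},'
    ' {"default": 1, "amount": 6500, "grade": "A"}]'
)

# Flat file: with a real file you would pass a path instead of StringIO
csv_df = pd.read_csv(StringIO(csv_text))

# JSON: a list of records becomes one row per record
json_df = pd.read_json(StringIO(json_text))

print(csv_df)
print(json_df)
```

Excel files and SQL databases follow the same pattern through `pd.read_excel` and `pd.read_sql`.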
# Load the libraries and configuration
import numpy as np
import pandas as pd
from plotnine import *
%matplotlib inline

# Load the data and drop rows with missing values
df = pd.read_csv("loan.csv")
df.dropna(axis=0, inplace=True)

# Log-transform the skewed numeric features
df['log_age'] = np.log(df.age)
df['log_income'] = np.log(df.income)
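The REFINE and TRANSFORM steps in the cell above are easy to check on a tiny example. This sketch uses a hypothetical four-row table (the two complete rows reuse values shown in the notebook's data preview) to show what `dropna` removes and what the log transform produces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two incomplete rows
toy = pd.DataFrame({
    "age": [24, 28, np.nan, 36],
    "income": [19200.0, 66000.0, 60000.0, np.nan],
})

# REFINE: drop every row that has at least one missing value
clean = toy.dropna(axis=0).copy()

# TRANSFORM: natural log compresses the long right tail of income
clean["log_age"] = np.log(clean["age"])
clean["log_income"] = np.log(clean["income"])

print(clean)
```

Only the first two rows survive, since each of the others has a missing value in one column.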
# Plot the correlation matrix of the numeric columns
import matplotlib.pyplot as plt
plt.matshow(df.select_dtypes('number').corr())
df.head()
| default | amount | grade | years | ownership | income | age | log_age | log_income |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1000 | B | 2.0 | RENT | 19200.0 | 24 | 3.178054 | 9.862666 |
1 | 1 | 6500 | A | 2.0 | MORTGAGE | 66000.0 | 28 | 3.332205 | 11.097410 |
2 | 0 | 2400 | A | 2.0 | RENT | 60000.0 | 36 | 3.583519 | 11.002100 |
3 | 0 | 10000 | C | 3.0 | RENT | 62000.0 | 24 | 3.178054 | 11.034890 |
4 | 1 | 4000 | C | 2.0 | RENT | 20000.0 | 28 | 3.332205 | 9.903488 |
df['default'] = df['default'].astype('category')
ggplot(df) + aes('grade', fill ="default") + geom_bar(position = 'fill')
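A bar chart with `position = 'fill'` stretches every bar to the same height, so it shows the share of defaults within each grade rather than raw counts. The same numbers can be tabulated with `pd.crosstab`. The sketch below uses only the five rows visible in the data preview, so the shares are illustrative, not the real ones.

```python
import pandas as pd

# The five rows shown by df.head(); the full data set will differ
sample = pd.DataFrame({
    "grade": ["B", "A", "A", "C", "C"],
    "default": [0, 1, 0, 0, 1],
})

# normalize='index' makes each grade's row sum to 1,
# which matches what a filled bar chart displays
share = pd.crosstab(sample["grade"], sample["default"], normalize="index")

print(share)
```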
ggplot(df) + aes('grade', 'ownership', fill ="default") + geom_jitter(alpha = 0.2)
ggplot(df) + aes('ownership', fill ="default") + geom_bar(position = 'fill')
(
    ggplot(df) +
    aes('years', '..count..', color = 'default') +
    # one bin per year; set explicitly instead of relying on the default bin count
    geom_freqpoly(binwidth = 1)
)
(
ggplot(df) +
aes('amount', '..count..', color = 'default') +
geom_freqpoly(binwidth = 0.05) +
scale_x_log10()
)
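When `binwidth` is combined with `scale_x_log10()`, the bins are (as in ggplot2, and assuming plotnine follows the same convention) 0.05 wide in log10 units, not in currency units. A rough NumPy equivalent of that binning, using the five loan amounts from the data preview as stand-in data:

```python
import numpy as np

# Loan amounts from the five rows shown by df.head()
amounts = np.array([1000, 6500, 2400, 10000, 4000])

# Bin on the log10 scale, mirroring scale_x_log10() + binwidth = 0.05
log_amounts = np.log10(amounts)
binwidth = 0.05
lo, hi = log_amounts.min(), log_amounts.max()

# One extra bin guarantees that the largest value falls inside the last edge
n_bins = int(np.ceil((hi - lo) / binwidth)) + 1
edges = lo + binwidth * np.arange(n_bins + 1)

counts, edges = np.histogram(log_amounts, bins=edges)
print(counts)
```

Each bin therefore spans a constant ratio (a factor of 10 ** 0.05, about 12%) rather than a constant difference.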
(
ggplot(df) +
aes('income', '..count..', color = 'default') +
geom_freqpoly(binwidth = 0.05) +
scale_x_log10()
)
(
ggplot(df) +
aes('grade', 'income', color = 'default') +
geom_jitter(alpha = 0.2) + geom_boxplot() +
scale_y_log10() +
facet_wrap('default')
)
# grade is categorical, so count rows per category with a bar chart rather than binning
ggplot(df) + aes('grade') + geom_bar()
ggplot(df) + aes('log_age', 'log_income') + geom_bin2d() + facet_wrap('default')
from sklearn import tree
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from firefly.client import Client
# Features: age and income; target: default
X = df[['age', 'income']]
y = df['default']
clf = tree.DecisionTreeClassifier(max_depth=10).fit(X, y)
joblib.dump(clf, "clf.pkl")
['clf.pkl']
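The model above is fit on all of the data with a fixed `max_depth`. For the INSIGHT step (model selection), one common option is to hold out part of the data and compare a few depths. This is a sketch under stated assumptions: the data is synthetic, and the depths, split size and income-based default rule are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (age, income) -> default; not the workshop data
rng = np.random.default_rng(0)
n = 500
age = rng.integers(20, 65, size=n)
income = rng.lognormal(mean=11, sigma=0.5, size=n)
# Assumed rule: lower income means a higher chance of default
p_default = 1 / (1 + np.exp(2 * (np.log(income) - 11)))
y = (rng.random(n) < p_default).astype(int)
X = np.column_stack([age, income])

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Compare held-out accuracy across a few tree depths
scores = {}
for depth in (2, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = model.score(X_test, y_test)

print(scores)
```

Deep trees tend to memorise the training rows, so a held-out comparison like this is a quick way to see whether `max_depth=10` is actually helping.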
%%file simple.py
import joblib

# Load the pickled decision tree once, at import time
model = joblib.load("clf.pkl")

def predict(age, income):
    # The model was trained on the features (age, income), in that order
    features = [age, income]
    prob0, prob1 = model.predict_proba([features])[0]
    return prob1
Run the following commands in your terminal:
cd credit-risk/notebooks/
firefly simple.predict
simple = Client("http://127.0.0.1:8000")
simple.predict(age=28, income=10000)
0.5373423860329777
simple.predict(age=50, income=240000)
1.0
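The `predict` function in `simple.py` unpacks `predict_proba` into two values, which relies on the classes being ordered as `[0, 1]`. The sketch below checks that ordering explicitly on a tiny tree. The training rows are the five age and income values from the data preview, used purely as an illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# (age, income) pairs and default labels from the rows shown by df.head()
X = [[24, 19200], [28, 66000], [36, 60000], [24, 62000], [28, 20000]]
y = [0, 1, 0, 0, 1]

clf = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)

# predict_proba returns one row per sample and one column per class,
# with the columns ordered as in clf.classes_
proba = clf.predict_proba([[28, 66000]])
classes = list(clf.classes_)

# Look up the column for class 1 (default) instead of assuming its position
prob_default = proba[0][classes.index(1)]
print(classes, prob_default)
```

A probability of exactly 1.0, as in the last call above, usually means the input landed in a leaf that contains only defaulted loans, which is common for a tree as deep as `max_depth=10`.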