Build & Deploy ML Models in Cloud

**Motivation for the Session**

Solve a business problem
Understand the end-to-end approach
Build a data-driven Machine Learning application on the cloud

**Our approach** is to work through a case-driven example. We aim to go wide rather than deep, keeping the approach both practical and scalable.

INTRO

Let's start by understanding the overall approach.

                FRAME  ——> ACQUIRE  ——> REFINE ——>  
                                                  \
                                                TRANSFORM <——
                                                    ↑          ↘  
                                                    |        EXPLORE
                                                    ↓          ↗
                                                  MODEL   <——
                                                  /      
                BUILD <—— DEPLOY <—— INSIGHT <——

FRAME: Problem definition
ACQUIRE: Data ingestion
REFINE: Data wrangling
TRANSFORM: Feature creation
EXPLORE: Feature selection
MODEL: Model creation
INSIGHT: Model selection
DEPLOY: Model deployment
BUILD: Application building

FRAME

"Doing data science requires quite a bit of thinking and we believe that when you’ve completed a good data science analysis, you’ve spent more time thinking than doing." - Roger Peng

A start-up has been providing consumer loans for the last few years. It now plans to take a data-driven view of its loan portfolio. What types of questions can it ask?

What is the trend of loan defaults?
Do older customers have more loan defaults?
Which customer is likely to have a loan default?
Why do customers default on their loan?

Types of data-driven analytics

Descriptive: Understand patterns, trends, deviations and outliers
Inquisitive: Conduct hypothesis testing
Predictive: Make a prediction
Causal: Establish a causal link

Our Question: What is the probability of a loan default?

Acquire

"Data is the new oil"

Ways to acquire data (typical data sources)

Download from an internal system
Obtained from a client or other third party
Extracted from a web-based API
Scraped from a website
Extracted from a PDF file
Gathered manually and recorded

Data Formats

Flat files (e.g. csv)
Excel files
Database (e.g. MySQL)
JSON
HDFS (Hadoop)
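
As a rough sketch of how a couple of these formats can be read into pandas, the snippet below loads the same toy records from CSV and JSON text held in memory. The records are made up for illustration and are not the contents of `loan.csv`. `pd.read_excel` and `pd.read_sql` follow the same pattern but need a file or a database connection, so they are left out here.

```python
import io

import pandas as pd

# Toy records, invented for illustration only
csv_text = "default,amount,age\n0,1000,24\n1,6500,28\n"
json_text = (
    '[{"default": 0, "amount": 1000, "age": 24},'
    ' {"default": 1, "amount": 6500, "age": 28}]'
)

# Both readers accept a file path or a file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Whatever the source format, the result is a DataFrame, so the later steps do not need to know where the data came from.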

# Load the libraries and configuration
import numpy as np
import pandas as pd

df = pd.read_csv("loan.csv")

Refine - drop NAs

df.dropna(axis=0, inplace=True)
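
Before dropping rows it may be worth checking how much data would be lost. The sketch below uses a small made-up frame (not the loan data) to count missing values per column and the number of rows that `dropna(axis=0)` removes.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values, invented for illustration
toy = pd.DataFrame({
    "age": [24, 28, np.nan, 36],
    "income": [19200.0, np.nan, 60000.0, 62000.0],
    "default": [0, 1, 0, 0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# axis=0 drops every row that has at least one missing value
cleaned = toy.dropna(axis=0)
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```

If a large share of rows disappears, imputing values or dropping only the affected columns might be a better choice than dropping rows.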

Transform - log scale

df['log_age'] = np.log(df.age)
df['log_income'] = np.log(df.income)
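
One caveat with the log transform: `np.log` is only finite for strictly positive values. The sketch below, on made-up incomes, shows what happens with a zero and one possible alternative, `np.log1p`. Whether the loan data contains zeros is not stated in the source, so this is a precaution rather than a required fix.

```python
import numpy as np
import pandas as pd

# Toy incomes including a zero, invented for illustration
toy = pd.DataFrame({"income": [19200.0, 66000.0, 0.0]})

# np.log(0) evaluates to -inf; errstate only silences NumPy's warning here
with np.errstate(divide="ignore"):
    toy["log_income"] = np.log(toy["income"])

# log1p computes log(1 + x), so a zero income maps to 0.0 instead of -inf
toy["log1p_income"] = np.log1p(toy["income"])

print(toy)
```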

Explore - age, income & default

from plotnine import *
%matplotlib inline


import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns
plt.matshow(df.select_dtypes(include='number').corr())

df.head()


	default	amount	grade	years	ownership	income	age	log_age	log_income
0	0	1000	B	2.0	RENT	19200.0	24	3.178054	9.862666
1	1	6500	A	2.0	MORTGAGE	66000.0	28	3.332205	11.097410
2	0	2400	A	2.0	RENT	60000.0	36	3.583519	11.002100
3	0	10000	C	3.0	RENT	62000.0	24	3.178054	11.034890
4	1	4000	C	2.0	RENT	20000.0	28	3.332205	9.903488

df['default'] = df['default'].astype('category')

ggplot(df) + aes('grade', fill ="default") + geom_bar(position = 'fill')
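
The filled bar chart shows the share of defaults within each grade. The same numbers can be computed as a table with `pd.crosstab`. The sketch below uses only the five rows printed by `df.head()` above, so the shares are illustrative and not representative of the full dataset.

```python
import pandas as pd

# The grade and default columns of the five rows shown by df.head()
toy = pd.DataFrame({
    "grade": ["B", "A", "A", "C", "C"],
    "default": [0, 1, 0, 0, 1],
})

# normalize="index" turns counts into row-wise shares,
# the tabular counterpart of geom_bar(position='fill')
share = pd.crosstab(toy["grade"], toy["default"], normalize="index")
print(share)
```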


ggplot(df) + aes('grade', 'ownership', fill ="default") + geom_jitter(alpha = 0.2)


ggplot(df) + aes('ownership', fill ="default") + geom_bar(position = 'fill')


(
  ggplot(df) + 
  aes('years', '..count..', color = 'default') + 
  geom_freqpoly()
)


(
  ggplot(df) + 
  aes('amount', '..count..', color = 'default') + 
  geom_freqpoly(binwidth = 0.05) +
  scale_x_log10()
)


(
  ggplot(df) + 
  aes('income', '..count..', color = 'default') + 
  geom_freqpoly(binwidth = 0.05) +
  scale_x_log10()
)


(
    ggplot(df) + 
    aes('grade', 'income', color = 'default') + 
    geom_jitter(alpha = 0.2) + geom_boxplot() +
    scale_y_log10() +
    facet_wrap('default')
)


# grade is categorical, so count its levels with geom_bar rather than binning
ggplot(df) + aes('grade') + geom_bar()

ggplot(df) + aes('log_age', 'log_income') + geom_bin2d() + facet_wrap('default')


Model - Build a tree classifier

from sklearn import tree
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from firefly.client import Client

# Features: age and income; target: the default flag
X = df[['age', 'income']]
y = df['default']
clf = tree.DecisionTreeClassifier(max_depth=10).fit(X, y)
joblib.dump(clf, "clf.pkl")

['clf.pkl']
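
The classifier above is fit on all rows, so nothing here tells us how well it generalizes. A minimal sketch of a hold-out check follows. It runs on synthetic data generated inside the snippet (the data-generating rule, the sample size and `max_depth=3` are all assumptions made for illustration), not on `loan.csv`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (illustration only)
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 70, size=n)
income = rng.lognormal(mean=11, sigma=0.5, size=n)

# Assumed rule: lower income means a higher probability of default
p_default = 1 / (1 + np.exp(2 * (np.log(income) - 11)))
y = rng.random(n) < p_default
X = np.column_stack([age, income])

# Hold out 30% of the rows; stratify keeps the default rate similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class (default)
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"hold-out AUC: {auc:.3f}")
```

A deep tree such as `max_depth=10` can end up with leaves that contain a single class, which yields predicted probabilities of exactly 0 or 1. Comparing training and hold-out scores is one way to see whether that is happening.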

Build - the ML API

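%%file simple.py
import numpy as np
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
model = joblib.load("clf.pkl")

def predict(age, amount):
    # Note: the saved classifier was fit on the columns ('age', 'income'),
    # so the second value reaches the model as the income feature.
    features = [age, amount]
    prob0, prob1 = model.predict_proba([features])[0]
    return prob1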

Overwriting simple.py
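
Before serving the function, it can help to exercise it locally. The sketch below repeats the save, load and predict round trip on the five `(age, income, default)` rows printed by `df.head()`. It writes to a temporary directory rather than `clf.pkl`, and the parameter name `income` is a choice made for this sketch.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# The five (age, income) rows and default labels shown by df.head()
X = [[24, 19200.0], [28, 66000.0], [36, 60000.0], [24, 62000.0], [28, 20000.0]]
y = [0, 1, 0, 0, 1]

# Save to a temporary path so the real clf.pkl is not overwritten
path = os.path.join(tempfile.mkdtemp(), "clf.pkl")
joblib.dump(DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y), path)

model = joblib.load(path)

def predict(age, income):
    # The feature order must match the column order used for training
    features = [age, income]
    prob0, prob1 = model.predict_proba([features])[0]
    return float(prob1)

# A training row labelled as a default
print(predict(28, 66000.0))
```

With only five distinct training rows the tree memorizes them, so this checks the plumbing and says nothing about model quality.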

Deploy - the ML API

Run the following commands in your terminal:

 cd credit-risk/notebooks/
 firefly simple.predict

Interact - get prediction using API

simple = Client("http://127.0.0.1:8000")
simple.predict(age=28, amount=10000)

0.5373423860329777

simple.predict(age=50, amount=240000)

1.0
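
The `Client` wrapper hides the HTTP call. For callers that cannot use the firefly client, a sketch of the raw request follows. It assumes, and this is worth checking against the firefly documentation, that the server exposes the function as a JSON `POST` to `/predict` with the keyword arguments in the body. The helper names `build_predict_request` and `call_predict` are hypothetical.

```python
import json
import urllib.request

def build_predict_request(base_url, **params):
    """Build (but do not send) a JSON POST request for the predict function."""
    # ASSUMPTION: the function is served at <base_url>/predict and reads
    # its keyword arguments from a JSON body
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_predict(base_url, **params):
    """Send the request; requires the server from the Deploy step to be running."""
    with urllib.request.urlopen(build_predict_request(base_url, **params)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Inspect the request without contacting the server
req = build_predict_request("http://127.0.0.1:8000", age=28, amount=10000)
print(req.get_method(), req.full_url, req.data)
```

With the server running, `call_predict("http://127.0.0.1:8000", age=28, amount=10000)` would send the same inputs as the first `simple.predict` call above.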
