home | amitkaps.com | bargava.com

Build & Deploy ML Models in Cloud


Motivation for the Session

  • Solve a business problem
  • Understand the end-to-end approach
  • Build a data-driven Machine Learning application on the cloud

Our approach is to work through a case-driven example, going wide rather than deep. The approach is meant to be both practical and scalable.



INTRO


Let's start by understanding the overall approach.

                FRAME  ——> ACQUIRE  ——> REFINE ——>  
                                                  \
                                                TRANSFORM <——
                                                    ↑          ↘  
                                                    |        EXPLORE
                                                    ↓          ↗
                                                  MODEL   <——
                                                  /      
                BUILD <—— DEPLOY <—— INSIGHT <—— 

  • FRAME: Problem definition
  • ACQUIRE: Data ingestion
  • REFINE: Data wrangling
  • TRANSFORM: Feature creation
  • EXPLORE: Feature selection
  • MODEL: Model creation
  • INSIGHT: Model selection
  • DEPLOY: Model deployment
  • BUILD: Application building


FRAME


"Doing data science requires quite a bit of thinking and we believe that when you’ve completed a good data science analysis, you’ve spent more time thinking than doing." - Roger Peng

A start-up has been providing consumer loans for the last few years. It now plans to take a data-driven view of its loan portfolio. What types of questions can it ask?

  • What is the trend of loan defaults?
  • Do older customers have more loan defaults?
  • Which customer is likely to have a loan default?
  • Why do customers default on their loan?

Types of data-driven analytics

  • Descriptive: Understand patterns, trends, deviations and outliers
  • Inquisitive: Conduct hypothesis testing
  • Predictive: Make a prediction
  • Causal: Establish a causal link

Our Question: What is the probability of a loan default?
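
As a rough illustration of the first two question types, the sketch below computes a descriptive summary (default rate by year) and an inquisitive comparison (default rate for older versus younger customers). The table, its `year` column, and the age cut-off of 40 are hypothetical and chosen only for the example; they are not taken from the loan data used later.

```python
import pandas as pd

# Hypothetical toy loan book (made-up values, for illustration only)
loans = pd.DataFrame({
    "year":    [2015, 2015, 2016, 2016, 2017, 2017],
    "age":     [24, 52, 31, 47, 28, 60],
    "default": [0, 0, 0, 1, 1, 1],
})

# Descriptive: what is the trend of loan defaults?
default_rate_by_year = loans.groupby("year")["default"].mean()

# Inquisitive: do older customers default more often?
# The cut-off of 40 is an arbitrary assumption for this sketch.
loans["older"] = loans["age"] >= 40
default_rate_by_age_group = loans.groupby("older")["default"].mean()

print(default_rate_by_year)
print(default_rate_by_age_group)
```

A difference between the two groups here is only a starting point for a hypothesis test, and it says nothing about cause. The predictive question chosen above needs a model, which is built in the Model step.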



ACQUIRE


"Data is the new oil"

Ways to acquire data (typical data sources)

  • Downloaded from an internal system
  • Obtained from a client or other third party
  • Extracted from a web-based API
  • Scraped from a website
  • Extracted from a PDF file
  • Gathered manually and recorded

Data Formats

  • Flat files (e.g. csv)
  • Excel files
  • Database (e.g. MySQL)
  • JSON
  • HDFS (Hadoop)
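
Several of the formats listed above can be read into a pandas DataFrame with one call each. The following is a minimal sketch under a few assumptions: the two-row sample is made up, the file names are placeholders, and SQLite stands in for MySQL so that no database server is needed. Excel files can be read in the same way with `pd.read_excel`, which needs an additional engine package and is left out here.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample standing in for the loan data
sample = pd.DataFrame({"age": [24, 28], "income": [19200.0, 66000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)

    # Flat file (CSV)
    sample.to_csv(tmp / "loan.csv", index=False)
    from_csv = pd.read_csv(tmp / "loan.csv")

    # JSON, stored as a list of records (one object per row)
    sample.to_json(tmp / "loan.json", orient="records")
    from_json = pd.read_json(tmp / "loan.json", orient="records")

    # Database: SQLite is used instead of MySQL to keep the sketch self-contained
    conn = sqlite3.connect(str(tmp / "loan.db"))
    sample.to_sql("loan", conn, index=False)
    from_db = pd.read_sql("SELECT age, income FROM loan", conn)
    conn.close()

print(from_csv)
print(from_json)
print(from_db)
```
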

# Load the libraries and the loan data
import numpy as np
import pandas as pd
df = pd.read_csv("loan.csv")

Refine - drop NAs

df.dropna(axis=0, inplace=True) 
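
Before dropping rows, it can help to see how much data is missing and how many rows the drop will cost. The sketch below uses a small made-up frame (not the loan data) to show the check.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps (made-up values)
raw = pd.DataFrame({
    "age":    [24, 28, np.nan, 36],
    "income": [19200.0, np.nan, 60000.0, 62000.0],
})

missing_per_column = raw.isna().sum()   # number of NAs in each column
clean = raw.dropna(axis=0)              # drop every row that has at least one NA
rows_dropped = len(raw) - len(clean)

print(missing_per_column)
print(rows_dropped)
```

If a large share of rows would be lost, imputing values may be worth considering instead of dropping them.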

Transform - log scale

df['log_age'] = np.log(df.age)
df['log_income'] = np.log(df.income)
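
The log scale is commonly used for right-skewed quantities such as income, because it pulls in the long upper tail. The sketch below illustrates this on a handful of income values; the first five appear in the sample rows of the loan data and the last one is a made-up outlier. Note that `np.log` assumes strictly positive inputs.

```python
import numpy as np
import pandas as pd

# Five incomes from the loan data sample plus one hypothetical outlier
incomes = pd.Series([19200.0, 20000.0, 60000.0, 62000.0, 66000.0, 1_200_000.0])
log_incomes = np.log(incomes)

# Skewness before and after the transform
print(incomes.skew(), log_incomes.skew())
```
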

Explore - age, income & default

from plotnine import *
import matplotlib.pyplot as plt
%matplotlib inline
# Correlation matrix of the numeric columns only
plt.matshow(df.select_dtypes(include='number').corr())
df.head()
   default  amount grade  years ownership   income  age   log_age  log_income
0        0    1000     B    2.0      RENT  19200.0   24  3.178054    9.862666
1        1    6500     A    2.0  MORTGAGE  66000.0   28  3.332205   11.097410
2        0    2400     A    2.0      RENT  60000.0   36  3.583519   11.002100
3        0   10000     C    3.0      RENT  62000.0   24  3.178054   11.034890
4        1    4000     C    2.0      RENT  20000.0   28  3.332205    9.903488
df['default'] = df['default'].astype('category')

ggplot(df) + aes('grade', fill ="default") + geom_bar(position = 'fill')

ggplot(df) + aes('grade', 'ownership', fill ="default") + geom_jitter(alpha = 0.2)

ggplot(df) + aes('ownership', fill ="default") + geom_bar(position = 'fill')

(
  ggplot(df) + 
  aes('years', '..count..', color = 'default') + 
  geom_freqpoly(binwidth = 1)
)
(
  ggplot(df) + 
  aes('amount', '..count..', color = 'default') + 
  geom_freqpoly(binwidth = 0.05) +
  scale_x_log10()
)

(
  ggplot(df) + 
  aes('income', '..count..', color = 'default') + 
  geom_freqpoly(binwidth = 0.05) +
  scale_x_log10()
)

(
    ggplot(df) + 
    aes('grade', 'income', color = 'default') + 
    geom_jitter(alpha = 0.2) + geom_boxplot() +
    scale_y_log10() +
    facet_wrap('default')
)

# grade is categorical, so count it with a bar chart rather than a binned histogram
ggplot(df) + aes('grade') + geom_bar()
ggplot(df) + aes('log_age', 'log_income') + geom_bin2d() + facet_wrap('default')
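
The filled bar charts in this section show the share of defaults within each category. The same proportions can be read as a table with `pd.crosstab`. The sketch below uses only the five rows printed by `df.head()` as a stand-in, so the numbers are illustrative and do not describe the full data set.

```python
import pandas as pd

# The five sample rows shown by df.head(), used as a stand-in for the full data
sample = pd.DataFrame({
    "default": [0, 1, 0, 0, 1],
    "grade":   ["B", "A", "A", "C", "C"],
})

# Row-normalised counts: share of non-defaults (0) and defaults (1) per grade
share = pd.crosstab(sample["grade"], sample["default"], normalize="index")
print(share)
```
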


Model - Build a tree classifier

from sklearn import tree
import joblib  # sklearn.externals.joblib is no longer available in recent scikit-learn releases
from firefly.client import Client
# Features: age and income; target: default
X = df.loc[:, ['age', 'income']]
y = df.loc[:, 'default']
clf = tree.DecisionTreeClassifier(max_depth=10).fit(X, y)
joblib.dump(clf, "clf.pkl")
['clf.pkl']
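
The classifier above is fitted on all rows, so its quality on unseen loans is not measured. One common check is to hold out part of the data and compare ROC AUC on the training and test splits. The sketch below does this on synthetic data; the generated `age` and `income` values and the rule linking income to default are assumptions made up for the example, not properties of the loan data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for (age, income); NOT the loan data
age = rng.integers(20, 70, size=n)
income = rng.lognormal(mean=11, sigma=0.5, size=n)

# Hypothetical rule: lower income means a higher default probability
p_default = 1 / (1 + np.exp(2 * (np.log(income) - 11)))
y = (rng.random(n) < p_default).astype(int)
X = np.column_stack([age, income])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Compare a shallow tree with the depth used above
results = {}
for depth in (3, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = {
        "train": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "test": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }
    print(depth, results[depth])
```

A large gap between the training and test scores at a given depth suggests the tree is memorising the training rows, in which case a smaller `max_depth` may generalise better.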

Build - the ML API

%%file simple.py
import numpy as np
import joblib
model = joblib.load("clf.pkl")

def predict(age, income):
    # Feature order must match the training columns: [age, income]
    features = [age, income]
    prob0, prob1 = model.predict_proba([features])[0]
    # Return the probability of default as a plain Python float
    return float(prob1)
Overwriting simple.py

Deploy - the ML API

Run the following commands in your terminal:

 cd credit-risk/notebooks/
 firefly simple.predict

Interact - get prediction using API

simple = Client("http://127.0.0.1:8000")
simple.predict(age=28, income=10000)
0.5373423860329777
simple.predict(age=50, income=240000)
1.0
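
The deployed function can also be reached without the Python client. The sketch below builds a plain HTTP request using only the standard library. It assumes that firefly exposes the function at the path `/predict` and accepts the arguments as a JSON body in a POST request; treat that as an assumption to verify against the firefly documentation. The helper names `build_predict_request` and `call_predict` are hypothetical, and the payload keys must match the parameter names of `predict` in `simple.py`.

```python
import json
import urllib.request


def build_predict_request(base_url, payload):
    """Build (but do not send) a JSON POST request for the predict endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(base_url, payload):
    """Send the request; this needs the firefly server to be running."""
    request = build_predict_request(base_url, payload)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Keys must match the parameter names of predict() in simple.py
payload = {"age": 28, "income": 10000}
req = build_predict_request("http://127.0.0.1:8000", payload)
print(req.get_method(), req.full_url, req.data)
```

Only the request is built here, so the block runs without a server. Calling `call_predict("http://127.0.0.1:8000", payload)` performs the actual round trip once the API is up.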