How to use vetiver with custom pipelines #192

@SamEdwardes

Is your feature request related to a problem? Please describe.

When you deploy a Vetiver model to Connect and its pipeline uses a "custom" object, the deployment itself succeeds, but the API fails when you open it.

Describe the solution you'd like

I would like to be able to deploy a Vetiver model that uses custom sklearn transformers.

Describe alternatives you've considered

  • You could package the custom transformer as a Python package and import it in your model deployment code. Then, when vetiver deploys to Connect, Connect will install the package and have access to the transformer. However, this has major downsides: users need to know how to build a Python package, and they need to publish it somewhere that both their development environment and Connect can reach. Posit Package Manager serves this use case, but many users will not have access to it.
  • Maybe you could define the custom transformer in another file (e.g. transformer.py) and upload that file to Connect as one of the extra files, so that the API can import it. I do not think this will work, though, because vetiver writes the api.py file for you.
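To make the second alternative concrete, here is a minimal sketch of why a separate module could help in principle. It uses only the standard library: the class below is a simplified stand-in for the real sklearn transformer, the temporary directory and the module name `transformer` are hypothetical, and nothing here confirms that Connect or the vetiver-generated app would actually import an uploaded file. It only shows that pickle can restore an object when its defining module is importable at load time.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical contents of transformer.py. The real class would also
# inherit from sklearn's BaseEstimator and TransformerMixin.
MODULE_SOURCE = '''
class FeatureSelector:
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns]
'''

# Write the module to a scratch directory and make it importable.
workdir = Path(tempfile.mkdtemp())
(workdir / "transformer.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(workdir))
importlib.invalidate_caches()

transformer = importlib.import_module("transformer")
payload = pickle.dumps(transformer.FeatureSelector(["loan_age", "fico"]))

# Drop the loaded module to mimic a fresh process that has never imported it.
del sys.modules["transformer"]

# pickle re-imports `transformer` from sys.path to rebuild the object.
restored = pickle.loads(payload)
print(type(restored).__module__, restored.columns)
```

Whether this helps on Connect depends on the uploaded file sitting on the import path of the deployed API, which is exactly the open question in the bullet above.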

I am not sure what the "best" solution is. I would love to hear what you have seen other users do, or how you would approach :)

Additional context

Here is an example script:

# %% [markdown]
# # Initial Model Fit

# %% [markdown]
# In this notebook we fit a simple machine learning model to predict prepayments for student loans.  Towards this end we use the **scikit-learn** package.  Once our model is fit we deploy it to Posit Connect using the **vetiver** package.

# %% [markdown]
# ## Initial Setup

# %% [markdown]
# Let's begin by loading some packages that we will need.

# %%
import pandas as pd
import sklearn
import pins
import vetiver

# %% [markdown]
# Next, let's read in the `CONNECT_SERVER` and `CONNECT_API_KEY` environment variables.

# %%
import os
import dotenv

dotenv.load_dotenv(override=True)
rsc_server = os.environ['CONNECT_SERVER']
rsc_key = os.environ['CONNECT_API_KEY']

# %% [markdown]
# ## Reading In Training Data

# %% [markdown]
# We can now read in our training data.

# %%
df_train = pd.read_csv('data/student-loan-2022-12-01.csv')
df_train

# %% [markdown]
# Let's separate features and labels.

# %%
df_X = df_train.drop(columns=['paid_label'])
df_y = df_train[['paid_label']]

# %% [markdown]
# ## Defining the Modeling Pipeline

# %% [markdown]
# Next, we identify the columns of `df_train` that we would like to use as predictors.  We are going to ignore `trade_date` because it is only there to tell us which month the data comes from.  We are also going to ignore `mos_to_repay` because it is zero for all but a few observations.

# %%
features = ['loan_age', 'cosign', 'income_annual', 'upb', 'monthly_payment', 
            'fico', 'origbalance', 'repay_status', 'mos_to_balln']

# %% [markdown]
# In order to select these columns inside the pipeline, we define a custom `FeatureSelector` transformer.

# %%
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.columns]

# %%
FeatureSelector(features).fit_transform(df_train).head()

# %%
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ('feature_selector', FeatureSelector(features)),
    ('decision_tree', DecisionTreeClassifier())
])

# %% [markdown]
# ## Fit the Model

# %%
model.fit(df_X, df_y)

# %% [markdown]
# ## Vetiver

# %% [markdown]
# ### Create a **vetiver** Model

# %%
from vetiver import VetiverModel
meta = {'training_data': df_train['trade_date'][0]}
v = VetiverModel(
        model, 
        model_name = "user.name/student_loan_python", 
        #prototype_data = df_X,
        metadata = meta,
    )
v

# %% [markdown]
# ### Pin (Store and Version) the Model

# %%
from vetiver import vetiver_pin_write

model_board = pins.board_rsconnect(server_url=rsc_server, api_key=rsc_key, allow_pickle_read=True)
vetiver_pin_write(model_board, v)

# %%
model_board.pin_versions('user.name/student_loan_python')

# %% [markdown]
# ### Create a REST API

# %%
from rsconnect.api import RSConnectServer
connect_server = RSConnectServer(url=rsc_server, api_key=rsc_key)

vetiver.deploy_rsconnect(
    connect_server=connect_server,
    board=model_board,
    pin_name="user.name/student_loan_python",
    version=model_board.pin_versions('user.name/student_loan_python').tail(1)['version'].iloc[0],
    #app_id='d42d839a-0672-4747-9773-174d73eff647', # <-- how would I know this for the initial deployment?
    title="Student Loan - Model - FastAPI",
    extra_files=['requirements.txt'],
)

# %%

The relevant code chunk is this:

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.columns]

# %%
FeatureSelector(features).fit_transform(df_train).head()

# %%
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ('feature_selector', FeatureSelector(features)),
    ('decision_tree', DecisionTreeClassifier())
])

When you deploy this model to Connect, Connect does not know what FeatureSelector is, and will fail to start the API.
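A minimal sketch of the likely mechanism, assuming the pinned model is serialized with pickle or joblib (the `allow_pickle_read=True` argument in the script suggests so). Pickle stores a reference to the class (module name plus class name), not the class source, so a process that never defined the class cannot rebuild the object. The class below is a simplified stand-in without sklearn, and deleting the name is only a way to imitate a fresh process such as the deployed API.

```python
import pickle


class FeatureSelector:
    """Simplified stand-in for the custom transformer (no sklearn needed here)."""

    def __init__(self, columns):
        self.columns = columns


payload = pickle.dumps(FeatureSelector(["loan_age", "fico"]))

# The payload records where to find the class, not its definition.
module_name = FeatureSelector.__module__.encode()
has_reference = module_name in payload and b"FeatureSelector" in payload
print("class stored by reference:", has_reference)

# Imitate a process where the class was never defined.
del FeatureSelector

try:
    pickle.loads(payload)
    load_error = None
except (AttributeError, ModuleNotFoundError) as exc:
    load_error = exc

print(type(load_error).__name__, load_error)
```

In the notebook the class lives in the interactive session's main module, so the stored reference points at a module that the deployed API does not share.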

CC @pritamdalal @pritamdalal-posit
