Skip to content

Conversation

@stephen-hoover
Copy link

If you train a scikit-learn compatible estimator outside of Civis Platform, you can use this to upload it to Civis Platform and prepare it for scoring with CivisML. There's a new Custom Script which will introspect metadata necessary for CivisML and make itself appear sufficiently like a CivisML training job that it can be used as input to a scoring job.

@stephen-hoover stephen-hoover added this to the Next Version milestone Mar 9, 2018
@stephen-hoover stephen-hoover force-pushed the paro-191-byo-trained-model branch 3 times, most recently from 8586e97 to f81823c Compare March 14, 2018 15:57
If you train a scikit-learn compatible estimator outside of Civis Platform, you can use this to upload it to Civis Platform and prepare it for scoring with CivisML. There's a new Custom Script which will introspect metadata necessary for CivisML and make itself appear sufficiently like a CivisML training job that it can be used as input to a scoring job.
@stephen-hoover stephen-hoover force-pushed the paro-191-byo-trained-model branch from f81823c to 8a233d1 Compare March 23, 2018 15:08
@stephen-hoover stephen-hoover changed the title WIP ENH BYO trained model ENH Register models trained outside of Civis Platform Mar 23, 2018
@stephen-hoover stephen-hoover requested a review from elsander March 23, 2018 15:25
@stephen-hoover
Copy link
Author

The template numbers for registration jobs should be changed when we release CivisML v2.2 and before the next civis-python release. Putting the placeholders in there seemed like the best way to make sure we remember to do it.

9112: 9113, # v1.1
8387: 9113, # v1.0
7020: 7021, # v0.5
11028: 10616, # v2.2 registration CHANGE ME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a ticket as an additional reminder, if you haven't already?

cross-validation happens over all steps together.

You can define your model in two ways, either by selecting a pre-defined algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth mentioning the option of passing a pre-fit model here, and linking to the relevant section.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were you picturing here? This section is talking about training models; registering a model trained outside of Platform bypasses this step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just imagining a sentence like "If you have already trained a model and wish to score it using CivisML, see [section link]." My goal is to make it so that the documentation will lead you to the right place; the section title "defining a model" could be interpreted as providing the definition of a model that's already trained, and since it's an early section, users are likely to start there. You don't have to include it if you don't want to, but users sometimes struggle to find things in our documentation, so I'm trying to find ways to make that easier.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That makes sense; I added a couple of sentences.

model : sklearn.base.BaseEstimator or int
The model object. This must be a fitted scikit-learn compatible
Estimator object, or else the integer Civis File ID of a
pickle which stores such an object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be joblib-serialized? If so, we should mention that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The joblib.load will work on regular pickles. I added a clause about joblib; if you think it's confusing I can change it or take it out.

These will be used to ensure that tables input for prediction
have the correct features in the correct order.
primary_key : string, optional
The unique ID (primary key) of the scoring dataset..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray period here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think both of these are stray since it's not a complete sentence.

from civis.ml import _model


LATEST_TRAIN_TEMPLATE = 10582
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this change is for. It seems like this is another constant that we'll have to remember to update.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests should fail if we don't update it, which makes it much easier to remember. The problem I had was that we'd used max(_PRED_TEMPLATES) to infer what I've set here as LATEST_TRAIN_TEMPLATE. With the way I've implemented model registration, we're no longer guaranteed that this max operation gives us the most recent template for training jobs. I could build in logic which find the maximum of the keys in _PRED_TEMPLATES which aren't also in REGISTRATION_TEMPLATES. I started to be concerned that adding extra logic like that would make the tests more reliable. Do you think it's worth writing a get_latest_train_template function for the tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I think this change is fine; an obvious constant we have to change is better than having to debug a less obvious helper function.

@elsander elsander assigned stephen-hoover and unassigned elsander Mar 23, 2018
Copy link
Contributor

@elsander elsander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment about the documentation I was imagining. I'll approve after that change, or if you don't want to make the change I can approve as is.

@elsander elsander assigned stephen-hoover and unassigned elsander Mar 26, 2018
Copy link
Contributor

@elsander elsander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@elsander elsander assigned stephen-hoover and unassigned elsander Mar 26, 2018
@stephen-hoover stephen-hoover merged commit 242d2e2 into civisanalytics:master Mar 26, 2018
@stephen-hoover stephen-hoover deleted the paro-191-byo-trained-model branch March 26, 2018 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants