-
Notifications
You must be signed in to change notification settings - Fork 24
ENH Register models trained outside of Civis Platform #242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Register models trained outside of Civis Platform #242
Conversation
8586e97 to
f81823c
Compare
If you train a scikit-learn compatible estimator outside of Civis Platform, you can use this to upload it to Civis Platform and prepare it for scoring with CivisML. There's a new Custom Script which will introspect metadata necessary for CivisML and make itself appear sufficiently like a CivisML training job that it can be used as input to a scoring job.
f81823c to
8a233d1
Compare
|
The template numbers for registration jobs should be changed when we release CivisML v2.2 and before the next civis-python release. Putting the placeholders in there seemed like the best way to make sure we remember to do it. |
| 9112: 9113, # v1.1 | ||
| 8387: 9113, # v1.0 | ||
| 7020: 7021, # v0.5 | ||
| 11028: 10616, # v2.2 registration CHANGE ME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make a ticket as an additional reminder, if you haven't already?
| cross-validation happens over all steps together. | ||
|
|
||
| You can define your model in two ways, either by selecting a pre-defined algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be worth mentioning the option of passing a pre-fit model here, and linking to the relevant section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What were you picturing here? This section is talking about training models; registering a model trained outside of Platform bypasses this step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just imagining a sentence like "If you have already trained a model and wish to score it using CivisML, see [section link]." My goal is to make it so that the documentation will lead you to the right place; the section title "defining a model" could be interpreted as providing the definition of a model that's already trained, and since it's an early section, users are likely to start there. You don't have to include it if you don't want to, but users sometimes struggle to find things in our documentation, so I'm trying to find ways to make that easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. That makes sense; I added a couple of sentences.
civis/ml/_model.py
Outdated
| model : sklearn.base.BaseEstimator or int | ||
| The model object. This must be a fitted scikit-learn compatible | ||
| Estimator object, or else the integer Civis File ID of a | ||
| pickle which stores such an object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be joblib-serialized? If so, we should mention that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The joblib.load will work on regular pickles. I added a clause about joblib; if you think it's confusing I can change it or take it out.
civis/ml/_model.py
Outdated
| These will be used to ensure that tables input for prediction | ||
| have the correct features in the correct order. | ||
| primary_key : string, optional | ||
| The unique ID (primary key) of the scoring dataset.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stray period here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think both of these are stray since it's not a complete sentence.
| from civis.ml import _model | ||
|
|
||
|
|
||
| LATEST_TRAIN_TEMPLATE = 10582 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what this change is for. It seems like this is another constant that we'll have to remember to update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests should fail if we don't update it, which makes it much easier to remember. The problem I had was that we'd used max(_PRED_TEMPLATES) to infer what I've set here as LATEST_TRAIN_TEMPLATE. With the way I've implemented model registration, we're no longer guaranteed that this max operation gives us the most recent template for training jobs. I could build in logic which find the maximum of the keys in _PRED_TEMPLATES which aren't also in REGISTRATION_TEMPLATES. I started to be concerned that adding extra logic like that would make the tests more reliable. Do you think it's worth writing a get_latest_train_template function for the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. I think this change is fine; an obvious constant we have to change is better than having to debug a less obvious helper function.
elsander
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment about the documentation I was imagining. I'll approve after that change, or if you don't want to make the change I can approve as is.
elsander
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
If you train a scikit-learn compatible estimator outside of Civis Platform, you can use this to upload it to Civis Platform and prepare it for scoring with CivisML. There's a new Custom Script which will introspect metadata necessary for CivisML and make itself appear sufficiently like a CivisML training job that it can be used as input to a scoring job.