[DOCS][ML][SPARK-11964] Add in Pipeline Import/Export Documentation #10179
Conversation
Needed to import the types specifically, not the general pyspark.sql
not sure if my notations are correct for the title @jkbradley, let me know if I need to change anything!
Can you please add "[ML]" to the PR title? And also edit the first comment in the PR to describe what the PR is doing (since it will become part of the commit message)? I'll review it now, thanks!
Oh also please remove "SPARK-6725" from the PR title. It's better to have a 1 JIRA - 1 PR correspondence.
## Example: Saving and Loading a Previously Created Model Pipeline

Often times it is worth it to save a model to disk for usage later. In Spark 1.6, similar model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML Models such as:
"for later use" instead of "for usage later" sounds better to me.
"In Spark 1.6, a model import/export" instead of "similar"
"ML models"
A few minor comments on the documentation, but otherwise it lgtm.
This is really similar to the existing Pipeline examples...as it probably should be. However, that makes me wonder if the best way to do this would be to:
That should simplify the PR a lot. Thanks!
@jkbradley will make those changes shortly. @BenFradet will make those changes as well.
@jkbradley does this work for you by the way?
## Example: Saving and Loading a Pipeline

Often times it is worth it to save a model to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML models such as:
a model import/export functionality...
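For readers of this thread, here is a minimal Scala sketch of the save/load usage the quoted paragraph describes, against the Spark 1.6 `spark.ml` API. The `pipeline` and `training` values and the path are illustrative stand-ins for the ones built in the guide's existing example:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}

// `pipeline` and `training` stand in for the Pipeline and the training DataFrame
// constructed earlier in the guide's example; the path below is only illustrative.
val model = pipeline.fit(training)

// Persist the fitted PipelineModel to disk.
model.save("/tmp/spark-logistic-regression-model")

// Later, e.g. in a separate application, load it back and use it like any Transformer.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
```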
LGTM except one minor comment.
@BenFradet integrated your feedback, thanks.
@anabranch Hm, I may not have been clear enough. The save/load functionality seems general and important enough that it should go under the "Main concepts in Pipelines" section; I would put a subsection with a small paragraph (without code) at the end of the "Main concepts in Pipelines" section, just before the "Code example" section. I would then modify the first code example "Example: Estimator, Transformer, and Param" to include saving and loading the pipeline. Thanks!
@jkbradley gotcha! I misinterpreted your last comments, my fault. One thing I'm confused about though is that the Estimator, Transformer, and Param section doesn't seem to mention pipelines explicitly, focusing more on those components themselves. It doesn't necessarily seem like the best place to talk about pipeline persistence? I'm sure there's just something that I'm missing. I've updated to move it into the pipeline section, but please let me know if you would like me to change it to the params one and demo it more as a feature. Will be available tomorrow so I can turn it around quickly!
@@ -140,8 +140,8 @@ If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s
method on the `DataFrame` before passing the `DataFrame` to the next stage.

A `Pipeline` is an `Estimator`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
`Transformer`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a `Transformer`.
I would avoid unneeded changes like this since they can cause conflicts.
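For readers following along, a minimal sketch of the fit/transform relationship described in the quoted doc text above; `pipeline`, `training`, and `test` are assumed stand-ins for the values built in the guide's "Estimator, Transformer, and Param" example:

```scala
import org.apache.spark.ml.PipelineModel

// A Pipeline is an Estimator, so fit() produces a PipelineModel, which is a Transformer.
// `pipeline`, `training`, and `test` are assumed to come from the guide's example.
val model: PipelineModel = pipeline.fit(training)
val predictions = model.transform(test)
```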
@@ -471,6 +489,14 @@ model.transform(test)
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}

// or use the "loadedModel" to make predictions
No need for this
Thanks for the updates! Just minor comments now. I agree we need to improve and reorganize the sections explaining Pipelines; we're working on that in some separate PRs.
@jkbradley should be good to go! Sorry for being such a pain!
// and load it back in during production
val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
// or equivalently
Remove this and next line
I like the idea in this & the next line, but I'd remove this since it will cause problems if users blindly copy and paste the code into a spark shell.
Just to be crystal clear, you would like me to remove 470 - 472? The "equivalently" bit, right?
Yes (470-471). I'd keep the newline in 472. Thanks!
No, no problem. Just 1 more comment left
@jkbradley should be good to go now!
LGTM pending tests.
Test build #2206 has finished for PR 10179 at commit
Merging with master and branch-1.6
Adding in Pipeline Import and Export Documentation.