Skip to content

[DOCS][ML][SPARK-11964] Add in Pipeline Import/Export Documentation #10179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from

Conversation

bllchmbrs
Copy link
Contributor

Adding in Pipeline Import and Export Documentation.

@bllchmbrs
Copy link
Contributor Author

not sure if my notations are correct for the title @jkbradley, let me know if I need to change anything!

@jkbradley
Copy link
Member

Can you please add "[ML]" to the PR title? And also edit the first comment in the PR to describe what the PR is doing (since it will become part of the commit message)? I'll review it now, thanks!

@jkbradley
Copy link
Member

Oh also please remove "SPARK-6725" from the PR title. It's better to have a 1 JIRA - 1 PR correspondence.

## Example: Saving and Loading a Previously Created Model Pipeline

Often times it is worth it to save a model to disk for usage later. In Spark 1.6, similar model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML Models such as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for later use" instead of "for usage later" sounds better to me.
"In Spark 1.6, a model import/export" instead of "similar"
"ML models"

@BenFradet
Copy link
Contributor

A few minor comments on the documentation, but otherwise it lgtm.

@jkbradley
Copy link
Member

This is really similar to the existing Pipeline examples...as it probably should be. However, that makes me wonder if the best way to do this would be to:

  • Add a very small text description subsection at the end of the "Main concepts in Pipelines" section.
  • Modify one of the existing code examples to include save/load.

That should simplify the PR a lot. Thanks!

@bllchmbrs bllchmbrs changed the title [DOCS][SPARK-11964][SPARK-6725] Add in Pipeline Import/Export Documentation [DOCS][ML][SPARK-11964] Add in Pipeline Import/Export Documentation Dec 7, 2015
@bllchmbrs
Copy link
Contributor Author

@jkbradley will make those changes shortly.

@BenFradet will make those changes as well.

@bllchmbrs
Copy link
Contributor Author

@jkbradley does this work for you by the way?


## Example: Saving and Loading a Pipeline

Often times it is worth it to save a model to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML models such as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a model import/export functionality...

@BenFradet
Copy link
Contributor

LGTM except one minor comment.

@bllchmbrs
Copy link
Contributor Author

@BenFradet integrated your feedback thanks.

@jkbradley
Copy link
Member

@anabranch Hm, I may not have been clear enough. The save/load functionality seems general and important enough that it should go under the "Main concepts in Pipelines" section; I would put a subsection with a small paragraph (without code) at the end of the "Main concepts in Pipelines" section, just before the "Code example" section. I would then modify the first code example "Example: Estimator, Transformer, and Param" to include saving and loading the pipeline. Thanks!

@bllchmbrs
Copy link
Contributor Author

@jkbradley gotcha! I misinterpreted your last comments, my fault.

One thing I'm confused about though is that the Estimator, Transformer, and Param section doesn't seem to mention pipelines explicitly, more focusing on those components themselves. It doesn't necessarily seem like the best place to talk about pipeline persistence? I'm sure there's just something that I'm missing.

I've updated to move it into the pipeline section but please let me know if you would like me to change it. to the params one and demo it more as a feature. Will be available tomorrow so I can turn it around quickly!

@@ -140,8 +140,8 @@ If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s
method on the `DataFrame` before passing the `DataFrame` to the next stage.

A `Pipeline` is an `Estimator`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
`Transformer`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a `Transformer`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid unneeded changes like this since they can cause conflicts.

@@ -471,6 +489,14 @@ model.transform(test)
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}

// or use the "loadedModel" to make predictions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this

@jkbradley
Copy link
Member

Thanks for the updates! Just minor comments now.

I agree we need to improve and reorganize the sections explaining Pipelines; we're working on that in some separate PRs.

@bllchmbrs
Copy link
Contributor Author

@jkbradley should be good to go! Sorry for being such a pain!


// and load it back in during production
val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model")
// or equivalently
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and next line

I like the idea in this & the next line, but I'd remove this since it will cause problems if users blindly copy and paste the code into a spark shell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be crystal clear, you would like me to remove 470 - 472? The "equivalently" bit, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (470-471). I'd keep the newline in 472. Thanks!

@jkbradley
Copy link
Member

No, no problem. Just 1 more comment left

@bllchmbrs
Copy link
Contributor Author

@jkbradley should be good to go now!

@jkbradley
Copy link
Member

LGTM pending tests.
Thanks!

@SparkQA
Copy link

SparkQA commented Dec 11, 2015

Test build #2206 has finished for PR 10179 at commit 10ccc22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class ExecutorClassLoader(\n

@jkbradley
Copy link
Member

Merging with master and branch-1.6

@asfgit asfgit closed this in aa305dc Dec 11, 2015
asfgit pushed a commit that referenced this pull request Dec 11, 2015
Adding in Pipeline Import and Export Documentation.

Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #10179 from anabranch/master.

(cherry picked from commit aa305dc)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants