
[SPARK-21386] ML LinearRegression supports warm start from user provided initial model. #18610


Closed
wants to merge 2 commits

Conversation

@yanboliang (Contributor) commented Jul 12, 2017

What changes were proposed in this pull request?

Allow users to set an initial model when running LinearRegression. This is the first step toward warm-start support in ML; we can distribute tasks for other algorithms after this gets merged.

Note: The design behind this PR follows the discussions at #11119 and #17117. Compared with those PRs, KMeans involves lots of issues that are not related to the initial model itself, so I'd like to start this work from LinearRegression, which is more independent and clean.

How was this patch tested?

Added unit test.

@SparkQA commented Jul 12, 2017

Test build #79553 has finished for PR 18610 at commit ae8f32b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -113,7 +115,14 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite =>
val estimator2 = testDefaultReadWrite(estimator)
testEstimatorParams.foreach { case (p, v) =>
val param = estimator.getParam(p)
assert(estimator.get(param).get === estimator2.get(param).get)
if (param.name == "initialModel") {
// Estimator's `initialModel` has same type as the model produced by this estimator.
Contributor Author

This assumes that initialModel must be of the same type as the model the estimator produces. We could enforce this assumption by defining the trait as HasInitialModel[T <: Model[T]], or by adding a new checkInitialModelData argument to testEstimatorAndModelReadWrite. I'm open to others' thoughts and will update here after collecting feedback. Thanks.
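
For context, a minimal sketch of what such a typed shared-param trait could look like, assuming the [T <: Model[T]] bound discussed above (names and doc string are illustrative; the trait actually added in the PR may differ):

import org.apache.spark.ml.Model
import org.apache.spark.ml.param.{Param, Params}

// Illustrative only: the initial model is constrained to the same model type
// the estimator produces, matching the assumption described in this comment.
trait HasInitialModel[T <: Model[T]] extends Params {
  final val initialModel: Param[T] =
    new Param[T](this, "initialModel", "a model used to warm-start training")

  final def getInitialModel: T = $(initialModel)
}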

@SparkQA commented Jul 12, 2017

Test build #79554 has finished for PR 18610 at commit 0b7509e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor Author) commented Jul 12, 2017

@SparkQA commented Jul 12, 2017

Test build #79562 has finished for PR 18610 at commit e49d452.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor Author)

This failure is unrelated. Jenkins, retest this please.

@SparkQA commented Jul 12, 2017

Test build #79567 has finished for PR 18610 at commit e49d452.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor)

LGTM, this is really a great feature

@MLnick (Contributor) commented Jul 25, 2017

There was a lot of discussion around the KMeans initial model, but here we seem to be ignoring certain issues. For example, does the initialModel override the param settings (e.g. reg param, tolerance, iterations)? Here only the coefficients are used.

@yanboliang (Contributor Author) commented Jul 25, 2017

@MLnick Thanks for your comments. I think the implementation in this PR follows the discussion around the KMeans initial model in #11119. In that PR we didn't use initialModel to override all param settings (e.g. maxIter, tol), but we did override some params (e.g. k). In this PR there are no params like k, so it's not necessary to override any params; that's why I started the general warm-start work from LinearRegression, which is cleaner. And I don't think we should use initialModel to override params like tolerance and iterations. Why would we need the tolerance of the initialModel? I just want training to start from a given set of coefficients, and I will specify a new tolerance to control its convergence.
BTW, I also checked other libraries (e.g. scikit-learn), and the semantics of this implementation are consistent with those popular libraries. Still, I'm very happy to get more comments from you.

@@ -365,7 +398,11 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, effectiveL1RegFun, $(tol))
}

val initialCoefficients = Vectors.zeros(numFeatures)
val initialCoefficients = if (isSet(initialModel)) {
$(initialModel).coefficients
Contributor Author

Here we only use the coefficients, not the intercept, because the intercept is computed in closed form after the coefficients have converged.
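
To make this concrete: once the coefficients have converged, the intercept can be recovered from the (weighted) label and feature means as intercept = yMean - coefficients . featuresMean, so any intercept carried by the initial model would be recomputed anyway. A toy illustration (the values, and the names yMean/featuresMean standing in for statistics computed internally, are made up):

// Toy closed-form intercept computation.
val coefficients = Array(0.5, -1.2)
val featuresMean = Array(2.0, 3.0)
val yMean = 4.0
val intercept = yMean - coefficients.zip(featuresMean).map { case (c, m) => c * m }.sum
// intercept == 4.0 - (0.5 * 2.0 + (-1.2) * 3.0) == 6.6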

if (initialModel.hasParam("initialModel")) {
initialModel.clear(initialModel.getParam("initialModel"))
}
initialModel.save(initialModelPath)
Contributor

To avoid recursion, this modifies the model to cut the recursive chain, but that may cause problems: it mutates a model that might be used elsewhere.

Contributor Author

@WeichenXu123 Can you elaborate on the concrete scenario? Thanks.

Contributor

I suggest that setInitialModel first copy the passed-in model (via Model.copy) and then clear the initialModel param on the copy to avoid the recursion risk. Because the passed-in model may be used elsewhere, we should avoid modifying its params.

Another way is to avoid adding the initialModel param to the Model at all and only add it to the Estimator. (I think this way is more reasonable.)
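
A rough sketch of the copy-then-clear idea, assuming it lived as a setter inside the estimator and that the model also carried an initialModel param (which the PR ultimately avoids); setInitialModel here is illustrative, not the merged API:

import org.apache.spark.ml.param.ParamMap

// Hypothetical setter: copy the passed-in model so its params are never mutated,
// then drop any nested initialModel on the copy to break the recursive chain.
def setInitialModel(value: LinearRegressionModel): this.type = {
  val copied = value.copy(ParamMap.empty)
  if (copied.hasParam("initialModel")) {
    copied.clear(copied.getParam("initialModel"))
  }
  set(initialModel, copied)
}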

Contributor Author

Actually we took the latter approach: initialModel is a param only on the Estimator, not on the Model. Thanks.

Contributor

Fair enough.

@zhengruifeng (Contributor)

@yanboliang Agree. I think if there are params which control the "shape" of the model coefficients, then they should be overridden when we use an initial model, like k in KMeans, GMM, and LDA; other params should be left alone.

@JohnHBrock

Is there any more work to do before this can get merged?

@hhbyyh (Contributor) commented Aug 27, 2017

Just to confirm: have we agreed that the initialModel should be of type [T <: Model[T]] rather than a String (a path to the saved model)? Sorry, I didn't find the related discussion.

In my experience, it really helps if I can directly distinguish/confirm which model I loaded as the initial model.
E.g., it's convenient if estimator.getInitialModel() returns something like
"lr_20iteration_reg0.001.model" or
"lr_40iteration_reg0.001.model"

Also, serializing a model (the initial model) into an Estimator does not feel right to me. Even if we want to use [T <: Model[T]] as the type of the initial model, I wouldn't mind not saving this parameter with the Estimator.

Besides, using a String param means we don't need to handle the recursion issue, and the overall implementation can be much simpler, since we don't need to save a Param[Model] on the Estimator object.

I understand there are scenarios where users want to continue model training without saving it first, but I guess saving the model is a good habit...
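
For comparison, a minimal sketch of the String-path alternative described above (hypothetical; initialModelPath is not a param in this PR): the estimator would hold only a path and load the model on demand at fit time, so nothing model-sized is serialized with the estimator and the recursive initialModel-inside-initialModel issue disappears.

import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.ml.regression.LinearRegressionModel

// Hypothetical alternative design, not what this PR implements.
trait HasInitialModelPath extends Params {
  final val initialModelPath: Param[String] =
    new Param[String](this, "initialModelPath", "path to a saved model used for warm start")

  final def getInitialModelPath: String = $(initialModelPath)

  // Loaded lazily during fit(); the concrete loader depends on the estimator.
  protected def loadInitialModel(): LinearRegressionModel =
    LinearRegressionModel.load($(initialModelPath))
}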

@@ -226,6 +246,12 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String

if (($(solver) == Auto &&
numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {

if (isSet(initialModel)) {
logWarning("Initial model will be ignored if fitting by normal solver. " +
Contributor

The initial model is a pretty important parameter. Users who set it will expect it to take effect, and they may miss the warning in the overwhelming Spark logs.
Maybe we can move the parameter check to transformSchema and throw an exception.

Contributor Author

Fair enough, I will update it after collecting all comments. Thanks.

Contributor

I disagree. The normal solver provides an exact solution, so it is harmless to ignore the initial model. I don't see a reason to do anything more than log a warning.

@@ -72,6 +72,22 @@ private[regression] trait LinearRegressionParams extends PredictorParams
}

/**
* Params for linear regression.
*/
private[regression] trait LinearRegressionParams extends LinearRegressionModelParams
Contributor

It may be cleaner if we just move the initialModel param into LinearRegression, so we don't have to touch the class hierarchy.

Contributor Author

The refactoring here is not only for this PR. Since a model usually has fewer params than its estimator, we make the estimator params extend the model params in other places too, for example ALSParams extends ALSModelParams. I think this makes the class hierarchy clearer.

@yanboliang (Contributor Author)

@hhbyyh We reached agreement at #11119 that the initialModel should be of type [T <: Model[T]]. I understand the scenario you mentioned; however, I think these are different scenarios:
1. If we set initialModel as a string path, we would have to guarantee that a valid model is stored at that path, which we can't do if we only hold a reference to the string path.
2. I agree that the scenario you mentioned above is really useful, but I think that requirement goes beyond the initial model. The following code snippet can meet it under my framework:

val initialModel = LinearRegressionModel.load("path_lr_20iteration_reg0.001.model")
val lr = new LinearRegression().setInitialModel(initialModel)
val model = lr.fit(dataset)

3. If users have a model in memory and want to use it as the initial model for training another one, they would first need to save the model and then set the corresponding path as the initial model path.
4. Regarding your comment about serializing the initial model, I agree with you; I'm OK with not saving initialModel when saving the estimator.
What do you think? Thanks.

@hhbyyh (Contributor) commented Aug 31, 2017

Thanks for the reply. Since there's already an agreement, I will hold off on my suggestion about the initialModel data type.

@SparkQA commented Oct 27, 2017

Test build #83120 has finished for PR 18610 at commit e49d452.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Oct 27, 2017

A couple questions:

  1. do we actually need to save the initialModel when we persist the current model? I'm not sure it's necessary and it adds complexity. Also, we could add it in the future anyway.
  2. How can users set an initial model from an array of coefficients? If initialModel has to be a LinearRegressionModel then how can I create one from an array? The model constructor is private... If you can only set an initial model that was trained in Spark and saved as a Spark model, then that severely limits the utility of this functionality. Can we just overload the set method?
final val initialModel: Param[Vector]
def setInitialModel(value: Vector) = ...
def setInitialModel(value: LinearRegressionModel) = setInitialModel(value.coefficients)
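
A hedged sketch of that overload idea, mirroring the signatures above (hypothetical; this is not the API implemented in this PR):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.ml.regression.LinearRegressionModel

// Hypothetical: store only a coefficient vector, and accept either raw coefficients
// or an existing model (whose public coefficients field supplies the vector).
trait HasInitialCoefficients extends Params {
  final val initialModel: Param[Vector] =
    new Param[Vector](this, "initialModel", "coefficients used to warm-start training")

  def setInitialModel(value: Vector): this.type = set(initialModel, value)

  def setInitialModel(value: LinearRegressionModel): this.type =
    setInitialModel(value.coefficients)
}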

@holdenk (Contributor) commented Dec 6, 2017

@sethah I don't think we need to save the initial model right now, and if it's going to slow down the PR's progress, maybe dropping it makes sense -- but I guess it depends on how complicated that is?
For the second point, I also think creating an initial model from a set of coefficients could be useful, but I think that could be addressed in a follow-up issue -- what do you think?

@JohnHBrock

What else needs to be done before this can be merged?

@holdenk (Contributor) commented Nov 9, 2018

@JohnHBrock this PR is pretty old so the biggest challenge is going to be updating it to the current master branch. There's some discussion around the types needing to be changed as well. If this is a thing you want to work on I'd love to do what I can to help with the review process.

@WeichenXu123 (Contributor)

@yanboliang This PR is stale. If there are no further updates, would you mind closing it? Thanks.

@yanboliang closed this Oct 18, 2019
@yanboliang deleted the spark-21386 branch on October 18, 2019
@tilayealemu

@yanboliang The functionality here would have been really useful. Is there another solution you can suggest?
