
[SPARK-21386] ML LinearRegression supports warm start from user provided initial model. #18610


Closed
wants to merge 2 commits

Conversation

@yanboliang (Contributor) commented Jul 12, 2017

What changes were proposed in this pull request?

Allow users to set an initial model when running LinearRegression. This is the first step toward warm-start support in ML; we can distribute tasks for other algorithms after this gets merged.

Note: The design behind this PR follows the discussions at #11119 and #17117. Compared with those PRs, KMeans involves lots of issues that are not related to the initial model itself, so I'd like to start this work from LinearRegression, which is more independent and clean.

How was this patch tested?

Added unit test.

@SparkQA commented Jul 12, 2017

Test build #79553 has finished for PR 18610 at commit ae8f32b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -113,7 +115,14 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite =>
val estimator2 = testDefaultReadWrite(estimator)
testEstimatorParams.foreach { case (p, v) =>
val param = estimator.getParam(p)
assert(estimator.get(param).get === estimator2.get(param).get)
if (param.name == "initialModel") {
// Estimator's `initialModel` has same type as the model produced by this estimator.
Contributor Author

This assumes that initialModel must be of the same type as the model the estimator produces. We could enforce this assumption by defining the trait as HasInitialModel[T <: Model[T]], or by adding a new checkInitialModelData argument to testEstimatorAndModelReadWrite. I'm open to others' thoughts and will update here after collecting feedback. Thanks.
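
For context, a minimal sketch of what such a typed shared-param trait could look like, assuming the [T <: Model[T]] bound discussed above (names and doc string are illustrative; the trait actually added in the PR may differ):

import org.apache.spark.ml.Model
import org.apache.spark.ml.param.{Param, Params}

// Illustrative only: the initial model is constrained to the same model type
// the estimator produces, matching the assumption described in this comment.
trait HasInitialModel[T <: Model[T]] extends Params {
  final val initialModel: Param[T] =
    new Param[T](this, "initialModel", "a model used to warm-start training")

  final def getInitialModel: T = $(initialModel)
}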

@SparkQA commented Jul 12, 2017

Test build #79554 has finished for PR 18610 at commit 0b7509e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor Author) commented Jul 12, 2017

@SparkQA commented Jul 12, 2017

Test build #79562 has finished for PR 18610 at commit e49d452.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor Author)

This failure is unrelated. Jenkins, retest this please.

@SparkQA commented Jul 12, 2017

Test build #79567 has finished for PR 18610 at commit e49d452.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor)

LGTM, this is really a great feature

@MLnick (Contributor) commented Jul 25, 2017

There was a lot of discussion around the KMeans initial model, but here we seem to be ignoring certain issues. For example, does the initialModel override the param settings (e.g. reg param, tolerance, iterations)? Here only the coefficients are used.

@yanboliang (Contributor Author) commented Jul 25, 2017

@MLnick Thanks for your comments. I think the implementation in this PR follows the discussion around the KMeans initial model in #11119. In that PR we didn't use initialModel to override all param settings (e.g. maxIter, tol), but we did override some params (e.g. k). In this PR there are no params like k, so it's not necessary to override any params; that's why I started the general warm-start work from LinearRegression, which is cleaner. And I don't think we should use initialModel to override params like tolerance and iterations. Why would we need the tolerance of the initialModel? I just want training to start from a given set of coefficients, and I will specify a new tolerance to control its convergence.
BTW, I also checked other libraries (e.g. scikit-learn), and the semantics of this implementation are consistent with those popular libraries. Still, I'm very happy to get more comments from you.

@@ -365,7 +398,11 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, effectiveL1RegFun, $(tol))
}

val initialCoefficients = Vectors.zeros(numFeatures)
val initialCoefficients = if (isSet(initialModel)) {
$(initialModel).coefficients
Contributor Author

Here we only use the coefficients, not the intercept, because the intercept is computed in closed form after the coefficients have converged.
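
To make this concrete: once the coefficients have converged, the intercept can be recovered from the (weighted) label and feature means as intercept = yMean - coefficients . featuresMean, so any intercept carried by the initial model would be recomputed anyway. A toy illustration (the values, and the names yMean/featuresMean standing in for statistics computed internally, are made up):

// Toy closed-form intercept computation.
val coefficients = Array(0.5, -1.2)
val featuresMean = Array(2.0, 3.0)
val yMean = 4.0
val intercept = yMean - coefficients.zip(featuresMean).map { case (c, m) => c * m }.sum
// intercept == 4.0 - (0.5 * 2.0 + (-1.2) * 3.0) == 6.6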

if (initialModel.hasParam("initialModel")) {
initialModel.clear(initialModel.getParam("initialModel"))
}
initialModel.save(initialModelPath)
Contributor

To avoid recursion, this modifies the model to cut the recursive chain, but that may cause problems: it mutates a model that might be used elsewhere.

Contributor Author

@WeichenXu123 Can you elaborate on the concrete scenario? Thanks.

Contributor

I suggest that setInitialModel first copy the passed-in model (via Model.copy) and then clear the initialModel param on the copy to avoid the recursion risk. Because the passed-in model may be used elsewhere, we should avoid modifying its params.

Another way is to avoid adding the initialModel param to the Model at all and only add it to the Estimator. (I think this way is more reasonable.)
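
A rough sketch of the copy-then-clear idea, assuming it lived as a setter inside the estimator and that the model also carried an initialModel param (which the PR ultimately avoids); setInitialModel here is illustrative, not the merged API:

import org.apache.spark.ml.param.ParamMap

// Hypothetical setter: copy the passed-in model so its params are never mutated,
// then drop any nested initialModel on the copy to break the recursive chain.
def setInitialModel(value: LinearRegressionModel): this.type = {
  val copied = value.copy(ParamMap.empty)
  if (copied.hasParam("initialModel")) {
    copied.clear(copied.getParam("initialModel"))
  }
  set(initialModel, copied)
}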

Contributor Author

Actually we took the latter approach: initialModel is a param only on the Estimator, not on the Model. Thanks.

Contributor

Fair enough.

@zhengruifeng (Contributor)

@yanboliang Agree. I think if there are params which control the "shape" of the model coefficients, then they should be overridden when we use an initial model, like k in KMeans, GMM, and LDA; other params should be left alone.

@JohnHBrock

Is there any more work to do before this can get merged?

@hhbyyh (Contributor) commented Aug 27, 2017

Just to confirm: have we agreed that the initialModel should be of type [T <: Model[T]] rather than a String (a path to the saved model)? Sorry, I didn't find the related discussion.

In my experience, it really helps if I can directly distinguish/confirm which model I loaded as the initial model.
E.g., it's convenient if estimator.getInitialModel() returns something like
"lr_20iteration_reg0.001.model" or
"lr_40iteration_reg0.001.model"

Also, serializing a model (the initial model) into an Estimator does not feel right to me. Even if we want to use [T <: Model[T]] as the type of the initial model, I wouldn't mind not saving this parameter with the Estimator.

Besides, using a String param means we don't need to handle the recursion issue, and the overall implementation can be much simpler, since we don't need to save a Param[Model] on the Estimator object.

I understand there are scenarios where users want to continue model training without saving it first, but I guess saving the model is a good habit...
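
For comparison, a minimal sketch of the String-path alternative described above (hypothetical; initialModelPath is not a param in this PR): the estimator would hold only a path and load the model on demand at fit time, so nothing model-sized is serialized with the estimator and the recursive initialModel-inside-initialModel issue disappears.

import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.ml.regression.LinearRegressionModel

// Hypothetical alternative design, not what this PR implements.
trait HasInitialModelPath extends Params {
  final val initialModelPath: Param[String] =
    new Param[String](this, "initialModelPath", "path to a saved model used for warm start")

  final def getInitialModelPath: String = $(initialModelPath)

  // Loaded lazily during fit(); the concrete loader depends on the estimator.
  protected def loadInitialModel(): LinearRegressionModel =
    LinearRegressionModel.load($(initialModelPath))
}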

@@ -226,6 +246,12 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String

if (($(solver) == Auto &&
numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {

if (isSet(initialModel)) {
logWarning("Initial model will be ignored if fitting by normal solver. " +
Contributor

The initial model is a pretty important parameter. Users who set it will expect it to take effect, and they may miss the warning in the overwhelming Spark logs.
Maybe we can move the parameter check to transformSchema and throw an exception.

Contributor Author

Fair enough, I will update it after collecting all comments. Thanks.

Contributor

I disagree. The normal solver provides an exact solution, so it is harmless to ignore the initial model. I don't see a reason to do anything more than log a warning.

@@ -72,6 +72,22 @@ private[regression] trait LinearRegressionParams extends PredictorParams
}

/**
* Params for linear regression.
*/
private[regression] trait LinearRegressionParams extends LinearRegressionModelParams
Contributor

It may be cleaner if we just move the initialModel param into LinearRegression, so we don't have to touch the class hierarchy.

Contributor Author

The refactoring here is not only for this PR. Since a model usually has fewer params than its estimator, we make the estimator params extend the model params in other places too, for example ALSParams extends ALSModelParams. I think this makes the class hierarchy clearer.

@yanboliang (Contributor Author)

@hhbyyh We reached agreement at #11119 that the initialModel should be of type [T <: Model[T]]. I understand the scenario you mentioned; however, I think these are different scenarios:
1. If we set initialModel as a string path, we would have to guarantee that a valid model is stored at that path, which we can't do if we only hold a reference to the string path.
2. I agree that the scenario you mentioned above is really useful, but I think that requirement goes beyond the initial model. The following code snippet can meet it under my framework:

val initialModel = LinearRegressionModel.load("path_lr_20iteration_reg0.001.model")
val lr = new LinearRegression().setInitialModel(initialModel)
val model = lr.fit(dataset)

3. If users have a model in memory and want to use it as the initial model for training another one, they would first need to save the model and then set the corresponding path as the initial model path.
4. Regarding your comment about serializing the initial model, I agree with you; I'm OK with not saving initialModel when saving the estimator.
What do you think? Thanks.

@hhbyyh (Contributor) commented Aug 31, 2017

Thanks for the reply. Since there's already an agreement, I will hold off on my suggestion about the initialModel data type.

@SparkQA commented Oct 27, 2017

Test build #83120 has finished for PR 18610 at commit e49d452.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Oct 27, 2017

A couple questions:

  1. do we actually need to save the initialModel when we persist the current model? I'm not sure it's necessary and it adds complexity. Also, we could add it in the future anyway.
  2. How can users set an initial model from an array of coefficients? If initialModel has to be a LinearRegressionModel then how can I create one from an array? The model constructor is private... If you can only set an initial model that was trained in Spark and saved as a Spark model, then that severely limits the utility of this functionality. Can we just overload the set method?
final val initialModel: Param[Vector]
def setInitialModel(value: Vector) = ...
def setInitialModel(value: LinearRegressionModel) = setInitialModel(value.coefficients)
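
A hedged sketch of that overload idea, mirroring the signatures above (hypothetical; this is not the API implemented in this PR):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.ml.regression.LinearRegressionModel

// Hypothetical: store only a coefficient vector, and accept either raw coefficients
// or an existing model (whose public coefficients field supplies the vector).
trait HasInitialCoefficients extends Params {
  final val initialModel: Param[Vector] =
    new Param[Vector](this, "initialModel", "coefficients used to warm-start training")

  def setInitialModel(value: Vector): this.type = set(initialModel, value)

  def setInitialModel(value: LinearRegressionModel): this.type =
    setInitialModel(value.coefficients)
}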

@holdenk (Contributor) commented Dec 6, 2017

@sethah I don't think we need to save the initial model right now, and if it's going to slow down the PR's progress, maybe dropping it makes sense -- but I guess it depends on how complicated that is?
For the second point, I also think creating an initial model from a set of coefficients could be useful, but I think that could be addressed in a follow-up issue -- what do you think?

@JohnHBrock

What else needs to be done before this can be merged?

@holdenk (Contributor) commented Nov 9, 2018

@JohnHBrock this PR is pretty old so the biggest challenge is going to be updating it to the current master branch. There's some discussion around the types needing to be changed as well. If this is a thing you want to work on I'd love to do what I can to help with the review process.

@WeichenXu123 (Contributor)

@yanboliang This PR is stale. If there are no further updates, would you mind closing it? Thanks.

@yanboliang closed this Oct 18, 2019
@yanboliang deleted the spark-21386 branch on October 18, 2019
@tilayealemu

@yanboliang The functionality here would have been really useful. Is there another solution you can suggest?
