[SPARK-21386] ML LinearRegression supports warm start from user provided initial model. #18610
Conversation
Test build #79553 has finished for PR 18610 at commit
@@ -113,7 +115,14 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite =>
       val estimator2 = testDefaultReadWrite(estimator)
       testEstimatorParams.foreach { case (p, v) =>
         val param = estimator.getParam(p)
-        assert(estimator.get(param).get === estimator2.get(param).get)
+        if (param.name == "initialModel") {
+          // Estimator's `initialModel` has the same type as the model produced by this estimator.
This assumes that `initialModel` must be of the same type as the model the estimator produces. We should enforce this assumption, either by making the trait `HasInitialModel[T <: Model[T]]` or by adding a new argument `checkInitialModelData` to `testEstimatorAndModelReadWrite`. I'm open to hearing others' thoughts and will update here after collecting feedback. Thanks.
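A minimal, self-contained sketch of the suggested typed trait (toy stand-ins for Spark ML's types; all names here are illustrative, not the actual Spark API):

```scala
// Toy stand-in for Spark ML's Model type.
trait Model[M <: Model[M]]

// The type bound ensures an estimator can only accept an initial model
// of the same type as the model it produces.
trait HasInitialModel[M <: Model[M]] {
  private var _initialModel: Option[M] = None
  def setInitialModel(model: M): this.type = { _initialModel = Some(model); this }
  def initialModel: Option[M] = _initialModel
}

class LinearRegressionModel extends Model[LinearRegressionModel]
class LinearRegression extends HasInitialModel[LinearRegressionModel]
```

With this bound, passing a model of a different type to `setInitialModel` fails at compile time, enforcing the assumption statically rather than by a runtime check.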
Test build #79554 has finished for PR 18610 at commit
Test build #79562 has finished for PR 18610 at commit
This failure is irrelevant. Jenkins, retest this please.
Test build #79567 has finished for PR 18610 at commit
LGTM, this is really a great feature.
There was a lot of discussion around the
@MLnick Thanks for your comments. I think the implementation of this PR followed the discussion around the
@@ -365,7 +398,11 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
         new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, effectiveL1RegFun, $(tol))
       }

-      val initialCoefficients = Vectors.zeros(numFeatures)
+      val initialCoefficients = if (isSet(initialModel)) {
+        $(initialModel).coefficients
Here we only use `coefficients`, not `intercept`, because the intercept is computed in closed form after the coefficients have converged.
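A simplified, self-contained sketch of the warm-start initialization described above (plain arrays stand in for Spark's `Vector` type; names are illustrative):

```scala
object WarmStartSketch {
  // Reuse only the coefficients from the initial model; the intercept is
  // recomputed in closed form once the coefficients converge, so carrying
  // it over would have no effect on the final model.
  def initialCoefficients(numFeatures: Int,
                          initialModelCoefficients: Option[Array[Double]]): Array[Double] =
    initialModelCoefficients.getOrElse(Array.fill(numFeatures)(0.0))
}
```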
      if (initialModel.hasParam("initialModel")) {
        initialModel.clear(initialModel.getParam("initialModel"))
      }
      initialModel.save(initialModelPath)
In order to avoid recursion, the model is modified here to break the cycle, but this mutates a model that may be in use elsewhere, which could cause problems.
@WeichenXu123 Can you elaborate on a concrete scenario? Thanks.
I suggest that `setInitialModel` first copy the passed-in model (via `Model.copy`) and then clear the `initialModel` param on the copy, to avoid the recursion risk. Because the passed-in model may be used elsewhere, we should avoid modifying its params.
Another way is to avoid adding the initial-model param to the Model at all and only add it to the Estimator. (I think this way is more reasonable.)
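The copy-then-clear suggestion can be sketched with a toy params map standing in for Spark's param machinery (all names here are hypothetical):

```scala
// Toy model with a mutable map of params; Spark's real ParamMap is richer.
class ToyModel(initial: Map[String, Any]) {
  var params: Map[String, Any] = initial
  def copyModel: ToyModel = new ToyModel(params)
  def clear(name: String): Unit = params = params - name
}

object SetInitialModelSketch {
  // Copy the caller's model and clear `initialModel` on the copy only, so the
  // original is never mutated and saving the copy cannot recurse.
  def sanitize(model: ToyModel): ToyModel = {
    val safeCopy = model.copyModel
    safeCopy.clear("initialModel")
    safeCopy
  }
}
```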
Actually we did it the latter way: `initialModel` is a param only for the Estimator, not for the Model. Thanks.
Fair enough.
@yanboliang Agree. I think if there are params which control the "shape" of the model coefficients, then they should be overridden if we use an initial model. Like
Is there any more work to do before this can get merged?
Just to confirm, have we agreed that the initialModel should be of type [T <: Model[T]] rather than a String (a path to the saved model)? Sorry, I didn't find the related discussion. In my experience, it can really help if I can directly distinguish/confirm which model I loaded as the initial model. And serializing a model (initial model) to an

Besides, using a String param means we don't need to handle the recursion issue, and the overall implementation can be much simpler since we don't need to save a Param[Model] on the Estimator object. I understand there are scenarios where users want to continue model training without saving the model first,
@@ -226,6 +246,12 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String

       if (($(solver) == Auto &&
         numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {

+        if (isSet(initialModel)) {
+          logWarning("Initial model will be ignored if fitting by normal solver. " +
Since the initial model is a pretty important parameter, users who set it will expect it to take effect, and they may miss the warning in the overwhelming Spark logs. Maybe we can move the parameter check to `transformSchema` and throw an exception.
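A sketch of the suggested fail-fast check, with simplified signatures (the real `transformSchema` takes a `StructType`; the method and message here are hypothetical):

```scala
object SolverCheckSketch {
  // Hypothetical validation that could run inside transformSchema: fail fast
  // instead of silently ignoring the user's initial model.
  def validate(solver: String, initialModelIsSet: Boolean): Unit = {
    if (solver == "normal" && initialModelIsSet) {
      throw new IllegalArgumentException(
        "initialModel is ignored by the normal solver; use the l-bfgs solver for warm start.")
    }
  }
}
```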
Fair enough, I will update it after collecting all comments. Thanks.
I disagree. The normal solver provides an exact solution, so it is harmless to ignore the initial model. I don't see a reason to do anything more than log a warning.
@@ -72,6 +72,22 @@ private[regression] trait LinearRegressionParams extends PredictorParams
       }

+  /**
+   * Params for linear regression.
+   */
+  private[regression] trait LinearRegressionParams extends LinearRegressionModelParams
It may be cleaner to just move the `initialModel` param into LinearRegression, so we don't have to touch the class hierarchy.
The refactor here is not only for this PR. Since a model usually has fewer params than its estimator, we make EstimatorParams extend ModelParams in other places too, for example, ALSParams extends ALSModelParams. I think this makes the class hierarchy clearer.
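A toy sketch of the hierarchy being described, mirroring how `ALSParams` extends `ALSModelParams` (trait names here are simplified stand-ins):

```scala
// Params shared by both the model and the estimator live in the model trait.
trait ModelParamsSketch {
  def featuresCol: String = "features"
}

// The estimator trait extends the model trait and adds training-only params,
// since a fitted model usually has fewer params than its estimator.
trait EstimatorParamsSketch extends ModelParamsSketch {
  def maxIter: Int = 100
}
```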
@hhbyyh We got agreement that the initialModel should be of type
3. If users have a model in memory and want to use it as the initial model to train another one, they would need to save the model first and then set the corresponding path as the initial model path.
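For contrast, the String-path alternative being argued against would look roughly like the sketch below (hypothetical names; model loading is omitted since it depends on Spark's persistence layer):

```scala
class PathBasedEstimatorSketch {
  private var _initialModelPath: Option[String] = None

  // Users must first persist a model, then point the estimator at its path;
  // fit() would load the model from disk before optimization starts.
  def setInitialModelPath(path: String): this.type = {
    _initialModelPath = Some(path)
    this
  }
  def initialModelPath: Option[String] = _initialModelPath
}
```

This avoids the recursion issue entirely, but forces the save-then-load round trip described in the list item above.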
Thanks for the reply. Since there's already an agreement, I'll hold my suggestion on the initialModel data type.
Test build #83120 has finished for PR 18610 at commit
A couple questions:
final val initialModel: Param[Vector]
def setInitialModel(value: Vector) = ...
def setInitialModel(value: LinearRegressionModel) = setInitialModel(value.coefficients)
@sethah I don't think we need to save the initial model right now, and if it's going to slow down the PR's progress, maybe dropping it makes sense -- but I guess it depends on how complicated that is?
What else needs to be done before this can be merged?
@JohnHBrock this PR is pretty old, so the biggest challenge is going to be updating it to the current master branch. There's also some discussion about the types needing to be changed. If this is something you want to work on, I'd love to do what I can to help with the review process.
@yanboliang This PR is stale. If there are no further updates, would you mind closing it? Thanks.
@yanboliang The functionality here would have been really useful. Is there another solution you can suggest?
What changes were proposed in this pull request?
Allow users to set an initial model when running LinearRegression. This is the first step toward warm-start support in ML; we can distribute tasks for other algorithms after this gets merged.

Note: The design behind this PR follows the discussions at #11119 and #17117. With regard to those PRs, KMeans involves lots of issues which are not related to the initial model itself, so I'd like to start this work from LinearRegression, which is more independent and clean.

How was this patch tested?
Added unit tests.