
Commit 52daf49

asarb authored and srowen committed
[SPARK-27621][ML] Linear Regression - validate training related params such as loss only during fitting phase
## What changes were proposed in this pull request?

When the transform(...) method is called on a LinearRegressionModel created directly with the coefficients and intercepts, the following exception is encountered.

```
java.util.NoSuchElementException: Failed to find a default value for loss
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
  at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
  at org.apache.spark.ml.param.Params$class.$(params.scala:786)
  at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
  at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
  at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
  at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
  at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
```

This is because validateAndTransformSchema() is called during both the training and scoring phases, but I think the checks against training-related params such as loss should be performed during the training phase only. Please correct me if I'm missing anything :)

This issue was first reported for mleap (combust/mleap#455): when we serialize Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case we no longer have all the training params.

## How was this patch tested?

Added a unit test to check this scenario. Please let me know if there's anything additional required; this is the first PR that I've raised in this project.

Closes #24509 from ancasarb/linear_regression_params_fix.

Authored-by: asarb <asarb@expedia.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit 4241a72)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
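The failure mode in the stack trace can be illustrated without Spark. The following is a minimal sketch, not Spark's implementation: `ParamStore`, `estimatorParams`, and `rebuiltModelParams` are hypothetical names, and the sketch assumes that a lookup tries an explicit value first, then a default, and throws otherwise. The `"squaredError"` default shown for the estimator is likewise an illustrative assumption.

```scala
import scala.util.Try

// Hypothetical, Spark-free stand-in for a param holder: explicit values plus defaults.
final class ParamStore(values: Map[String, String], defaults: Map[String, String]) {
  // Lookup order assumed for this sketch: explicit value, then default, else throw.
  def getOrDefault(name: String): String =
    values.get(name).orElse(defaults.get(name)).getOrElse(
      throw new NoSuchElementException(s"Failed to find a default value for $name"))
}

object ParamLookupSketch {
  // An estimator-like holder that carries a default for the training param "loss".
  val estimatorParams = new ParamStore(Map.empty, Map("loss" -> "squaredError"))
  // A model rebuilt from coefficients alone: neither a value nor a default for "loss".
  val rebuiltModelParams = new ParamStore(Map.empty, Map.empty)

  def main(args: Array[String]): Unit = {
    // The estimator-like holder resolves "loss" through its default.
    println(estimatorParams.getOrDefault("loss"))
    // The rebuilt holder has nothing to resolve, so the lookup fails.
    println(Try(rebuiltModelParams.getOrDefault("loss")).isFailure)
  }
}
```

Any validation that reads `loss` unconditionally will therefore fail on the rebuilt holder, even though scoring never needs that param.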
1 parent 6071653 commit 52daf49

2 files changed: 19 additions, 6 deletions

mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

Lines changed: 7 additions & 6 deletions
@@ -107,12 +107,13 @@ private[regression] trait LinearRegressionParams extends PredictorParams
       schema: StructType,
       fitting: Boolean,
       featuresDataType: DataType): StructType = {
-    if ($(loss) == Huber) {
-      require($(solver) != Normal, "LinearRegression with huber loss doesn't support " +
-        "normal solver, please change solver to auto or l-bfgs.")
-      require($(elasticNetParam) == 0.0, "LinearRegression with huber loss only supports " +
-        s"L2 regularization, but got elasticNetParam = $getElasticNetParam.")
-
+    if (fitting) {
+      if ($(loss) == Huber) {
+        require($(solver) != Normal, "LinearRegression with huber loss doesn't support " +
+          "normal solver, please change solver to auto or l-bfgs.")
+        require($(elasticNetParam) == 0.0, "LinearRegression with huber loss only supports " +
+          s"L2 regularization, but got elasticNetParam = $getElasticNetParam.")
+      }
     }
     super.validateAndTransformSchema(schema, fitting, featuresDataType)
   }
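The shape of this change, with training-only checks running only when `fitting` is true, can be sketched in isolation. This is a simplified illustration rather than the Spark code: `FittingGateSketch`, `TrainingParams`, and `validate` are hypothetical names, and the params are modelled as an `Option` so that a rebuilt model can plainly lack them.

```scala
// Hypothetical, Spark-free sketch of gating training-only checks behind a `fitting` flag.
object FittingGateSketch {
  // Training params are optional to model a deserialized model that lacks them.
  final case class TrainingParams(loss: String, solver: String, elasticNetParam: Double)

  def validate(params: Option[TrainingParams], fitting: Boolean): Unit = {
    if (fitting) {
      // Only the fitting path dereferences the training params.
      val p = params.getOrElse(
        throw new NoSuchElementException("training params are required when fitting"))
      if (p.loss == "huber") {
        require(p.solver != "normal", "huber loss doesn't support the normal solver")
        require(p.elasticNetParam == 0.0, "huber loss only supports L2 regularization")
      }
    }
    // Schema checks shared by fitting and scoring would follow here.
  }

  def main(args: Array[String]): Unit = {
    // Scoring path: passes even though no training params are available.
    validate(None, fitting = false)
    // Fitting path with a valid huber configuration.
    validate(Some(TrainingParams("huber", "l-bfgs", 0.0)), fitting = true)
  }
}
```

With the gate in place, an invalid huber configuration is still rejected at fit time, while scoring no longer depends on params that were never serialized.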

mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

Lines changed: 12 additions & 0 deletions
@@ -182,6 +182,18 @@ class LinearRegressionSuite extends MLTest with DefaultReadWriteTest {
     assert(model.numFeatures === numFeatures)
   }
 
+  test("linear regression: can transform data with LinearRegressionModel") {
+    withClue("training related params like loss are only validated during fitting phase") {
+      val original = new LinearRegression().fit(datasetWithDenseFeature)
+
+      val deserialized = new LinearRegressionModel(uid = original.uid,
+        coefficients = original.coefficients,
+        intercept = original.intercept)
+      val output = deserialized.transform(datasetWithDenseFeature)
+      assert(output.collect().size > 0) // simple assertion to ensure no exception thrown
+    }
+  }
+
   test("linear regression: illegal params") {
     withClue("LinearRegression with huber loss only supports L2 regularization") {
       intercept[IllegalArgumentException] {
