[SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fit #15414

zhengruifeng · 2016-10-10T05:48:40Z

What changes were proposed in this pull request?

1. Move the label column cast into Predictor.
2. Then remove the casts that are no longer necessary.

How was this patch tested?

existing tests

SparkQA · 2016-10-10T06:27:44Z

Test build #66629 has finished for PR 15414 at commit 5cb06fc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-10T08:13:51Z

Test build #66635 has finished for PR 15414 at commit 6c2a8d0.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-10-10T08:17:43Z

Jenkins, test this please

SparkQA · 2016-10-10T08:28:50Z

Test build #66637 has finished for PR 15414 at commit 6c2a8d0.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-10-10T12:51:42Z

Jenkins, retest this please

SparkQA · 2016-10-10T13:53:02Z

Test build #66649 has finished for PR 15414 at commit 6c2a8d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2016-10-10T21:13:55Z

mllib/src/main/scala/org/apache/spark/ml/Predictor.scala

+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+    protected def castDataSet(dataset: Dataset[_]): DataFrame = {
+      val labelMeta = dataset.schema.fields.filter(_.name == $(labelCol)).head.metadata


Maybe simplify it: dataset.schema("value").metadata
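To illustrate the suggested simplification, here is a minimal sketch comparing the two lookups on a standalone schema. It assumes only the Spark SQL types are on the classpath; the schema, the column name "label" (standing in for $(labelCol)), and the metadata key are hypothetical and not taken from the patch.

```scala
import org.apache.spark.sql.types.{IntegerType, Metadata, MetadataBuilder, StringType, StructField, StructType}

object SchemaLookupSketch {
  // Hypothetical label column name, standing in for $(labelCol).
  val labelColName = "label"

  // Hypothetical metadata attached to the label column.
  val labelMeta: Metadata = new MetadataBuilder().putLong("numClasses", 2L).build()

  // Hypothetical two-column schema; only the label column carries metadata.
  val schema: StructType = StructType(Seq(
    StructField(labelColName, IntegerType, nullable = false, labelMeta),
    StructField("features", StringType)))

  // Lookup as written in the diff: scan the fields and take the first match.
  def viaFilter: Metadata = schema.fields.filter(_.name == labelColName).head.metadata

  // Lookup via StructType.apply(name), which returns the matching StructField.
  def viaApply: Metadata = schema(labelColName).metadata
}
```

Both lookups return the same Metadata when the column exists. They differ only in how a missing column fails: `head` on an empty array throws NoSuchElementException, while `schema(name)` throws IllegalArgumentException.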

sethah · 2016-10-10T21:30:52Z

What do you think about adding a new suite, PredictorSuite, where we can create a mock predictor and call train on data of various types? The train method can just require that the label column is DoubleType:

class MockPredictor(override val uid: String)
  extends Predictor[Vector, MockPredictor, MockPredictionModel] {

  override def train(dataset: Dataset[_]): MockPredictionModel = {
    require(dataset.schema("label").dataType == DoubleType)
    new MockPredictionModel(uid)
  }

  override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
}

class MockPredictionModel(override val uid: String)
  extends PredictionModel[Vector, MockPredictionModel] {

  override def predict(features: Vector): Double = 1.0

  override def copy(extra: ParamMap): MockPredictionModel = defaultCopy(extra)
}

Then we just have a test that calls fit for each type of data.

zhengruifeng · 2016-10-11T03:11:13Z

Ok, I will create this Suite.

SparkQA · 2016-10-11T04:27:59Z

Test build #66710 has finished for PR 15414 at commit 6c61e73.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-10-12T03:43:45Z

mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala

+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)


move into companion object.

sethah · 2016-10-12T03:46:38Z

mllib/src/main/scala/org/apache/spark/ml/Predictor.scala

+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+    protected def castDataSet(dataset: Dataset[_]): DataFrame = {


let's just put this logic directly in fit
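For readers following along, the cast under discussion could be sketched as a standalone helper like the one below. This is an illustration under stated assumptions, not the code in the patch: the object and method names are hypothetical, and it uses only the public `Column.as(alias, metadata)` and `Dataset.withColumn` calls so that it compiles outside the Spark source tree.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object LabelCastSketch {
  /**
   * Hypothetical helper: returns `dataset` with the column named `labelCol`
   * cast to DoubleType, re-attaching the column's original metadata.
   */
  def castLabelToDouble(dataset: Dataset[_], labelCol: String): DataFrame = {
    // Read the metadata before casting so it can be re-attached explicitly.
    val labelMeta = dataset.schema(labelCol).metadata
    // withColumn replaces the existing column of the same name in place.
    dataset.withColumn(labelCol, col(labelCol).cast(DoubleType).as(labelCol, labelMeta))
  }
}
```

Inlining this logic in fit would mean every Predictor subclass receives a DoubleType label in train without casting it again.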

sethah · 2016-10-12T04:00:19Z

mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala

@@ -117,7 +117,7 @@ object MLTestingUtils extends SparkFunSuite {
      Seq(ShortType, LongType, IntegerType, FloatType, ByteType, DoubleType, DecimalType(10, 0))
    types.map { t =>
        val castDF = df.select(col(labelColName).cast(t), col(featuresColName))
-        t -> TreeTests.setMetadata(castDF, 2, labelColName, featuresColName)
+        t -> TreeTests.setMetadata(castDF, 0, labelColName, featuresColName)


What is this for? If the intent is to force getNumClasses to infer the number of classes, then you're no longer testing the not inferred case. Further, the point of this PR is to eliminate the need to do that since it is not a robust solution, IMO.

Also, I'd like to remove the dependence on TreeTests here (and genRegressionDF) and just explicitly set the attributes in the functions.

zhengruifeng · 2016-10-12

Ok, I will revert this.

sethah · 2016-10-12T04:02:08Z

mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala

+
+  test("should support all NumericType labels and not support other types") {
+    val predictor = new MockPredictor("mock")
+    MLTestingUtils.checkNumericTypes[MockPredictionModel, MockPredictor](


Why don't we just cycle through the types here and call fit? I think the way it is now is a bit confusing.

zhengruifeng · 2016-10-12

OK, I will update this.
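A loop of the kind suggested here might look roughly like the sketch below. It assumes spark-mllib is on the classpath and a Spark version in which Predictor.fit already casts the label column, as this PR proposes. The names StubPredictor, StubModel, and run are hypothetical, and the data values are made up for illustration.

```scala
import org.apache.spark.ml.{PredictionModel, Predictor}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object NumericLabelLoopSketch {
  // Hypothetical stub model; only construction is exercised.
  class StubModel(override val uid: String) extends PredictionModel[Vector, StubModel] {
    override def predict(features: Vector): Double = throw new NotImplementedError()
    override def copy(extra: ParamMap): StubModel = throw new NotImplementedError()
  }

  // Hypothetical stub predictor; train only checks the label type it receives.
  class StubPredictor(override val uid: String)
    extends Predictor[Vector, StubPredictor, StubModel] {
    override def train(dataset: Dataset[_]): StubModel = {
      require(dataset.schema("label").dataType == DoubleType)
      new StubModel(uid)
    }
    override def copy(extra: ParamMap): StubPredictor = throw new NotImplementedError()
  }

  /** Returns (number of numeric label types fit successfully, whether a string label was rejected). */
  def run(spark: SparkSession): (Int, Boolean) = {
    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 2.0, 3.0)),
      (1, Vectors.dense(0.0, 3.0, 9.0)),
      (0, Vectors.dense(0.0, 2.0, 6.0))
    )).toDF("label", "features")

    val numericTypes: Seq[DataType] =
      Seq(ShortType, LongType, IntegerType, FloatType, ByteType, DoubleType, DecimalType(10, 0))
    val predictor = new StubPredictor("stub")

    // Each fit call fails inside train if the label did not arrive as DoubleType.
    numericTypes.foreach { t =>
      predictor.fit(df.select(col("label").cast(t), col("features")))
    }

    // A non-numeric label should be rejected by schema validation before train runs.
    val rejected = try {
      predictor.fit(df.select(col("label").cast(StringType), col("features")))
      false
    } catch {
      case _: IllegalArgumentException => true
    }
    (numericTypes.size, rejected)
  }
}
```

Keeping the loop inline in the test makes the covered types visible at the call site, which seems to be the readability concern raised above.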

sethah · 2016-10-12T04:41:31Z

mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala

+  class MockPredictionModel(override val uid: String)
+    extends PredictionModel[Vector, MockPredictionModel] {
+
+    override def predict(features: Vector): Double = 1.0


override def predict(features: Vector): Double = throw new NotImplementedError()

We can do this for everything except train.

SparkQA · 2016-10-12T12:46:57Z

Test build #66814 has finished for PR 15414 at commit 6ef17b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-10-13T03:02:17Z

@sethah I have made some modifications according to the comments.

sethah · 2016-10-12T17:32:10Z

mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala

+      new MockPredictionModel(uid)
+    }
+
+    override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)


change the copy methods to throw NotImplementedError

sethah · 2016-10-13T03:21:38Z

Thanks, I'll take a more detailed look in the next couple of days. Let's also wait and see if we can get @yanboliang or @jkbradley to give an opinion.

sethah · 2016-10-13T03:23:42Z

mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala

+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {


don't need DefaultReadWriteTest

SparkQA · 2016-10-13T07:03:22Z

Test build #66872 has finished for PR 15414 at commit 7e2d501.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext

SparkQA · 2016-10-13T08:46:29Z

Test build #66880 has finished for PR 15414 at commit 7cb4510.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-10-18T03:03:08Z

@jkbradley @yanboliang Could you please review this? This PR unifies the labelCol casting and fixes the bug described in https://issues.apache.org/jira/browse/SPARK-17797

zhengruifeng · 2016-10-28T03:09:50Z

@jkbradley @yanboliang Just re-pinging for your opinions.

jkbradley · 2016-10-31T17:08:07Z

Can you please document in Predictor that it accepts all NumericType labels? Other than that, this LGTM. Thanks!

sethah · 2016-11-01T00:35:00Z

LGTM as well after adding @jkbradley's suggestion.

zhengruifeng · 2016-11-01T02:04:42Z

@jkbradley @sethah I added a comment. Thanks for the reviews.

SparkQA · 2016-11-01T02:58:39Z

Test build #67861 has finished for PR 15414 at commit 810c973.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-11-01T17:45:46Z

LGTM
Merging with master
Thanks!

## What changes were proposed in this pull request?

1, move cast to `Predictor`
2, and then, remove unnecessary cast

## How was this patch tested?

existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes apache#15414 from zhengruifeng/move_cast.

hhbyyh reviewed Oct 10, 2016

zhengruifeng force-pushed the move_cast branch from 6c2a8d0 to 6c61e73 Compare October 11, 2016 03:20

sethah reviewed Oct 12, 2016

zhengruifeng force-pushed the move_cast branch from 6c61e73 to 6ef17b7 Compare October 12, 2016 11:44

sethah reviewed Oct 13, 2016

zhengruifeng added 6 commits November 1, 2016 09:28

create pr

9e4413f

rename func

6ad6508

revert lr

41e63e2

del cast in regression

764650a

add testsuite for predictor

59d02d5

fix one nit

e0bbc34

zhengruifeng added 4 commits November 1, 2016 09:28

update Predictor and PredictorSuite

db83800

update copy() & del unused interface

1944cf1

update another copy

5b4f34a

add doc

810c973

zhengruifeng force-pushed the move_cast branch from 7cb4510 to 810c973 Compare November 1, 2016 01:47

asfgit closed this in 8ac0910 Nov 1, 2016

zhengruifeng deleted the move_cast branch November 2, 2016 01:32

zhengruifeng mentioned this pull request Nov 4, 2016

[SPARK-14709][ML] spark.ml API for linear SVM #15211

Closed

