[SPARK-24333][ML][PYTHON] Add fit with validation set to spark.ml GBT: Python API #21465
Conversation
Test build #91317 has finished for PR 21465 at commit
python/pyspark/ml/classification.py
Outdated
     @keyword_only
     def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                  maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
                  maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic",
                  maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0,
-                 featureSubsetStrategy="all"):
+                 featureSubsetStrategy="all", validationTol=0.01):
Shouldn't validationIndicatorCol be in init too? Set to None default?
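For concreteness, a sketch of the suggested signature (a sketch only; the defaults mirror the existing keyword arguments, with validationIndicatorCol=None as the proposed addition):

    @keyword_only
    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                 maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
                 maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic",
                 maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0,
                 featureSubsetStrategy="all", validationTol=0.01, validationIndicatorCol=None):
        ...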
@MLnick Yes, I should add it in init. Will change it now. Thanks a lot for your review!
Test build #91586 has finished for PR 21465 at commit
Test build #91798 has finished for PR 21465 at commit
Force-pushed from 4290b58 to 1169db8.
Test build #95893 has finished for PR 21465 at commit
Test build #98751 has finished for PR 21465 at commit
Test build #99013 has finished for PR 21465 at commit
Thanks @huaxingao for the PR! I think the new params should be added in GBTParams. While there, maybe you could add HasMaxIter and HasStepSize also, to match the Scala side.
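A rough sketch of that shared base class, assuming the usual pyspark.ml.param imports (Param, Params, TypeConverters) and pyspark.since; doc strings are abbreviated:

    class GBTParams(TreeEnsembleParams, HasMaxIter, HasStepSize, HasValidationIndicatorCol):
        """
        Private class to track supported GBT params, shared by GBTClassifier and GBTRegressor.
        """
        stepSize = Param(Params._dummy(), "stepSize",
                         "Step size (a.k.a. learning rate) in interval (0, 1] for shrinking "
                         "the contribution of each estimator.",
                         typeConverter=TypeConverters.toFloat)

        validationTol = Param(Params._dummy(), "validationTol",
                              "Threshold for stopping early when fit with validation is used. "
                              "If the error rate on the validation input changes by less than "
                              "the validationTol, then learning will stop early (before maxIter).",
                              typeConverter=TypeConverters.toFloat)

        @since("3.0.0")
        def getValidationTol(self):
            """
            Gets the value of validationTol or its default value.
            """
            return self.getOrDefault(self.validationTol)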
python/pyspark/ml/classification.py
Outdated
-                 GBTParams, HasCheckpointInterval, HasStepSize, HasSeed, JavaMLWritable,
-                 JavaMLReadable):
+                 GBTParams, HasCheckpointInterval, HasStepSize, HasSeed,
+                 HasValidationIndicatorCol, JavaMLWritable, JavaMLReadable):
I think this should be added to GBTParams, which is done on the Scala side too.
@BryanCutler Thank you very much for reviewing my PR. I moved HasValidationIndicatorCol, HasMaxIter and HasStepSize to GBTParams.
Test build #99136 has finished for PR 21465 at commit
Thanks @huaxingao, but let's also add GBTClassifierParams and GBTRegressorParams to handle lossType, as is done in Scala.
@@ -705,12 +705,38 @@ def getNumTrees(self):
         return self.getOrDefault(self.numTrees)


-class GBTParams(TreeEnsembleParams):
+class GBTParams(TreeEnsembleParams, HasMaxIter, HasStepSize, HasValidationIndicatorCol):
I like having a common GBTParams class; it was strange to have this defined in both estimators. But you should also define GBTClassifierParams and GBTRegressorParams, then put the supportedLossTypes in there so you don't need to override them later. You can also put the lossType param and getLossType() method there. This makes it clean and follows how it's done in Scala.
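A rough sketch of the classifier side (the regressor analog would carry supportedLossTypes = ["squared", "absolute"]); imports as above:

    class GBTClassifierParams(GBTParams):
        """
        Private class to track supported GBTClassifier params.
        """
        supportedLossTypes = ["logistic"]

        lossType = Param(Params._dummy(), "lossType",
                         "Loss function which GBT tries to minimize (case-insensitive). "
                         "Supported options: " + ", ".join(supportedLossTypes),
                         typeConverter=TypeConverters.toString)

        @since("1.4.0")
        def getLossType(self):
            """
            Gets the value of lossType or its default value.
            """
            return self.getOrDefault(self.lossType)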
python/pyspark/ml/regression.py
Outdated
-                 GBTParams, HasCheckpointInterval, HasStepSize, HasSeed, JavaMLWritable,
-                 JavaMLReadable, TreeRegressorParams):
+class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, GBTParams,
+                   HasCheckpointInterval, HasStepSize, HasSeed, JavaMLWritable, JavaMLReadable,
I think you can remove HasStepSize since it is in GBTParams.
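With the shared params in place, the class line could slim down to something like this (a sketch, assuming the GBTRegressorParams class discussed above):

    class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
                       GBTRegressorParams, HasCheckpointInterval, HasSeed,
                       JavaMLWritable, JavaMLReadable):
        ...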
Force-pushed from 88ff888 to c0fcbb3.
Test build #99408 has finished for PR 21465 at commit
Test build #99413 has finished for PR 21465 at commit
@huaxingao there are quite a lot of deviations from how these classes are in Scala; please follow how the class hierarchy is defined there and it should all fit together.
python/pyspark/ml/classification.py
Outdated
"Supported options: " + ", ".join(supportedLossTypes), | ||
typeConverter=TypeConverters.toString) | ||
|
||
@since("3.0.0") |
don't change the version, since we are just refactoring the base classes
please address the previous comment, to not change the since version since we are just refactoring the base class.
python/pyspark/ml/classification.py
Outdated
                 typeConverter=TypeConverters.toString)

    @since("3.0.0")
    def setLossType(self, value):
setLossType should be in the estimators, getLossType should be here.
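In other words, roughly:

    # In the shared params class (e.g. GBTClassifierParams): the getter only.
    @since("1.4.0")
    def getLossType(self):
        """
        Gets the value of lossType or its default value.
        """
        return self.getOrDefault(self.lossType)

    # In the estimator (GBTClassifier): the setter, since only the estimator mutates params.
    @since("1.4.0")
    def setLossType(self, value):
        """
        Sets the value of :py:attr:`lossType`.
        """
        return self._set(lossType=value)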
please address the above comment, this method should be in the estimator
@@ -1174,9 +1165,31 @@ def trees(self):
         return [DecisionTreeClassificationModel(m) for m in list(self._call_java("trees"))]


+class GBTClassifierParams(GBTParams, HasVarianceImpurity):
this should extend TreeClassifierParams
@BryanCutler Thanks for your review. It seems #22986 recently added the trait HasVarianceImpurity and made:

    private[ml] trait GBTClassifierParams extends GBTParams with HasVarianceImpurity
Ah, I see. Let me take another look.
Yeah, you're correct, this is fine
@@ -650,19 +650,20 @@ def getFeatureSubsetStrategy(self):
         return self.getOrDefault(self.featureSubsetStrategy)


-class TreeRegressorParams(Params):
+class HasVarianceImpurity(Params):
This shouldn't be changed; impurity is different for regression and classification, so the param needs to be defined in TreeRegressorParams and TreeClassifierParams, as it was already.

This is correct and matches Scala currently.
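For reference, the Python mirror of that trait would look roughly like this (a sketch; imports as in pyspark.ml.param):

    class HasVarianceImpurity(Params):
        """
        Private class to track supported impurity measures.
        """
        supportedImpurities = ["variance"]

        impurity = Param(Params._dummy(), "impurity",
                         "Criterion used for information gain calculation (case-insensitive). "
                         "Supported options: " + ", ".join(supportedImpurities),
                         typeConverter=TypeConverters.toString)

        @since("1.4.0")
        def getImpurity(self):
            """
            Gets the value of impurity or its default value.
            """
            return self.getOrDefault(self.impurity)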
     @keyword_only
     def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                  maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
                  maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic",
-                 maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0,
-                 featureSubsetStrategy="all"):
+                 maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0, impurity="variance",
this is not the correct default impurity

default value has been changed in Scala, this is correct
Please look at some of my previous comments and fix those, then I think it will be good to go, thanks!
python/pyspark/ml/regression.py
Outdated
                 typeConverter=TypeConverters.toFloat)

    @since("3.0.0")
    def setValidationTol(self, value):
It seems Scala does not have this API, right? If not, then let's remove it here for now.
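Without the setter, validationTol would still be settable through the constructor or setParams, for example:

    from pyspark.ml.regression import GBTRegressor

    # Hypothetical usage; "isVal" is an illustrative validation-indicator column name.
    gbt = GBTRegressor(validationIndicatorCol="isVal", validationTol=0.01)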
python/pyspark/ml/regression.py
Outdated
                 typeConverter=TypeConverters.toString)

    @since("1.4.0")
    def setLossType(self, value):
setLossType should be in the estimator and getLossType should be here.
@BryanCutler Thank you very much for your review! I will submit changes soon.
Test build #99744 has finished for PR 21465 at commit
python/pyspark/ml/param/shared.py
Outdated
@@ -814,3 +814,25 @@ def getDistanceMeasure(self):
         """
         return self.getOrDefault(self.distanceMeasure)


+class HasValidationIndicatorCol(Params):
Would you mind running the codegen again, for example:

    pushd python/pyspark/ml/param/ && python _shared_params_code_gen.py > shared.py && popd

and pushing the result if there is a diff? I think the DecisionTreeParams should be at the bottom of the file.
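For anyone following along: shared params are declared as (name, doc, default, converter) entries in _shared_params_code_gen.py, and the script regenerates shared.py wholesale, so hand edits there get overwritten. The regenerated mixin should come out roughly like:

    class HasValidationIndicatorCol(Params):
        """
        Mixin for param validationIndicatorCol.
        """
        validationIndicatorCol = Param(Params._dummy(), "validationIndicatorCol",
                                       "name of the column that indicates whether each row is for "
                                       "training or for validation. False indicates training; "
                                       "true indicates validation.",
                                       typeConverter=TypeConverters.toString)

        def __init__(self):
            super(HasValidationIndicatorCol, self).__init__()

        def getValidationIndicatorCol(self):
            """
            Gets the value of validationIndicatorCol or its default value.
            """
            return self.getOrDefault(self.validationIndicatorCol)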
You are right. DecisionTreeParams should be at the bottom.
Test build #99838 has finished for PR 21465 at commit
LGTM
Merged to master, thanks @huaxingao!
@BryanCutler Thank you very much for your help!
[SPARK-24333][ML][PYTHON] Add fit with validation set to spark.ml GBT: Python API

## What changes were proposed in this pull request?
Add validationIndicatorCol and validationTol to GBT Python.

## How was this patch tested?
Add test in doctest to test the new API.

Closes apache#21465 from huaxingao/spark-24333.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
What changes were proposed in this pull request?
Add validationIndicatorCol and validationTol to GBT Python.
How was this patch tested?
Add test in doctest to test the new API.
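As a quick sketch of the resulting API (column values and names here are illustrative, not the PR's actual doctest):

    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # "isVal" marks validation rows: False = train on the row, True = use it for validation.
    df = spark.createDataFrame([
        (Vectors.dense(0.0), 0.0, False),
        (Vectors.dense(1.0), 1.0, False),
        (Vectors.dense(2.0), 1.0, True),
    ], ["features", "label", "isVal"])

    gbt = GBTClassifier(maxIter=20, validationIndicatorCol="isVal", validationTol=0.01)
    # Training stops early once the validation error changes by less than validationTol.
    model = gbt.fit(df)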