[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading #22986

mgaido91 · 2018-11-08T15:29:57Z

What changes were proposed in this pull request?

Our GBTClassifier supports only variance impurity. But unfortunately, its impurity param by default contains the value gini: it is not even modifiable by the user and it differs from the actual impurity used, which is variance. This issue does not limit to a wrong value returned for it if the user queries by getImpurity, but it also affect the load of a saved model, as its impurityStats are created as gini (since this is the value stored for the model impurity) which leads to wrong featureImportances in model loaded from saved ones.

The PR changes the impurity param used to one which allows only the value variance.

How was this patch tested?

modified UT

SparkQA · 2018-11-08T15:42:55Z

Test build #98607 has finished for PR 22986 at commit 9babef5.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-11-08T16:51:37Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

@@ -538,7 +538,7 @@ private[ml] object GBTClassifierParams {
    Array("logistic").map(_.toLowerCase(Locale.ROOT))
 }

-private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams {
+private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams {


Why does it need regressor params to fix this? it sounds like it should extend TreeClassifierParams, just given the name

Yes, by the names seems so, but if you check their implementation, they both just add a impurity param with the only difference of the allowed values. In the case of TreeRegressorParams the only allowed value is variance (which is what GBTClassifier uses), in TreeClassifierParams the allowed values are gini (default) and entropy. So variance is not a valid value for the impurity param of TreeClassifierParams, so there is no way to set it properly for the GBTClassifier....

Surely that's not the ideal way to fix it though, semantically. Can't GBTClassifier override the impurity internally (or can we let it do so)?

To continue to extend TreeClassifierParams but be able to use correct impurity, seems we need to remove final from impurity in trait TreeClassifierParams...

~~I will try a small refactor creating a HasImpurity trait with the allowed values being overridden by its descendant. This way it should work. Let me know what you think about it. Thanks.~~

I tried but this is causing way too many MiMa issues.... Removing the final would be problematic as well, because of the setDefault...

Because in GBTClassifier the only valid value is variance. It is true that the setImpurity method throws an exception if someone tries to set it, but we are not safe against bad values as it happens now. And, most importantly TreeClassifierParams is extended by other classes, so we would allow the variance value to be set for RandomForest and DecisionTree, which is bad because it is not a valid value for them.

Yes, other classes have to restrict what's settable. I don't see the issue with bad values here? Is it serialized forms? At least this sounds like the normal OOP way to solve this. Other ways around this sound like hacks.

I mean restricting in the setImpurity method as you are suggesting is probably feasible, but I am not sure it is the right thing to do. I haven't seen that pattern anywhere else in the codebase. We set the allowed parameters in the Param itself usually. One of the reason is to avoid problems when you are setting the default for instance or when set is used (this happens for instance when loading a saved model, so we can load an invalid model): so I think the solution is not really safe.

I am not sure the solution you are proposing is cleaner. Also, before this PR there were 2 interfaces: TreeClassifierParams and TreeRegressorParams only for this reason, ie. for the impurity param to allow/have different values. We could switch to a single HasImpurity with your approach.

I looked more into this and see that setImpurity is deprecated. I presume the point is to use set(impurity, ...) instead. Yeah, that's no longer possible to override in subclasses, scratch that; overriding setImpurity would have been just fine IMHO but that's not going to last anyway.

I do agree this is why there are different traits for classifiers and regressors, but I don't think that means a classifier should extend TreeRegressorParams because its parameters happen to match.

One option is to let the definition of impurity itself be overridden. That seems OK. Or we could make a new 'VarianceClassifier' or something that defines this variance-only impurity parameter and let TreeRegressionParams and GBTClassifierParams extend it. It is a little wacky, but quite reasonable from an OOP perspective. I think.

@srowen yes, I'll try and create a VarianceClassifier trait. The only problem there is that it requires many MiMa exclusions, but I'll do that. Thanks.

SparkQA · 2018-11-08T20:24:59Z

Test build #98610 has finished for PR 22986 at commit c2be400.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-08T20:38:34Z

Test build #98609 has finished for PR 22986 at commit 7d24f33.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

surajnayak · 2018-11-09T17:59:47Z

Tried this patch. Seems this resolves the issue.

viirya · 2018-11-10T08:27:00Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

@@ -538,7 +538,7 @@ private[ml] object GBTClassifierParams {
    Array("logistic").map(_.toLowerCase(Locale.ROOT))
 }

-private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams {
+private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams {

  /**
   * Loss function which GBT tries to minimize. (case-insensitive)


Another issue is that GBTClassifier also logs wrong value of impurity param.

Yes, that's right. Currently we just saying that we are using gini, while we are using variance...

SparkQA · 2018-11-13T18:14:37Z

Test build #98780 has finished for PR 22986 at commit 176b0d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class MulticlassMetrics @Since(\"1.1.0\") (predAndLabelsWithOptWeight: RDD[_ <: Product])

SparkQA · 2018-11-13T20:45:15Z

Test build #98785 has finished for PR 22986 at commit 6b91211.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-14T13:42:40Z

Test build #98819 has finished for PR 22986 at commit 4aefc9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Yes this type of approach looks good

srowen · 2018-11-14T15:56:22Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

-/**
- * Parameters for Decision Tree-based regression algorithms.
- */
-private[ml] trait TreeRegressorParams extends Params {


You can actually keep this trait and extend HasVarianceImpurity. It would be a no-op like a few other traits here. Although 'redundant', would it avoid any MiMa warnings or very slightly improve compatibility, even if this is private[spark]?

actually this trait is still there (just some lines below). The MiMa warnings are there because the returned type of setImpurity returns this.type, which is now HasVarianceImpurity and the same for the setter...

Just looking at this one more time ... hm, isn't this.type always referring to the current class's type? maybe it's some detail of the generated code and this is essentially a false positive. Or, honestly setImpurity can just be removed in this change too. It's deprecated for all of these classes. I don't mind doing that later too; would just avoid these MiMa exclusions too.

yes, it is AFAIK. I think it is a false positive.

I wouldn't remove it in this change because since this is a bug, this PR can be backported to 2.4.1 too. I'd do it in a separate change. We would need anyway to add the MiMa rules for the missing methods, so it doesn't change much on the exclusions either...

That's true, but I don't know if we can back-port it because of the binary incompatibility, internal as it may be. I don't know. If it's not an issue then yes it can back port

I see. I am not sure how to verify that.

srowen

Ah right I wasn't reading carefully. OK.

srowen · 2018-11-17T15:46:51Z

Merged to master

## What changes were proposed in this pull request? Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones. The PR changes the `impurity` param used to one which allows only the value `variance`. ## How was this patch tested? modified UT Closes apache#22986 from mgaido91/SPARK-25959. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones. The PR changes the `impurity` param used to one which allows only the value `variance`. modified UT Closes apache#22986 from mgaido91/SPARK-25959. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> Change-Id: Ib0e24dbeb4f3a622622173b3038ce3d69a136bf0 Back ported to 2.4 by Dagang Wei. Change-Id: I9258aff57799f1e03ffa31436d75057bdbc18bcd

[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading

9babef5

mgaido91 added 2 commits November 8, 2018 16:57

fix mima

7d24f33

Merge branch 'master' of github.com:apache/spark into SPARK-25959

c2be400

srowen reviewed Nov 8, 2018

View reviewed changes

viirya reviewed Nov 10, 2018

View reviewed changes

mgaido91 added 2 commits November 13, 2018 15:21

Merge branch 'master' of github.com:apache/spark into SPARK-25959

176b0d8

address comments

6b91211

address comment

4aefc9f

srowen reviewed Nov 14, 2018

View reviewed changes

srowen approved these changes Nov 14, 2018

View reviewed changes

asfgit closed this in e00cac9 Nov 17, 2018

huaxingao mentioned this pull request Dec 5, 2018

[SPARK-24333][ML][PYTHON]Add fit with validation set to spark.ml GBT: Python API #21465

Closed

[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading #22986

[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading #22986

Uh oh!

Conversation

mgaido91 commented Nov 8, 2018

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Nov 8, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgaido91 Nov 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mgaido91 Nov 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Nov 8, 2018

Uh oh!

SparkQA commented Nov 8, 2018

Uh oh!

surajnayak commented Nov 9, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Nov 13, 2018

Uh oh!

SparkQA commented Nov 13, 2018

Uh oh!

SparkQA commented Nov 14, 2018

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

srowen commented Nov 17, 2018

Uh oh!

Uh oh!

mgaido91 Nov 10, 2018 •

edited

Loading

mgaido91 Nov 14, 2018 •

edited

Loading