Skip to content

[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading #22986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Nov 8, 2018

What changes were proposed in this pull request?

Our GBTClassifier supports only variance impurity. But unfortunately, its impurity param by default contains the value gini: it is not even modifiable by the user and it differs from the actual impurity used, which is variance. This issue does not limit to a wrong value returned for it if the user queries by getImpurity, but it also affect the load of a saved model, as its impurityStats are created as gini (since this is the value stored for the model impurity) which leads to wrong featureImportances in model loaded from saved ones.

The PR changes the impurity param used to one which allows only the value variance.

How was this patch tested?

modified UT

@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98607 has finished for PR 22986 at commit 9babef5.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -538,7 +538,7 @@ private[ml] object GBTClassifierParams {
Array("logistic").map(_.toLowerCase(Locale.ROOT))
}

private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams {
private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need regressor params to fix this? it sounds like it should extend TreeClassifierParams, just given the name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, by the names seems so, but if you check their implementation, they both just add a impurity param with the only difference of the allowed values. In the case of TreeRegressorParams the only allowed value is variance (which is what GBTClassifier uses), in TreeClassifierParams the allowed values are gini (default) and entropy. So variance is not a valid value for the impurity param of TreeClassifierParams, so there is no way to set it properly for the GBTClassifier....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely that's not the ideal way to fix it though, semantically. Can't GBTClassifier override the impurity internally (or can we let it do so)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To continue to extend TreeClassifierParams but be able to use correct impurity, seems we need to remove final from impurity in trait TreeClassifierParams...

Copy link
Contributor Author

@mgaido91 mgaido91 Nov 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try a small refactor creating a HasImpurity trait with the allowed values being overridden by its descendant. This way it should work. Let me know what you think about it. Thanks.

I tried but this is causing way too many MiMa issues.... Removing the final would be problematic as well, because of the setDefault...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in GBTClassifier the only valid value is variance. It is true that the setImpurity method throws an exception if someone tries to set it, but we are not safe against bad values as it happens now. And, most importantly TreeClassifierParams is extended by other classes, so we would allow the variance value to be set for RandomForest and DecisionTree, which is bad because it is not a valid value for them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, other classes have to restrict what's settable. I don't see the issue with bad values here? Is it serialized forms? At least this sounds like the normal OOP way to solve this. Other ways around this sound like hacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean restricting in the setImpurity method as you are suggesting is probably feasible, but I am not sure it is the right thing to do. I haven't seen that pattern anywhere else in the codebase. We set the allowed parameters in the Param itself usually. One of the reason is to avoid problems when you are setting the default for instance or when set is used (this happens for instance when loading a saved model, so we can load an invalid model): so I think the solution is not really safe.

I am not sure the solution you are proposing is cleaner. Also, before this PR there were 2 interfaces: TreeClassifierParams and TreeRegressorParams only for this reason, ie. for the impurity param to allow/have different values. We could switch to a single HasImpurity with your approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked more into this and see that setImpurity is deprecated. I presume the point is to use set(impurity, ...) instead. Yeah, that's no longer possible to override in subclasses, scratch that; overriding setImpurity would have been just fine IMHO but that's not going to last anyway.

I do agree this is why there are different traits for classifiers and regressors, but I don't think that means a classifier should extend TreeRegressorParams because its parameters happen to match.

One option is to let the definition of impurity itself be overridden. That seems OK. Or we could make a new 'VarianceClassifier' or something that defines this variance-only impurity parameter and let TreeRegressionParams and GBTClassifierParams extend it. It is a little wacky, but quite reasonable from an OOP perspective. I think.

Copy link
Contributor Author

@mgaido91 mgaido91 Nov 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen yes, I'll try and create a VarianceClassifier trait. The only problem there is that it requires many MiMa exclusions, but I'll do that. Thanks.

@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98610 has finished for PR 22986 at commit c2be400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98609 has finished for PR 22986 at commit 7d24f33.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@surajnayak
Copy link

Tried this patch. Seems this resolves the issue.

@@ -538,7 +538,7 @@ private[ml] object GBTClassifierParams {
Array("logistic").map(_.toLowerCase(Locale.ROOT))
}

private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams {
private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams {

/**
* Loss function which GBT tries to minimize. (case-insensitive)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue is that GBTClassifier also logs wrong value of impurity param.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. Currently we just saying that we are using gini, while we are using variance...

@SparkQA
Copy link

SparkQA commented Nov 13, 2018

Test build #98780 has finished for PR 22986 at commit 176b0d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MulticlassMetrics @Since(\"1.1.0\") (predAndLabelsWithOptWeight: RDD[_ <: Product])

@SparkQA
Copy link

SparkQA commented Nov 13, 2018

Test build #98785 has finished for PR 22986 at commit 6b91211.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 14, 2018

Test build #98819 has finished for PR 22986 at commit 4aefc9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this type of approach looks good

/**
* Parameters for Decision Tree-based regression algorithms.
*/
private[ml] trait TreeRegressorParams extends Params {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can actually keep this trait and extend HasVarianceImpurity. It would be a no-op like a few other traits here. Although 'redundant', would it avoid any MiMa warnings or very slightly improve compatibility, even if this is private[spark]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this trait is still there (just some lines below). The MiMa warnings are there because the returned type of setImpurity returns this.type, which is now HasVarianceImpurity and the same for the setter...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at this one more time ... hm, isn't this.type always referring to the current class's type? maybe it's some detail of the generated code and this is essentially a false positive. Or, honestly setImpurity can just be removed in this change too. It's deprecated for all of these classes. I don't mind doing that later too; would just avoid these MiMa exclusions too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is AFAIK. I think it is a false positive.

I wouldn't remove it in this change because since this is a bug, this PR can be backported to 2.4.1 too. I'd do it in a separate change. We would need anyway to add the MiMa rules for the missing methods, so it doesn't change much on the exclusions either...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but I don't know if we can back-port it because of the binary incompatibility, internal as it may be. I don't know. If it's not an issue then yes it can back port

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I am not sure how to verify that.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right I wasn't reading carefully. OK.

@srowen
Copy link
Member

srowen commented Nov 17, 2018

Merged to master

@asfgit asfgit closed this in e00cac9 Nov 17, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones.

The PR changes the `impurity` param used to one which allows only the value `variance`.

## How was this patch tested?

modified UT

Closes apache#22986 from mgaido91/SPARK-25959.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Gaurangi94 pushed a commit to Gaurangi94/spark that referenced this pull request Sep 2, 2020
Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones.

The PR changes the `impurity` param used to one which allows only the value `variance`.

modified UT

Closes apache#22986 from mgaido91/SPARK-25959.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Change-Id: Ib0e24dbeb4f3a622622173b3038ce3d69a136bf0

Back ported to 2.4 by Dagang Wei.

Change-Id: I9258aff57799f1e03ffa31436d75057bdbc18bcd
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants