-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading #22986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #98607 has finished for PR 22986 at commit
|
@@ -538,7 +538,7 @@ private[ml] object GBTClassifierParams { | |||
Array("logistic").map(_.toLowerCase(Locale.ROOT)) | |||
} | |||
|
|||
private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams { | |||
private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it need regressor params to fix this? it sounds like it should extend TreeClassifierParams, just given the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, by the names seems so, but if you check their implementation, they both just add a impurity
param with the only difference of the allowed values. In the case of TreeRegressorParams
the only allowed value is variance
(which is what GBTClassifier
uses), in TreeClassifierParams
the allowed values are gini
(default) and entropy
. So variance
is not a valid value for the impurity
param of TreeClassifierParams
, so there is no way to set it properly for the GBTClassifier
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely that's not the ideal way to fix it though, semantically. Can't GBTClassifier override the impurity internally (or can we let it do so)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To continue to extend TreeClassifierParams
but be able to use correct impurity
, seems we need to remove final
from impurity
in trait TreeClassifierParams
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try a small refactor creating a HasImpurity
trait with the allowed values being overridden by its descendant. This way it should work. Let me know what you think about it. Thanks.
I tried but this is causing way too many MiMa issues.... Removing the final would be problematic as well, because of the setDefault
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because in GBTClassifier the only valid value is variance
. It is true that the setImpurity
method throws an exception if someone tries to set it, but we are not safe against bad values as it happens now. And, most importantly TreeClassifierParams
is extended by other classes, so we would allow the variance
value to be set for RandomForest and DecisionTree, which is bad because it is not a valid value for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, other classes have to restrict what's settable. I don't see the issue with bad values here? Is it serialized forms? At least this sounds like the normal OOP way to solve this. Other ways around this sound like hacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean restricting in the setImpurity
method as you are suggesting is probably feasible, but I am not sure it is the right thing to do. I haven't seen that pattern anywhere else in the codebase. We set the allowed parameters in the Param itself usually. One of the reason is to avoid problems when you are setting the default for instance or when set
is used (this happens for instance when loading a saved model, so we can load an invalid model): so I think the solution is not really safe.
I am not sure the solution you are proposing is cleaner. Also, before this PR there were 2 interfaces: TreeClassifierParams
and TreeRegressorParams
only for this reason, ie. for the impurity param to allow/have different values. We could switch to a single HasImpurity
with your approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked more into this and see that setImpurity is deprecated. I presume the point is to use set(impurity, ...) instead. Yeah, that's no longer possible to override in subclasses, scratch that; overriding setImpurity would have been just fine IMHO but that's not going to last anyway.
I do agree this is why there are different traits for classifiers and regressors, but I don't think that means a classifier should extend TreeRegressorParams because its parameters happen to match.
One option is to let the definition of impurity
itself be overridden. That seems OK. Or we could make a new 'VarianceClassifier' or something that defines this variance-only impurity parameter and let TreeRegressionParams and GBTClassifierParams extend it. It is a little wacky, but quite reasonable from an OOP perspective. I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen yes, I'll try and create a VarianceClassifier
trait. The only problem there is that it requires many MiMa exclusions, but I'll do that. Thanks.
Test build #98610 has finished for PR 22986 at commit
|
Test build #98609 has finished for PR 22986 at commit
|
Tried this patch. Seems this resolves the issue. |
@@ -538,7 +538,7 @@ private[ml] object GBTClassifierParams { | |||
Array("logistic").map(_.toLowerCase(Locale.ROOT)) | |||
} | |||
|
|||
private[ml] trait GBTClassifierParams extends GBTParams with TreeClassifierParams { | |||
private[ml] trait GBTClassifierParams extends GBTParams with TreeRegressorParams { | |||
|
|||
/** | |||
* Loss function which GBT tries to minimize. (case-insensitive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue is that GBTClassifier
also logs wrong value of impurity
param.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. Currently we just saying that we are using gini
, while we are using variance
...
Test build #98780 has finished for PR 22986 at commit
|
Test build #98785 has finished for PR 22986 at commit
|
Test build #98819 has finished for PR 22986 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this type of approach looks good
/** | ||
* Parameters for Decision Tree-based regression algorithms. | ||
*/ | ||
private[ml] trait TreeRegressorParams extends Params { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can actually keep this trait and extend HasVarianceImpurity. It would be a no-op like a few other traits here. Although 'redundant', would it avoid any MiMa warnings or very slightly improve compatibility, even if this is private[spark]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this trait is still there (just some lines below). The MiMa warnings are there because the returned type of setImpurity
returns this.type
, which is now HasVarianceImpurity
and the same for the setter...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at this one more time ... hm, isn't this.type
always referring to the current class's type? maybe it's some detail of the generated code and this is essentially a false positive. Or, honestly setImpurity can just be removed in this change too. It's deprecated for all of these classes. I don't mind doing that later too; would just avoid these MiMa exclusions too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is AFAIK. I think it is a false positive.
I wouldn't remove it in this change because since this is a bug, this PR can be backported to 2.4.1 too. I'd do it in a separate change. We would need anyway to add the MiMa rules for the missing methods, so it doesn't change much on the exclusions either...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but I don't know if we can back-port it because of the binary incompatibility, internal as it may be. I don't know. If it's not an issue then yes it can back port
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I am not sure how to verify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right I wasn't reading carefully. OK.
Merged to master |
## What changes were proposed in this pull request? Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones. The PR changes the `impurity` param used to one which allows only the value `variance`. ## How was this patch tested? modified UT Closes apache#22986 from mgaido91/SPARK-25959. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
Our `GBTClassifier` supports only `variance` impurity. But unfortunately, its `impurity` param by default contains the value `gini`: it is not even modifiable by the user and it differs from the actual impurity used, which is `variance`. This issue does not limit to a wrong value returned for it if the user queries by `getImpurity`, but it also affect the load of a saved model, as its `impurityStats` are created as `gini` (since this is the value stored for the model impurity) which leads to wrong `featureImportances` in model loaded from saved ones. The PR changes the `impurity` param used to one which allows only the value `variance`. modified UT Closes apache#22986 from mgaido91/SPARK-25959. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> Change-Id: Ib0e24dbeb4f3a622622173b3038ce3d69a136bf0 Back ported to 2.4 by Dagang Wei. Change-Id: I9258aff57799f1e03ffa31436d75057bdbc18bcd
What changes were proposed in this pull request?
Our
GBTClassifier
supports onlyvariance
impurity. But unfortunately, itsimpurity
param by default contains the valuegini
: it is not even modifiable by the user and it differs from the actual impurity used, which isvariance
. This issue does not limit to a wrong value returned for it if the user queries bygetImpurity
, but it also affect the load of a saved model, as itsimpurityStats
are created asgini
(since this is the value stored for the model impurity) which leads to wrongfeatureImportances
in model loaded from saved ones.The PR changes the
impurity
param used to one which allows only the valuevariance
.How was this patch tested?
modified UT