-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegression supports tweedie distribution. #17146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #73810 has finished for PR 17146 at commit
|
Test build #73811 has finished for PR 17146 at commit
|
Test build #73812 has finished for PR 17146 at commit
|
python/pyspark/ml/regression.py
Outdated
"for the Tweedie family. Supported values: 0 and [1, Inf).", | ||
typeConverter=TypeConverters.toFloat) | ||
linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " + | ||
"Only applicable for the Tweedie family.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it should say applicable to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@@ -1305,6 +1305,9 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha | |||
|
|||
* "gamma" -> "inverse", "identity", "log" | |||
|
|||
* "tweedie" -> power link function specified through "linkPower". \ | |||
The default link power in the tweedie family is 1 - variancePower. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens when both variancePower ad linkPower is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will produce a model according to the specified variancePower
and linkPower
. The doc here is to explain the value of linkPower
if users don't specify.
typeConverter=TypeConverters.toFloat) | ||
linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " + | ||
"Only applicable for the Tweedie family.", | ||
typeConverter=TypeConverters.toFloat) | ||
|
||
@keyword_only | ||
def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", | ||
family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there check to make sure link=None when family="Tweedie"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, actually we allow users to set link
even if the family is tweedie
, we can't disable the set function. However, in this case, any link
value will be ignored, and we will print warning log to tell users link
will take no effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, it sounds like link
should really be None
then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there is no default value for link
.
Test build #74013 has finished for PR 17146 at commit
|
Test build #74014 has finished for PR 17146 at commit
|
@actuaryzhang would you take a look at this one. If recall, it's one option we considered for R API. |
Will take a look tonight. |
This looks good to me. Thanks |
|
||
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6) | ||
model = glr.fit(df) | ||
self.assertTrue(np.allclose(model.coefficients.toArray(), [-0.4645, 0.3402], atol=1E-4)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: where did the expected values come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are produced by R under the same input.
LGTM |
Merged into master. Thanks for reviewing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi thanks for the update and the great job.
I have used the 2.2 version and tried GLM with all default values http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.regression.GeneralizedLinearRegression .
the value 'link=None' raised and error. would it be possible to set link function as default (as suggested in documentation) when link=None?
@Antoinelypro Sorry for late response. Actually we have default value if users don't set link explicitly. Could you show the detail of your error case? Thanks. |
What changes were proposed in this pull request?
PySpark
GeneralizedLinearRegression
supports tweedie distribution.How was this patch tested?
Add unit tests.