[SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegression supports tweedie distribution. #17146

Closed
wants to merge 4 commits into from

yanboliang
Contributor

What changes were proposed in this pull request?

PySpark GeneralizedLinearRegression supports tweedie distribution.

How was this patch tested?

Add unit tests.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73810 has finished for PR 17146 at commit fcb5cfb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73811 has finished for PR 17146 at commit 99cbe35.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73812 has finished for PR 17146 at commit f414390.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

"for the Tweedie family. Supported values: 0 and [1, Inf).",
typeConverter=TypeConverters.toFloat)
linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " +
"Only applicable for the Tweedie family.",
Member

nit: I think it should say applicable to

Contributor Author

Updated.

@@ -1305,6 +1305,9 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha

* "gamma" -> "inverse", "identity", "log"

* "tweedie" -> power link function specified through "linkPower". \
The default link power in the tweedie family is 1 - variancePower.
Member

What happens when both variancePower and linkPower are set?

Contributor Author

It will produce a model according to the specified variancePower and linkPower. The doc here explains the value linkPower takes when users don't specify it.
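
As a rough illustration of the defaulting rule described above (not the Spark implementation), the sketch below uses a hypothetical helper, `effective_link_power`, that returns the user's `linkPower` when one is given and otherwise falls back to `1 - variancePower`. The helper name and signature are assumptions made for this example only.

```python
from typing import Optional


def effective_link_power(variance_power: float,
                         link_power: Optional[float] = None) -> float:
    """Hypothetical helper (not part of PySpark) sketching the documented default."""
    if link_power is not None:
        # Both parameters set: the explicit linkPower is used as given.
        return link_power
    # linkPower unset: the doc states the default is 1 - variancePower.
    return 1.0 - variance_power


# linkPower left unset, so it is derived from variancePower.
default_power = effective_link_power(1.6)

# Both set: the explicit value wins.
explicit_power = effective_link_power(1.6, link_power=0.0)
```

With `variancePower=1.6` and no `linkPower`, the derived link power is approximately -0.6 (subject to floating-point rounding).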

typeConverter=TypeConverters.toFloat)
linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " +
"Only applicable for the Tweedie family.",
typeConverter=TypeConverters.toFloat)

@keyword_only
def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6,
Member

Is there a check to make sure link=None when family="Tweedie"?

Contributor Author

No. We actually allow users to set link even if the family is tweedie, since we can't disable the setter. However, in this case any link value will be ignored, and we will log a warning to tell users that link has no effect.
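
A minimal sketch of that behaviour, assuming a hypothetical helper `resolve_link` and Python's standard `warnings` module in place of Spark's own logging; the helper, its message text, and the case-insensitive family comparison are illustrative assumptions, not code from this patch:

```python
import warnings
from typing import Optional


def resolve_link(family: str, link: Optional[str]) -> Optional[str]:
    """Hypothetical helper (not Spark code): drop `link` for the tweedie family."""
    # Assumption: the family name is compared case-insensitively.
    if family.lower() == "tweedie":
        if link is not None:
            # The set link is ignored; warn instead of raising an error.
            warnings.warn("link is ignored when family is 'tweedie'; "
                          "use linkPower instead.")
        return None
    return link


# Capture the warning emitted when a link is set together with tweedie.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_link("tweedie", "log")
```

Here `resolved` is `None` and one warning is recorded, while a call such as `resolve_link("gamma", "log")` passes the link through unchanged.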

Member

hmm, it sounds like link should really be None then

Contributor Author

Yeah, there is no default value for link.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74013 has finished for PR 17146 at commit eef5666.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74014 has finished for PR 17146 at commit fe1d3ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

@actuaryzhang would you take a look at this one? If I recall, it's one option we considered for the R API.

@actuaryzhang
Contributor

Will take a look tonight.

@actuaryzhang
Contributor

This looks good to me. Thanks


glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
model = glr.fit(df)
self.assertTrue(np.allclose(model.coefficients.toArray(), [-0.4645, 0.3402], atol=1E-4))
Contributor

I'm curious: where did the expected values come from?

Contributor Author

They are produced by R under the same input.
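
For readers unfamiliar with the tolerance used in the test, here is a pure-Python sketch of the elementwise condition that `np.allclose` checks, `|actual - expected| <= atol + rtol * |expected|`, with `atol=1E-4` as in the test and NumPy's default `rtol=1e-5`. The function `all_close` and the "fitted" numbers are made up for illustration; only the expected values come from the test above, and unlike NumPy this simplified version does not broadcast.

```python
def all_close(actual, expected, atol=1e-4, rtol=1e-5):
    """Simplified stand-in for np.allclose on two equal-length sequences."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# Reference coefficients quoted in the unit test (produced by R).
expected = [-0.4645, 0.3402]

# Made-up fitted values, for illustration only.
within = all_close([-0.46451, 0.34023], expected)   # differences of about 1e-5
outside = all_close([-0.4650, 0.3402], expected)    # first differs by 5e-4
```

The first comparison falls inside the tolerance and the second does not, which is the level of agreement with R that the test requires.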

@felixcheung
Member

LGTM

@yanboliang
Contributor Author

Merged into master. Thanks for reviewing.

@asfgit asfgit closed this in 81303f7 Mar 8, 2017
@yanboliang yanboliang deleted the spark-19806 branch March 8, 2017 10:12

@Antoinelypro Antoinelypro left a comment

Hi, thanks for the update and the great job.
I have used the 2.2 version and tried GLM with all default values http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.regression.GeneralizedLinearRegression .

The value 'link=None' raised an error. Would it be possible to use the default link function (as suggested in the documentation) when link=None?

@yanboliang
Contributor Author

@Antoinelypro Sorry for the late response. Actually, we do have a default value if users don't set link explicitly. Could you show the details of your error case? Thanks.
