[SPARK-19806][ML][PySpark] PySpark GeneralizedLinearRegression supports tweedie distribution. #17146

Closed
wants to merge 4 commits into from
@@ -66,7 +66,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
/**
* Param for the power in the variance function of the Tweedie distribution which provides
* the relationship between the variance and mean of the distribution.
* Only applicable for the Tweedie family.
* Only applicable to the Tweedie family.
* (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
* Tweedie Distribution (Wikipedia)</a>)
* Supported values: 0 and [1, Inf).
@@ -79,7 +79,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
final val variancePower: DoubleParam = new DoubleParam(this, "variancePower",
"The power in the variance function of the Tweedie distribution which characterizes " +
"the relationship between the variance and mean of the distribution. " +
"Only applicable for the Tweedie family. Supported values: 0 and [1, Inf).",
"Only applicable to the Tweedie family. Supported values: 0 and [1, Inf).",
(x: Double) => x >= 1.0 || x == 0.0)

/** @group getParam */
@@ -106,7 +106,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
def getLink: String = $(link)

/**
* Param for the index in the power link function. Only applicable for the Tweedie family.
* Param for the index in the power link function. Only applicable to the Tweedie family.
* Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt
* link, respectively.
* When not set, this value defaults to 1 - [[variancePower]], which matches the R "statmod"
@@ -116,7 +116,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
*/
@Since("2.2.0")
final val linkPower: DoubleParam = new DoubleParam(this, "linkPower",
"The index in the power link function. Only applicable for the Tweedie family.")
"The index in the power link function. Only applicable to the Tweedie family.")

/** @group getParam */
@Since("2.2.0")
61 changes: 53 additions & 8 deletions python/pyspark/ml/regression.py
@@ -1294,8 +1294,8 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha

Fit a Generalized Linear Model specified by giving a symbolic description of the linear
predictor (link function) and a description of the error distribution (family). It supports
"gaussian", "binomial", "poisson" and "gamma" as family. Valid link functions for each family
is listed below. The first link function of each family is the default one.
"gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for
each family is listed below. The first link function of each family is the default one.

* "gaussian" -> "identity", "log", "inverse"

@@ -1305,6 +1305,9 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha

* "gamma" -> "inverse", "identity", "log"

* "tweedie" -> power link function specified through "linkPower". \
The default link power in the tweedie family is 1 - variancePower.
Member

What happens when both variancePower and linkPower are set?

Contributor Author

It will produce a model according to the specified variancePower and linkPower. The doc here only explains which value linkPower takes when users don't specify it.
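To make the interaction between the two params concrete, here is a minimal pure-Python sketch of the rule described in the docs above. It is not Spark's implementation; the helper names `effective_link_power` and `power_link` are hypothetical and exist only for illustration.

```python
import math


def effective_link_power(variance_power, link_power=None):
    """Hypothetical helper: the link power used when fitting.

    An explicitly set link_power wins; otherwise the documented
    default 1 - variancePower applies.
    """
    if link_power is not None:
        return link_power
    return 1.0 - variance_power


def power_link(mu, link_power):
    """Hypothetical helper: power link g(mu) = mu ** link_power.

    Link power 0 is treated as the log link, matching the docs'
    note that 0, 1, -1 and 0.5 correspond to the Log, Identity,
    Inverse and Sqrt links.
    """
    if link_power == 0.0:
        return math.log(mu)
    return mu ** link_power
```

With `variancePower=1.6` and no `linkPower`, this sketch uses a link power of about -0.6; passing `link_power=-1.0` as well selects the inverse link regardless of the variance power.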


.. seealso:: `GLM <https://en.wikipedia.org/wiki/Generalized_linear_model>`_

>>> from pyspark.ml.linalg import Vectors
@@ -1344,40 +1347,54 @@ class GeneralizedLinearRegression(JavaEstimator, HasLabelCol, HasFeaturesCol, Ha

family = Param(Params._dummy(), "family", "The name of family which is a description of " +
"the error distribution to be used in the model. Supported options: " +
"gaussian (default), binomial, poisson and gamma.",
"gaussian (default), binomial, poisson, gamma and tweedie.",
typeConverter=TypeConverters.toString)
link = Param(Params._dummy(), "link", "The name of link function which provides the " +
"relationship between the linear predictor and the mean of the distribution " +
"function. Supported options: identity, log, inverse, logit, probit, cloglog " +
"and sqrt.", typeConverter=TypeConverters.toString)
linkPredictionCol = Param(Params._dummy(), "linkPredictionCol", "link prediction (linear " +
"predictor) column name", typeConverter=TypeConverters.toString)
variancePower = Param(Params._dummy(), "variancePower", "The power in the variance function " +
"of the Tweedie distribution which characterizes the relationship " +
"between the variance and mean of the distribution. Only applicable " +
"for the Tweedie family. Supported values: 0 and [1, Inf).",
typeConverter=TypeConverters.toFloat)
linkPower = Param(Params._dummy(), "linkPower", "The index in the power link function. " +
"Only applicable to the Tweedie family.",
typeConverter=TypeConverters.toFloat)

@keyword_only
def __init__(self, labelCol="label", featuresCol="features", predictionCol="prediction",
family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6,
Member

Is there a check to make sure link=None when family="Tweedie"?

Contributor Author

No. We allow users to set link even when the family is tweedie, since we can't disable the setter. In that case, however, any link value is ignored, and we log a warning telling users that link has no effect.

Member

hmm, it sounds like link should really be None then

Contributor Author

Yeah, there is no default value for link.
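The behaviour discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions, not the code in Spark: the function name `effective_link` and the wording of the note are hypothetical, and the note is returned to the caller rather than written to a log so the example stays self-contained.

```python
def effective_link(family, link=None, link_power=None, variance_power=0.0):
    """Hypothetical helper: describe the link that would actually be used.

    Returns a (description, notes) pair. For the tweedie family a named
    link is ignored and a note says so; the power link is used instead.
    """
    notes = []
    if family.lower() == "tweedie":
        if link is not None:
            notes.append("link=%r has no effect for the tweedie family" % link)
        # An explicit link power wins; otherwise fall back to 1 - variancePower.
        power = link_power if link_power is not None else 1.0 - variance_power
        return ("power", power), notes
    # Other families use the named link as given (None means the family default).
    return ("named", link), notes
```

For example, `effective_link("tweedie", link="log", link_power=-1.0)` reports the power link with index -1.0 together with one note about the ignored `link`, while `effective_link("gaussian", link="identity")` returns the named link and no notes.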

regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None):
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None,
variancePower=0.0, linkPower=None):
"""
__init__(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, \
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None)
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, \
variancePower=0.0, linkPower=None)
"""
super(GeneralizedLinearRegression, self).__init__()
self._java_obj = self._new_java_obj(
"org.apache.spark.ml.regression.GeneralizedLinearRegression", self.uid)
self._setDefault(family="gaussian", maxIter=25, tol=1e-6, regParam=0.0, solver="irls")
self._setDefault(family="gaussian", maxIter=25, tol=1e-6, regParam=0.0, solver="irls",
variancePower=0.0)
kwargs = self._input_kwargs

self.setParams(**kwargs)

@keyword_only
@since("2.0.0")
def setParams(self, labelCol="label", featuresCol="features", predictionCol="prediction",
family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6,
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None):
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None,
variancePower=0.0, linkPower=None):
"""
setParams(self, labelCol="label", featuresCol="features", predictionCol="prediction", \
family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, \
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None)
regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None, \
variancePower=0.0, linkPower=None)
Sets params for generalized linear regression.
"""
kwargs = self._input_kwargs
@@ -1428,6 +1445,34 @@ def getLink(self):
"""
return self.getOrDefault(self.link)

@since("2.2.0")
def setVariancePower(self, value):
"""
Sets the value of :py:attr:`variancePower`.
"""
return self._set(variancePower=value)

@since("2.2.0")
def getVariancePower(self):
"""
Gets the value of variancePower or its default value.
"""
return self.getOrDefault(self.variancePower)

@since("2.2.0")
def setLinkPower(self, value):
"""
Sets the value of :py:attr:`linkPower`.
"""
return self._set(linkPower=value)

@since("2.2.0")
def getLinkPower(self):
"""
Gets the value of linkPower or its default value.
"""
return self.getOrDefault(self.linkPower)


class GeneralizedLinearRegressionModel(JavaModel, JavaPredictionModel, JavaMLWritable,
JavaMLReadable):
20 changes: 20 additions & 0 deletions python/pyspark/ml/tests.py
@@ -1223,6 +1223,26 @@ def test_apply_binary_term_freqs(self):
": expected " + str(expected[i]) + ", got " + str(features[i]))


class GeneralizedLinearRegressionTest(SparkSessionTestCase):

def test_tweedie_distribution(self):

df = self.spark.createDataFrame(
[(1.0, Vectors.dense(0.0, 0.0)),
(1.0, Vectors.dense(1.0, 2.0)),
(2.0, Vectors.dense(0.0, 0.0)),
(2.0, Vectors.dense(1.0, 1.0)), ], ["label", "features"])

glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
model = glr.fit(df)
self.assertTrue(np.allclose(model.coefficients.toArray(), [-0.4645, 0.3402], atol=1E-4))
Contributor

I'm curious: where did the expected values come from?

Contributor Author

They were produced by R on the same input.

self.assertTrue(np.isclose(model.intercept, 0.7841, atol=1E-4))

model2 = glr.setLinkPower(-1.0).fit(df)
self.assertTrue(np.allclose(model2.coefficients.toArray(), [-0.6667, 0.5], atol=1E-4))
self.assertTrue(np.isclose(model2.intercept, 0.6667, atol=1E-4))
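The expected values in this test can also be sanity-checked by hand, without R or Spark. The sketch below assumes that the fitted means equal the per-group label means, which should hold here because the four rows contain only three distinct feature vectors and the model has three parameters (intercept plus two coefficients). Under that assumption the coefficients solve a small linear system on the link scale. The helper name `saturated_fit` is hypothetical and the algebra is specific to this test data.

```python
import math


def power_link(mu, link_power):
    """Power link g(mu) = mu ** link_power, with 0 meaning the log link."""
    if link_power == 0.0:
        return math.log(mu)
    return mu ** link_power


def saturated_fit(link_power):
    """Hypothetical helper: solve for (intercept, (b1, b2)) on the test data.

    Group means of the labels for each distinct feature vector:
      (0, 0) -> mean of 1.0 and 2.0 = 1.5
      (1, 2) -> 1.0
      (1, 1) -> 2.0
    """
    eta_00 = power_link(1.5, link_power)  # = intercept
    eta_12 = power_link(1.0, link_power)  # = intercept + b1 + 2 * b2
    eta_11 = power_link(2.0, link_power)  # = intercept + b1 + b2
    intercept = eta_00
    b2 = eta_12 - eta_11
    b1 = eta_11 - intercept - b2
    return intercept, (b1, b2)


# Default link power for variancePower=1.6 is 1 - 1.6; model2 uses -1.0.
default_fit = saturated_fit(1.0 - 1.6)
inverse_fit = saturated_fit(-1.0)
```

Under these assumptions, `default_fit` and `inverse_fit` agree with the coefficients and intercepts asserted in the test to within its tolerance of 1E-4.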


class ALSTest(SparkSessionTestCase):

def test_storage_levels(self):