[SPARK-17508][PYSPARK][ML] PySpark treat Param values None same as not setting the Param #15113

BryanCutler · 2016-09-15T20:19:27Z

What changes were proposed in this pull request?

In PySpark, if params are set with a value of None it will get converted by Py4J to a typed value and usually result in an error. For the case of string params, it will assign them a null value and lead to a java.lang.NullPointerException. From the user perspective, this is not the expected outcome and can be confusing.

This change causes PySpark to not set Params with a value of None, and they will be treated the same as not being set. If a Param is already set, then attempting to set a value of None will have no effect. This is mostly useful for a user that could be automating Param settings or wrapping classes that use Params.

How was this patch tested?

Added new unit tests for setting a Param to None and ran existing PySpark-ml tests.

BryanCutler · 2016-09-15T20:24:30Z

@srowen Here is the change that would treat PySpark Params with a value of None the same as not setting the Param in the first place. I also think this makes sense from a Python perspective, and I'll copy some others to see if there is a consensus.

cc @jkbradley @yanboliang

SparkQA · 2016-09-15T20:46:02Z

Test build #65458 has finished for PR 15113 at commit 53bd9d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-09-22T13:24:13Z

@BryanCutler Thanks for working on this. I'm a bit worried that if users set weightCol = None for Python means he would like to set weightCol = null for Scala.
The common scenario would be:

LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
LogisticRegression(maxIter=5, regParam=0.0, weightCol="")    
LogisticRegression(maxIter=5, regParam=0.0)

And users are unlikely to write the following code:

LogisticRegression(maxIter=5, regParam=0.0, weightCol=None)

However, if users actually do this, I'm more likely to consider that he will forcibly specify weightCol as null. So I vote to not change this behavior. Thanks.
cc @srowen @MLnick @jkbradley

yanboliang · 2016-09-22T13:33:20Z

Further more, weightCol=None in the Python API doc may be confused for users, I think we can add some annotations to clarify the meanings. Thanks.

BryanCutler · 2016-09-22T17:27:10Z

Hi @yanboliang , thanks for taking a look!
I'm more inclined to think that most Python users would think that weightCol=None does not cause a problem, as in this corresponding JIRA. Most are unaware that Py4J will convert the value to null behind the scenes. Let me show the LogisticRegression constructor as an example

    @keyword_only
    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                 maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True,
                 threshold=0.5, thresholds=None, probabilityCol="probability",
                 rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
                 aggregationDepth=2):

Here it shows maxIter=100 which means that the param maxIter has a default value of 100. Then it shows weightCol=None, but this does not mean that weightCol has a default value of null, it means that the param weightCol is not set by default. So why would it have a different effect if the user explicitly writes it?

holdenk · 2016-10-07T19:47:08Z

@yanboliang : I've certainly seen even Spark developers be tripped up with the handling of None by the Py4J bridge so I wouldn't be surprised if this is a more common problem than we might think.

ueshin · 2017-06-20T18:06:36Z

@BryanCutler Hi, are you still working on this?

BryanCutler · 2017-06-21T20:41:20Z

Hi @ueshin , there isn't anymore work for this, just a decision if this behavior should change. I still think it should and would like to hear some more opinions before I close this. Any thoughts @holdenk @jkbradley ?

holdenk · 2017-07-02T03:22:11Z

So I think our current handling of None is confusing, but I'd really like to see what the other committers have to say on this.

holdenk · 2018-04-13T18:09:40Z

Is this PR still needed?

BryanCutler · 2018-04-17T17:33:13Z

I still think this makes sense, but maybe I'm the minority. I'll go ahead and close it unless anyone else thinks it should be changed.

change setting of Param to ignore None values

53bd9d2

MLnick mentioned this pull request Jan 26, 2018

[SPARK-23234][ML][PYSPARK] Remove setting defaults on Java params #20410

Closed

BryanCutler closed this Apr 17, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-17508][PYSPARK][ML] PySpark treat Param values None same as not setting the Param #15113

[SPARK-17508][PYSPARK][ML] PySpark treat Param values None same as not setting the Param #15113

Uh oh!

BryanCutler commented Sep 15, 2016

Uh oh!

BryanCutler commented Sep 15, 2016

Uh oh!

SparkQA commented Sep 15, 2016

Uh oh!

yanboliang commented Sep 22, 2016 •

edited

Loading

Uh oh!

yanboliang commented Sep 22, 2016

Uh oh!

BryanCutler commented Sep 22, 2016

Uh oh!

holdenk commented Oct 7, 2016

Uh oh!

ueshin commented Jun 20, 2017

Uh oh!

BryanCutler commented Jun 21, 2017

Uh oh!

holdenk commented Jul 2, 2017

Uh oh!

holdenk commented Apr 13, 2018

Uh oh!

BryanCutler commented Apr 17, 2018

Uh oh!

Uh oh!

[SPARK-17508][PYSPARK][ML] PySpark treat Param values None same as not setting the Param #15113

[SPARK-17508][PYSPARK][ML] PySpark treat Param values None same as not setting the Param #15113

Uh oh!

Conversation

BryanCutler commented Sep 15, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

BryanCutler commented Sep 15, 2016

Uh oh!

SparkQA commented Sep 15, 2016

Uh oh!

yanboliang commented Sep 22, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yanboliang commented Sep 22, 2016

Uh oh!

BryanCutler commented Sep 22, 2016

Uh oh!

holdenk commented Oct 7, 2016

Uh oh!

ueshin commented Jun 20, 2017

Uh oh!

BryanCutler commented Jun 21, 2017

Uh oh!

holdenk commented Jul 2, 2017

Uh oh!

holdenk commented Apr 13, 2018

Uh oh!

BryanCutler commented Apr 17, 2018

Uh oh!

Uh oh!

yanboliang commented Sep 22, 2016 •

edited

Loading