Skip to content

[SPARK-17508][PYSPARK][ML] PySpark treat Param values None same as not setting the Param #15113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

BryanCutler
Copy link
Member

What changes were proposed in this pull request?

In PySpark, if params are set with a value of None it will get converted by Py4J to a typed value and usually result in an error. For the case of string params, it will assign them a null value and lead to a java.lang.NullPointerException. From the user perspective, this is not the expected outcome and can be confusing.

This change causes PySpark to not set Params with a value of None, and they will be treated the same as not being set. If a Param is already set, then attempting to set a value of None will have no effect. This is mostly useful for a user that could be automating Param settings or wrapping classes that use Params.

How was this patch tested?

Added new unit tests for setting a Param to None and ran existing PySpark-ml tests.

@BryanCutler
Copy link
Member Author

@srowen Here is the change that would treat PySpark Params with a value of None the same as not setting the Param in the first place. I also think this makes sense from a Python perspective, and I'll copy some others to see if there is a consensus.

cc @jkbradley @yanboliang

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #65458 has finished for PR 15113 at commit 53bd9d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

yanboliang commented Sep 22, 2016

@BryanCutler Thanks for working on this. I'm a bit worried that if users set weightCol = None for Python means he would like to set weightCol = null for Scala.
The common scenario would be:

LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
LogisticRegression(maxIter=5, regParam=0.0, weightCol="")    
LogisticRegression(maxIter=5, regParam=0.0)  

And users are unlikely to write the following code:

LogisticRegression(maxIter=5, regParam=0.0, weightCol=None)    

However, if users actually do this, I'm more likely to consider that he will forcibly specify weightCol as null. So I vote to not change this behavior. Thanks.
cc @srowen @MLnick @jkbradley

@yanboliang
Copy link
Contributor

Further more, weightCol=None in the Python API doc may be confused for users, I think we can add some annotations to clarify the meanings. Thanks.

@BryanCutler
Copy link
Member Author

Hi @yanboliang , thanks for taking a look!
I'm more inclined to think that most Python users would think that weightCol=None does not cause a problem, as in this corresponding JIRA. Most are unaware that Py4J will convert the value to null behind the scenes. Let me show the LogisticRegression constructor as an example

    @keyword_only
    def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
                 maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True,
                 threshold=0.5, thresholds=None, probabilityCol="probability",
                 rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
                 aggregationDepth=2):

Here it shows maxIter=100 which means that the param maxIter has a default value of 100. Then it shows weightCol=None, but this does not mean that weightCol has a default value of null, it means that the param weightCol is not set by default. So why would it have a different effect if the user explicitly writes it?

@holdenk
Copy link
Contributor

holdenk commented Oct 7, 2016

@yanboliang : I've certainly seen even Spark developers be tripped up with the handling of None by the Py4J bridge so I wouldn't be surprised if this is a more common problem than we might think.

@ueshin
Copy link
Member

ueshin commented Jun 20, 2017

@BryanCutler Hi, are you still working on this?

@BryanCutler
Copy link
Member Author

Hi @ueshin , there isn't anymore work for this, just a decision if this behavior should change. I still think it should and would like to hear some more opinions before I close this. Any thoughts @holdenk @jkbradley ?

@holdenk
Copy link
Contributor

holdenk commented Jul 2, 2017

So I think our current handling of None is confusing, but I'd really like to see what the other committers have to say on this.

@holdenk
Copy link
Contributor

holdenk commented Apr 13, 2018

Is this PR still needed?

@BryanCutler
Copy link
Member Author

I still think this makes sense, but maybe I'm the minority. I'll go ahead and close it unless anyone else thinks it should be changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants