[SPARK-20498][PYSPARK][ML] Expose getMaxDepth for ensemble tree model in PySpark #18120


Closed
wants to merge 2 commits into from

Conversation

facaiy
Contributor

@facaiy facaiy commented May 26, 2017

What changes were proposed in this pull request?

Add a getMaxDepth method for the ensemble tree models:

  • RandomForestClassifierModel
  • RandomForestRegressionModel
  • GBTClassificationModel
  • GBTRegressionModel

How was this patch tested?

  • Passes all existing unit tests.
  • Adds a new doctest.

@facaiy
Contributor Author

facaiy commented May 26, 2017

@keypointt Hi, could you help check whether this PR is consistent with your #17207? Thanks.

@AmplabJenkins

Can one of the admins verify this patch?

@keypointt
Contributor

Hi Facai, you exposed the API getMaxDepth() in TreeEnsembleModel, but the rest of the changes are in comments and are not being run.

I'd suggest you put the tests in a test module and run them to verify.

@facaiy
Contributor Author

facaiy commented May 27, 2017

Hi, @keypointt. This is a feature of Python: a doctest serves as both documentation and a unit test.
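To illustrate the point (this is a toy stand-in, not the actual PySpark class), here is a minimal self-contained example of a doctest doubling as documentation and a unit test:

```python
import doctest

class ToyTreeModel:
    """A toy stand-in for an ensemble tree model (not the real PySpark class).

    The examples below are rendered in the docs AND executed as tests:

    >>> model = ToyTreeModel(maxDepth=3)
    >>> model.getMaxDepth()
    3
    """
    def __init__(self, maxDepth=5):
        self._maxDepth = maxDepth

    def getMaxDepth(self):
        # Return the stored maxDepth param value.
        return self._maxDepth

if __name__ == "__main__":
    # Running the module executes the docstring examples as tests,
    # much like Spark's own doctest runner does for pyspark.ml modules.
    results = doctest.testmod()
    print(results.failed)  # 0 when all doctest examples pass
```

Spark's Python modules run their doctests as part of the test suite, which is why a new doctest counts as test coverage here.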

@keypointt
Contributor

keypointt commented May 27, 2017 via email

@sethah
Contributor

sethah commented May 30, 2017

cc @BryanCutler.

Bryan did some work on #17849. It seems even with that patch, we still need to add methods like these, hoping Bryan can confirm. If we're going to add some param accessors to the models, best to do them all at once yes? E.g. we can also add getMaxBins and others.
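The "do them all at once" idea could be sketched as generating one getter per param name in a single pass, instead of hand-writing getMaxDepth, getMaxBins, and so on individually. The following is a hypothetical illustration, not Spark's actual mechanism; `add_param_getters`, `ToyModel`, and `_paramMap` are made-up names:

```python
# Hypothetical sketch: attach a get<Name>() accessor for each param name,
# so getMaxDepth, getMaxBins, ... need not be written by hand one by one.

def add_param_getters(cls, param_names):
    """Attach a get<Name>() accessor for each param name to cls."""
    for name in param_names:
        getter_name = "get" + name[0].upper() + name[1:]

        def make_getter(param=name):
            # Default-argument trick binds the current name into the closure.
            def getter(self):
                # Look the value up in the instance's param map.
                return self._paramMap[param]
            return getter

        setattr(cls, getter_name, make_getter())
    return cls

class ToyModel:
    def __init__(self, **params):
        self._paramMap = params

add_param_getters(ToyModel, ["maxDepth", "maxBins"])

m = ToyModel(maxDepth=4, maxBins=32)
print(m.getMaxDepth(), m.getMaxBins())  # 4 32
```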

@BryanCutler
Member

Thanks @facaiy for the PR. This might be enough to simply retrieve the value from the Java model, but I think the Python model also needs to "own" the param. For example, if we have a DecisionTreeRegressor called dt and a DecisionTreeRegressionModel called model then

In [8]: dt.hasParam("maxDepth")
Out[8]: True

In [9]: model.hasParam("maxDepth")
Out[9]: False

This is because the Python object does not have an instance of the param; it's only getting a value from the Java model. Additionally, many of the methods you would expect to work from class Params would raise an error like

In [4]: dt.explainParam("maxDepth")
Out[4]: 'maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5, current: 2)'

In [5]: model.explainParam("maxDepth")
...
AttributeError: 'DecisionTreeRegressionModel' object has no attribute 'maxDepth'

As @sethah pointed out #17849 has the fix so that the Python models would have an instance of each param, so that should go in first. Then, the accessor could be written like this:

def getMaxDepth(self):
    return self.getOrDefault(self.maxDepth)

I'm not sure what the best approach is for adding these accessors: all at once, or one by one as needed, as with maxDepth?

cc @holdenk @jkbradley for your input
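The param-ownership pattern described above can be sketched with simplified stand-in classes. These only mimic the shape of `pyspark.ml.param` (the `Param`/`HasMaxDepth` classes and `_paramMap`/`_defaultParamMap` internals here are illustrative, not the real API): the model owns a Param instance, and `getOrDefault` prefers an explicitly set value over the default.

```python
# Simplified sketch of the Params/getOrDefault pattern described above.
# These classes only mimic pyspark.ml.param; they are not the real API.

class Param:
    def __init__(self, name, doc):
        self.name = name
        self.doc = doc

class HasMaxDepth:
    # The model class *owns* the param, so hasParam("maxDepth") is True.
    maxDepth = Param("maxDepth", "Maximum depth of the tree. (>= 0)")

    def __init__(self):
        self._paramMap = {}                          # explicitly set values
        self._defaultParamMap = {self.maxDepth: 5}   # defaults

    def hasParam(self, name):
        return isinstance(getattr(type(self), name, None), Param)

    def getOrDefault(self, param):
        # Prefer an explicitly set value, fall back to the default.
        if param in self._paramMap:
            return self._paramMap[param]
        return self._defaultParamMap[param]

    def getMaxDepth(self):
        return self.getOrDefault(self.maxDepth)

model = HasMaxDepth()
print(model.hasParam("maxDepth"), model.getMaxDepth())  # True 5
model._paramMap[HasMaxDepth.maxDepth] = 2
print(model.getMaxDepth())  # 2
```

With param instances copied onto the model (as in #17849), the accessor reduces to the one-line `getOrDefault` call shown above.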

@facaiy
Contributor Author

facaiy commented May 31, 2017

Thanks, @BryanCutler.
It seems that #17849 copies Params from the Estimator to the Model automatically, which is pretty useful. However, the getter methods are still missing and need to be added manually one by one, as in this PR.

In my opinion, it would be better if the model could inherit both the params and their getter methods.

Anyway, it's fine whether or not this PR gets merged, as the change is trivial.

@facaiy facaiy closed this Aug 26, 2017
@facaiy facaiy deleted the ENH/pyspark_rf_add_get_max_depth branch August 26, 2017 09:43
5 participants