-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-11938][ML] Expose numFeatures in all ML PredictionModel for Py… #9936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eaf149a
to
25475ba
Compare
Test build #46609 has finished for PR 9936 at commit
|
@mengxr Could you review it? Thanks. |
@since("1.6.0") | ||
def numFeatures(self): | ||
""" | ||
The number of features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The number of features" -> "The number of features used to train the model."
I think this is more explicit since, for things like decision trees, this property is what number of features were used in training and not necessarily what number of features the model incorporates in its prediction.
Nit: add period to follow docstring convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah Thank you. I updated!
Test build #47029 has finished for PR 9936 at commit
|
@@ -371,6 +378,126 @@ def test_fit_maximize_metric(self): | |||
self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1") | |||
|
|||
|
|||
class RegressorTest(PySparkTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we test numFeatures
at each model that we can save cost of creating DataFrame and training model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each trainer output corresponding models. So we cannot share training code but creating DataFrames can be separated each RegressorTest
and ClassificationTest
. I'll update so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lewuathe Sorry for late response. What I means is that we can add test for numFeatures
in the current test cases of each algorithm. For example, we can add
>>> model.numFeatures
2
after here for LogisticRegressionModel
.
The python doctest can also used as examples that users can get how to use numFeatures
from the test case context.
@yanboliang I made test code to avoid recreation of datasets. |
Test build #47604 has finished for PR 9936 at commit
|
Test build #47605 has finished for PR 9936 at commit
|
try: | ||
self.df | ||
except AttributeError: | ||
from pyspark.mllib.linalg import Vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any reason for putting the import in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vectors
and StringIndexer
is not used in any other place. It is better not to expand the scope in my though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either is fine. I'm sure Vectors will be used elsewhere in this file in the future, but am not sure about StringIndexer.
Test build #49926 has finished for PR 9936 at commit
|
Test build #49928 has finished for PR 9936 at commit
|
It seems like there is a lot of duplicated just calls to Java & PyDoc could we maybe make a HasNumFeatures or similar (similar to how we do this for params which we use in many places)? |
Ping: Is this PR still active? (It would be nice) |
@Lewuathe Do you have time to update this PR? If not, I can help. |
Sorry for late response. I'll do update. @yanboliang Thanks for taking care! |
Aggregate numFeatures property in HasNumFeaturesModel in base.py.
Test build #54704 has finished for PR 9936 at commit
|
Test build #54705 has finished for PR 9936 at commit
|
Test build #54749 has finished for PR 9936 at commit
|
@Lewuathe Things got busy over here too; sorry for the delay! Btw, can you please add "[PYTHON]" to the PR title? |
It should be mixin with JavaModel. | ||
""" | ||
@property | ||
@since("1.7.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove since tag. If this is added to another class in a later version, the since tag will be incorrect.
Test build #58466 has finished for PR 9936 at commit
|
ping @Lewuathe Is this still active? If not, I could help with this. |
@vectorijk It seems difficult to handle this JIRA for me. Could you help me with this? |
Closing the following PRs due to requests or unresponsive users. Closes apache#13923 Closes apache#14462 Closes apache#13123 Closes apache#14423 (requested by srowen) Closes apache#14424 (requested by srowen) Closes apache#14101 (requested by jkbradley) Closes apache#10676 (requested by srowen) Closes apache#10943 (requested by yhuai) Closes apache#9936 Closes apache#10701
…Spark
see: https://issues.apache.org/jira/browse/SPARK-11938
Note: Test cases for
ml.regression
andml.classification
were not implemented. The test written in this patch target only fornumFeatures
. We can do enhancement of test case forml.regression
andml.classification
in separate JIRA.