Skip to content

[SPARK-11938][ML] Expose numFeatures in all ML PredictionModel for Py… #9936

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

Lewuathe
Copy link
Contributor

…Spark

see: https://issues.apache.org/jira/browse/SPARK-11938

Note: Test cases for ml.regression and ml.classification were not implemented. The test written in this patch target only for numFeatures. We can do enhancement of test case for ml.regression and ml.classification in separate JIRA.

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46609 has finished for PR 9936 at commit 25475ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Lewuathe
Copy link
Contributor Author

@mengxr Could you review it? Thanks.

@since("1.6.0")
def numFeatures(self):
"""
The number of features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The number of features" -> "The number of features used to train the model."

I think this is more explicit since, for things like decision trees, this property is what number of features were used in training and not necessarily what number of features the model incorporates in its prediction.

Nit: add period to follow docstring convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethah Thank you. I updated!

@SparkQA
Copy link

SparkQA commented Dec 2, 2015

Test build #47029 has finished for PR 9936 at commit 478e113.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -371,6 +378,126 @@ def test_fit_maximize_metric(self):
self.assertEqual(1.0, bestModelMetric, "Best model has R-squared of 1")


class RegressorTest(PySparkTestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we test numFeatures at each model that we can save cost of creating DataFrame and training model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each trainer output corresponding models. So we cannot share training code but creating DataFrames can be separated each RegressorTest and ClassificationTest. I'll update so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lewuathe Sorry for late response. What I means is that we can add test for numFeatures in the current test cases of each algorithm. For example, we can add

>>> model.numFeatures
2

after here for LogisticRegressionModel.
The python doctest can also used as examples that users can get how to use numFeatures from the test case context.

@Lewuathe
Copy link
Contributor Author

@yanboliang I made test code to avoid recreation of datasets.
Could you check again please?

@SparkQA
Copy link

SparkQA commented Dec 12, 2015

Test build #47604 has finished for PR 9936 at commit 3b02d4e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 12, 2015

Test build #47605 has finished for PR 9936 at commit 5b4c500.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

try:
self.df
except AttributeError:
from pyspark.mllib.linalg import Vectors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason for putting the import in the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vectors and StringIndexer is not used in any other place. It is better not to expand the scope in my though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either is fine. I'm sure Vectors will be used elsewhere in this file in the future, but am not sure about StringIndexer.

@SparkQA
Copy link

SparkQA commented Jan 23, 2016

Test build #49926 has finished for PR 9936 at commit cb8904e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2016

Test build #49928 has finished for PR 9936 at commit 2ac0329.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jan 28, 2016

It seems like there is a lot of duplicated just calls to Java & PyDoc could we maybe make a HasNumFeatures or similar (similar to how we do this for params which we use in many places)?

@jkbradley
Copy link
Member

Ping: Is this PR still active? (It would be nice)

@yanboliang
Copy link
Contributor

@Lewuathe Do you have time to update this PR? If not, I can help.

@Lewuathe
Copy link
Contributor Author

Lewuathe commented Apr 1, 2016

Sorry for late response. I'll do update. @yanboliang Thanks for taking care!

Aggregate numFeatures property in HasNumFeaturesModel in base.py.
@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54704 has finished for PR 9936 at commit b211e3d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54705 has finished for PR 9936 at commit 3f718d0.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegressionModel(HasNumFeaturesModel, JavaModel, JavaMLWritable, JavaMLReadable):
    • class DecisionTreeClassificationModel(HasNumFeaturesModel, DecisionTreeModel, JavaMLWritable, JavaMLReadable):
    • class RandomForestClassificationModel(HasNumFeaturesModel, TreeEnsembleModels):
    • class GBTClassificationModel(HasNumFeaturesModel, TreeEnsembleModels):
    • class NaiveBayesModel(HasNumFeaturesModel, JavaModel, JavaMLWritable, JavaMLReadable):
    • class MultilayerPerceptronClassificationModel(HasNumFeaturesModel, JavaModel, JavaMLWritable, JavaMLReadable):
    • class LinearRegressionModel(HasNumFeaturesModel, JavaModel, JavaMLWritable, JavaMLReadable):
    • class DecisionTreeRegressionModel(HasNumFeaturesModel, DecisionTreeModel, JavaMLWritable, JavaMLReadable):
    • class RandomForestRegressionModel(HasNumFeaturesModel, TreeEnsembleModels):
    • class GBTRegressionModel(HasNumFeaturesModel, TreeEnsembleModels):

@SparkQA
Copy link

SparkQA commented Apr 2, 2016

Test build #54749 has finished for PR 9936 at commit a128635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(HasNumFeaturesModel, DecisionTreeModel, JavaMLWritable,
    • class MultilayerPerceptronClassificationModel(HasNumFeaturesModel, JavaModel, JavaMLWritable,
    • class DecisionTreeRegressionModel(HasNumFeaturesModel, DecisionTreeModel, JavaMLWritable,

@jkbradley
Copy link
Member

@Lewuathe Things got busy over here too; sorry for the delay! Btw, can you please add "[PYTHON]" to the PR title?

It should be mixin with JavaModel.
"""
@property
@since("1.7.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove since tag. If this is added to another class in a later version, the since tag will be incorrect.

@SparkQA
Copy link

SparkQA commented May 12, 2016

Test build #58466 has finished for PR 9936 at commit a128635.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(HasNumFeaturesModel, DecisionTreeModel, JavaMLWritable,
    • class MultilayerPerceptronClassificationModel(HasNumFeaturesModel, JavaModel, JavaMLWritable,
    • class DecisionTreeRegressionModel(HasNumFeaturesModel, DecisionTreeModel, JavaMLWritable,

@vectorijk
Copy link
Contributor

ping @Lewuathe Is this still active? If not, I could help with this.

@Lewuathe
Copy link
Contributor Author

@vectorijk It seems difficult to handle this JIRA for me. Could you help me with this?
Thanks so much for taking care!

@vectorijk
Copy link
Contributor

vectorijk commented Jun 27, 2016

@Lewuathe Thanks! I opened a new PR #13922 here for this issue. Would you mind closing this PR later? And feel free to give comments to that PR.

vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 4, 2016
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
@asfgit asfgit closed this in 53e766c Aug 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants