[SPARK-29142][PYTHON][ML] Pyspark clustering models support column setters/getters/predict #25859

huaxingao · 2019-09-19T21:52:48Z

What changes were proposed in this pull request?

Add the following Params classes to PySpark clustering:
GaussianMixtureParams
KMeansParams
BisectingKMeansParams
LDAParams
PowerIterationClusteringParams

Why are the changes needed?

To be consistent with the Scala side.

Does this PR introduce any user-facing change?

Yes. The following methods are added:

GaussianMixtureModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setProbabilityCol
- get/setTol
- predict

KMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- get/setTol
- predict

BisectingKMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- predict

LDAModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setCheckpointInterval

How was this patch tested?

Add doctests

SparkQA · 2019-09-19T22:23:37Z

Test build #111028 has finished for PR 25859 at commit 7acada0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class GaussianMixtureParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol,
class GaussianMixtureModel(JavaModel, GaussianMixtureParams, JavaMLWritable, JavaMLReadable,
class GaussianMixture(JavaEstimator, GaussianMixtureParams, JavaMLWritable, JavaMLReadable):
class KMeansParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol, HasTol,
class KMeansModel(JavaModel, KMeansParams, GeneralJavaMLWritable, JavaMLReadable,
class KMeans(JavaEstimator, KMeansParams, JavaMLWritable, JavaMLReadable):
class BisectingKMeansParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol,
class BisectingKMeansModel(JavaModel, BisectingKMeansParams, JavaMLWritable, JavaMLReadable,
class BisectingKMeans(JavaEstimator, BisectingKMeansParams, JavaMLWritable, JavaMLReadable):
class LDAParams(HasMaxIter, HasFeaturesCol, HasSeed, HasCheckpointInterval):
class LDAModel(JavaModel, LDAParams):
class LDA(JavaEstimator, LDAParams, JavaMLReadable, JavaMLWritable):
class PowerIterationClusteringParams(HasMaxIter, HasWeightCol):
class PowerIterationClustering(PowerIterationClusteringParams, JavaParams, JavaMLReadable,

zhengruifeng · 2019-09-24T07:41:50Z

python/pyspark/ml/clustering.py

+        Predict label for the given features.
+        """
+        return self._call_java("predict", value)
+


We should add def predictProbability as well.
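A minimal sketch of the delegation pattern in the diff above, with the requested predictProbability added next to predict. FakeJavaBackedModel and its dictionary of callables are hypothetical stand-ins for a JVM-backed model; in PySpark the call goes to the wrapped Java object instead.

```python
class FakeJavaBackedModel:
    """Hypothetical stand-in for a JVM-backed model (not a pyspark class)."""

    def __init__(self, java_methods):
        # Maps a method name to a Python callable that plays the JVM side.
        self._java_methods = java_methods

    def _call_java(self, name, *args):
        # Assumption: the real helper forwards to the Java object by name.
        return self._java_methods[name](*args)

    def predict(self, value):
        """Predict label for the given features."""
        return self._call_java("predict", value)

    def predictProbability(self, value):
        """Predict the probability of each cluster for the given features."""
        return self._call_java("predictProbability", value)


# Toy two-cluster "model": negative first coordinate goes to cluster 0.
model = FakeJavaBackedModel({
    "predict": lambda v: 0 if v[0] < 0 else 1,
    "predictProbability": lambda v: [1.0, 0.0] if v[0] < 0 else [0.0, 1.0],
})
label = model.predict([-0.5, 2.0])
probabilities = model.predictProbability([-0.5, 2.0])
```

Both methods are thin wrappers, so adding predictProbability costs one more forwarding method on the Python side.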

zhengruifeng · 2019-09-24T07:45:22Z

python/pyspark/ml/clustering.py

+    >>> model.getDistanceMeasure()
+    'euclidean'
+    >>> model.setPredictionCol("newPrediction")
+    KMeans...


KMeansModel...

It is KMeans_3487dfaa7c0e

This may be a small bug: KMeans and KMeansModel should have different uids (like LogisticRegression and LogisticRegressionModel). The issue also seems to occur in other places, but we can leave it as is here.
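The exchange above turns on how the expected doctest output is matched. A small sketch, assuming the doctests run with the ELLIPSIS option: the expected line "KMeans..." matches the uid-based repr quoted in the thread, while "KMeansModel..." would not.

```python
import doctest

checker = doctest.OutputChecker()

# The repr reported in the thread; the model reuses the estimator's uid.
got = "KMeans_3487dfaa7c0e\n"

# With ELLIPSIS, "..." in the expected text matches any substring.
matches_estimator_prefix = checker.check_output(
    "KMeans...\n", got, doctest.ELLIPSIS)
matches_model_prefix = checker.check_output(
    "KMeansModel...\n", got, doctest.ELLIPSIS)

print(matches_estimator_prefix, matches_model_prefix)
```

This is why the doctest keeps "KMeans..." for now: changing the expected text to "KMeansModel..." would fail until the model gets its own uid prefix.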

SparkQA · 2019-09-24T17:13:51Z

Test build #111300 has finished for PR 25859 at commit 4af0dc6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

All this does is refactor the getters, right? The description says it adds setters, but I'm not seeing that at first glance.

Also, it's a tricky question what to do about @since annotations when subclasses added the methods at different times. I'd suggest using the highest version of any of the methods that are removed in favor of the new one.
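The "highest version" rule can be sketched as follows. The helper name highest_since and the version strings are hypothetical, chosen only for illustration; versions are compared numerically rather than as strings.

```python
def highest_since(versions):
    """Return the highest dotted version string, compared numerically."""
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


# Hypothetical @since values found on the per-class methods being removed.
removed_method_versions = ["2.0.0", "2.4.0", "2.10.0"]
shared_since = highest_since(removed_method_versions)
```

Parsing each component as an integer matters: a plain string comparison would rank "2.4.0" above "2.10.0".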

huaxingao · 2019-09-25T17:48:51Z

@srowen
This PR also adds the setters.
Use GaussianMixtureModel as an example:
before the PR:

class GaussianMixtureModel(JavaModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary):

after the PR:

class GaussianMixtureParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol,
                            HasProbabilityCol, HasTol):
class GaussianMixtureModel(JavaModel, GaussianMixtureParams, JavaMLWritable, JavaMLReadable,
                           HasTrainingSummary):

Because HasXXX currently provides both setters and getters, this PR adds both to GaussianMixtureModel.
After the next refactoring JIRA, https://issues.apache.org/jira/browse/SPARK-29093 (remove automatically generated param setters in _shared_params_code_gen.py), the setters will be removed from HasXXX, and I will need to explicitly add setFeaturesCol, setPredictionCol and setProbabilityCol to GaussianMixtureModel. The code will then look like this:

class GaussianMixtureModel(JavaModel, GaussianMixtureParams, JavaMLWritable, JavaMLReadable,
                           HasTrainingSummary):
  def setFeaturesCol
  def setPredictionCol
  def setProbabilityCol

It will then match the current Scala code below:

class GaussianMixtureModel extends Model with GaussianMixtureParams with MLWritable
  with HasTrainingSummary
  def setFeaturesCol
  def setPredictionCol
  def setProbabilityCol

I agree with you that we should retain @since annotations with the highest version of any of the removed methods.
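To illustrate how a model picks up getters and setters purely by inheriting a shared Params mixin, here is a minimal, self-contained sketch. The classes below are simplified stand-ins that only borrow the names from this PR; they are not the pyspark implementations, and the dictionary-based storage is an assumption made for brevity.

```python
class Params:
    """Simplified param container (stand-in, not pyspark.ml.param.Params)."""

    def __init__(self):
        self._values = {}

    def _set(self, name, value):
        self._values[name] = value
        return self  # returning self allows chained setter calls

    def _get(self, name, default):
        return self._values.get(name, default)


class HasFeaturesCol(Params):
    def getFeaturesCol(self):
        return self._get("featuresCol", "features")

    def setFeaturesCol(self, value):
        return self._set("featuresCol", value)


class HasPredictionCol(Params):
    def getPredictionCol(self):
        return self._get("predictionCol", "prediction")

    def setPredictionCol(self, value):
        return self._set("predictionCol", value)


class GaussianMixtureParams(HasFeaturesCol, HasPredictionCol):
    """Params shared by the estimator and the model."""


class GaussianMixtureModel(GaussianMixtureParams):
    """Defines no accessors itself; they all come from the mixins."""


model = GaussianMixtureModel()
model.setFeaturesCol("vec").setPredictionCol("cluster")
print(model.getFeaturesCol(), model.getPredictionCol())
```

If the setters are later removed from the HasXXX mixins, as described for SPARK-29093, the model class in this sketch would need its own setFeaturesCol and setPredictionCol definitions to keep the same surface.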

srowen · 2019-09-25T18:02:56Z

OK so this will need to be followed up with another PR. That's fine, just remind me to review it. (We can link the JIRAs too to make it clear.) I'll leave it open for more comments for a bit.

zhengruifeng · 2019-09-26T03:48:59Z

One more thing: should we rename xxxParams to _xxxParams in this PR and #25908?

srowen · 2019-09-26T12:27:26Z

PS @zhengruifeng I think you can merge this, and the other related PR, once you are both comfortable with it.

SparkQA · 2019-09-26T17:42:58Z

Test build #111438 has finished for PR 25859 at commit 712ef78.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-09-27T03:25:16Z

Merged to master, thanks all!

huaxingao · 2019-09-27T04:23:36Z

Thanks! @srowen @zhengruifeng

[SPARK-29142][PYTHON][ML] Pyspark clustering models support column se…

7acada0

…tters/getters/predict

dongjoon-hyun added ML PYSPARK labels Sep 20, 2019

zhengruifeng reviewed Sep 24, 2019

address comments

4af0dc6

srowen reviewed Sep 25, 2019

zhengruifeng approved these changes Sep 26, 2019

huaxingao added 2 commits September 26, 2019 10:12

add _ in front of xxxParams to indicate internal use (PEP8)

6e6a5d5

fix a problem

712ef78

zhengruifeng closed this in bdc4943 Sep 27, 2019

huaxingao deleted the spark-29142 branch September 27, 2019 04:23

zero323 mentioned this pull request Sep 28, 2019

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks
