Skip to content

Conversation

@huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Add the following Params classes in Pyspark clustering
GaussianMixtureParams
KMeansParams
BisectingKMeansParams
LDAParams
PowerIterationClusteringParams

Why are the changes needed?

To be consistent with scala side

Does this PR introduce any user-facing change?

Yes. Add the following changes:

GaussianMixtureModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setProbabilityCol
- get/setTol
- predict
KMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- get/setTol
- predict
BisectingKMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- predict
LDAModel(HasMaxIter, HasFeaturesCol, HasSeed, HasCheckpointInterval):
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setCheckpointInterval

How was this patch tested?

Add doctests

@SparkQA
Copy link

SparkQA commented Sep 19, 2019

Test build #111028 has finished for PR 25859 at commit 7acada0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class GaussianMixtureParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol,
  • class GaussianMixtureModel(JavaModel, GaussianMixtureParams, JavaMLWritable, JavaMLReadable,
  • class GaussianMixture(JavaEstimator, GaussianMixtureParams, JavaMLWritable, JavaMLReadable):
  • class KMeansParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol, HasTol,
  • class KMeansModel(JavaModel, KMeansParams, GeneralJavaMLWritable, JavaMLReadable,
  • class KMeans(JavaEstimator, KMeansParams, JavaMLWritable, JavaMLReadable):
  • class BisectingKMeansParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol,
  • class BisectingKMeansModel(JavaModel, BisectingKMeansParams, JavaMLWritable, JavaMLReadable,
  • class BisectingKMeans(JavaEstimator, BisectingKMeansParams, JavaMLWritable, JavaMLReadable):
  • class LDAParams(HasMaxIter, HasFeaturesCol, HasSeed, HasCheckpointInterval):
  • class LDAModel(JavaModel, LDAParams):
  • class LDA(JavaEstimator, LDAParams, JavaMLReadable, JavaMLWritable):
  • class PowerIterationClusteringParams(HasMaxIter, HasWeightCol):
  • class PowerIterationClustering(PowerIterationClusteringParams, JavaParams, JavaMLReadable,

Predict label for the given features.
"""
return self._call_java("predict", value)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add def predictProbability as well

>>> model.getDistanceMeasure()
'euclidean'
>>> model.setPredictionCol("newPrediction")
KMeans...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KMeansModel...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is KMeans_3487dfaa7c0e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maybe a little bug, KMeans & KMeansModel should have different uid (like LogisticRegression & LogisticRegressionModel). This issue seems also happen in other place. But we can leave it here.

@SparkQA
Copy link

SparkQA commented Sep 24, 2019

Test build #111300 has finished for PR 25859 at commit 4af0dc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this does is refactor the getters, right? the description says it adds setters, but I'm not seeing that at a first look.

Also it's a tricky question, what to do about @since annotations when subclasses add them at different times. I might suggest making it the highest version of any of the methods that are removed in favor of a new one.

@huaxingao
Copy link
Contributor Author

@srowen
This PR also adds the setters.
Use GaussianMixtureModel as an example:
before the PR:

class GaussianMixtureModel(JavaModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary):

after the PR:

class GaussianMixtureParams(HasMaxIter, HasFeaturesCol, HasSeed, HasPredictionCol,
                            HasProbabilityCol, HasTol):
class GaussianMixtureModel(JavaModel, GaussianMixtureParams, JavaMLWritable, JavaMLReadable,
                           HasTrainingSummary):

Since currently, HasXXX has both setters and getters, so this PR adds both the setters and getters to GaussianMixtureModel.
After next refactor jira https://issues.apache.org/jira/browse/SPARK-29093 (remove automatically generated param setters in _shared_params_code_gen.py), setters will be removed from HasXXX, I will need to explicitly add setFeaturesCol, setPredictionCol and setProbabilityCol to GaussianMixtureModel, then the code will be as following

class GaussianMixtureModel(JavaModel, GaussianMixtureParams, JavaMLWritable, JavaMLReadable,
                           HasTrainingSummary):
  def setFeaturesCol
  def setPredictionCol
  def setProbabilityCol

It will be exactly the same as the currently scala code below:

class GaussianMixtureModel extends Model with GaussianMixtureParams with MLWritable
  with HasTrainingSummary
  def setFeaturesCol
  def setPredictionCol
  def setProbabilityCol

I agree with you that we should retain @since annotations with the highest version of any of the removed methods.

@srowen
Copy link
Member

srowen commented Sep 25, 2019

OK so this will need to be followed up with another PR. That's fine, just remind me to review it. (We can link the JIRAs too to make it clear.) I'll leave it open for more comments for a bit.

@zhengruifeng
Copy link
Contributor

One more thing: we may need to rename xxxParams to _xxxParams in this PR & #25908 ?

@srowen
Copy link
Member

srowen commented Sep 26, 2019

PS @zhengruifeng I think you can merge this, and the other related PR, once you are both comfortable with it.

@SparkQA
Copy link

SparkQA commented Sep 26, 2019

Test build #111438 has finished for PR 25859 at commit 712ef78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor

Merged to master, thanks all!

@huaxingao
Copy link
Contributor Author

Thanks! @srowen @zhengruifeng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants