Skip to content

[SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth #17218

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 22 commits into from

Conversation

zero323
Copy link
Member

@zero323 zero323 commented Mar 9, 2017

What changes were proposed in this pull request?

  • Add HasSupport and HasConfidence Params.
  • Add new module pyspark.ml.fpm.
  • Add FPGrowth / FPGrowthModel wrappers.
  • Provide tests for new features.

How was this patch tested?

Unit tests.

@SparkQA
Copy link

SparkQA commented Mar 9, 2017

Test build #74230 has finished for PR 17218 at commit 3b10a30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):
  • class FPGrowth(JavaEstimator, HasFeaturesCol, HasPredictionCol,
  • class HasSupport(Params):
  • class HasConfidence(Params):

@SparkQA
Copy link

SparkQA commented Mar 10, 2017

Test build #74354 has finished for PR 17218 at commit e599f31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 11, 2017

Test build #74356 has finished for PR 17218 at commit 4fe6257.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Thanks for the PR! I'll wait until this isn't "WIP" to review it thoroughly, but I'll make two comments now:

  • The params should not be added to shared.py since they are not shared by any other algorithm. They can be added later if needed, but I expect them not to be since the documentation for these in particular should be specialized for FPM algorithms.
  • For future reference: Never add stuff directly to shared.py; it should go in the generating file in the same folder.

@zero323
Copy link
Member Author

zero323 commented Mar 13, 2017

@jkbradley Thanks for the comment. I thought about PrefixSpan in the future so I wanted to avoid embedding this in FPGrowth. I'll put it in the fpm.

@jkbradley
Copy link
Member

True, if minSupport can be shared, then that's OK. confidence won't be shared though.

@zero323
Copy link
Member Author

zero323 commented Mar 13, 2017

@jkbradley As far as I remember some variants of PrefixSpan use confidence but I doubt we'll encounter this problem any time soon :)

Somewhat related - could you take a look at SPARK-19899?

@SparkQA
Copy link

SparkQA commented Mar 14, 2017

Test build #74534 has finished for PR 17218 at commit c9ab242.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 force-pushed the SPARK-19281 branch 2 times, most recently from bebb363 to 9074312 Compare March 16, 2017 00:21
@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74630 has finished for PR 17218 at commit 9074312.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasSupport(Params):
  • class HasConfidence(Params):

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74632 has finished for PR 17218 at commit a2afb74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasSupport(Params):
  • class HasConfidence(Params):

@zero323 zero323 changed the title [SPARK-19281][WIP][PYTHON][ML] spark.ml Python API for FPGrowth [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth Mar 17, 2017
@zero323 zero323 force-pushed the SPARK-19281 branch 2 times, most recently from 9bde018 to 0a3798d Compare March 19, 2017 15:42
@zero323
Copy link
Member Author

zero323 commented Mar 19, 2017

Note: should be retested after #17321 is resolved.

@SparkQA
Copy link

SparkQA commented Mar 19, 2017

Test build #74827 has finished for PR 17218 at commit 0a3798d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasItemsCol(Params):
  • class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):
  • class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,

@zero323
Copy link
Member Author

zero323 commented Mar 20, 2017

Jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Mar 20, 2017

Test build #74895 has finished for PR 17218 at commit 0a3798d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasItemsCol(Params):
  • class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):
  • class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,

@SparkQA
Copy link

SparkQA commented Mar 20, 2017

Test build #74898 has finished for PR 17218 at commit 5f4673e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Copy link
Member Author

zero323 commented Mar 21, 2017

@jkbradley I think this is ready for review.

@jkbradley
Copy link
Member

Sure, I can take a look. Let me ping @MLnick too since he marked himself as shepherd

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only partly done reviewing, but I'll go ahead and send some comments. Thanks for the PR!

"""
Sets the value of :py:attr:`minSupport`.
"""
if not 0 <= value <= 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check happens on the Scala side; let's not replicate it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I don't like this approach so I'll try to make the case for keeping this "as-is".

If we depend on Scala checks we fail late by delaying this to the point where transform is called. If this happens in the middle of a complex pipeline then it is simply expensive so my opinion is that if we can fail early without significant overhead then we should.


class HasConfidence(Params):
"""
Mixin for param confidence: [0.0, 1.0].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omit range here too

"""
Sets the value of :py:attr:`minConfidence`.
"""
if not 0 <= value <= 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

minConfidence = Param(
Params._dummy(),
"minConfidence",
"Minimal confidence for generating Association Rule. [0.0, 1.0]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match Scala doc: "Note that minConfidence has no effect during fitting."

"""

itemsCol = Param(Params._dummy(), "itemsCol",
"items column name.", typeConverter=TypeConverters.toString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove period "." from end of doc string here

itemsCol = Param(Params._dummy(), "itemsCol",
"items column name.", typeConverter=TypeConverters.toString)

def __init__(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this. The default will be set in FPGrowth

return self.getOrDefault(self.itemsCol)


class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark Experimental

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it'd be good to be able to set minConfidence, itemsCol and predictionCol (for associationRules and transform)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed my first attempt but I think will require a bit more discussion. If enable this here should we do the same for the rest of Python models?

return self._call_java("associationRules")


class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark Experimental

@property
@since("2.2.0")
def freqItemsets(self):
"""DataFrame with two columns:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python style: put triple-quotes on a line by themselves (here and elsewhere below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Side note: Should we add it to https://spark.apache.org/contributing.html (PEP8 recommends only the closing quote to be placed in a separate line).

@jkbradley
Copy link
Member

Issue this PR brought up:

  • Background: AssociationRules currently return a 1-element array for the consequent (predicted item). This makes sense b/c, even though multiple consequents could be predicted for a given itemset, they belong in different rules because they have different confidences.
  • Question: Should we change the schema for "consequent" to be a single item, rather than an array of a single item?
  • CCing people who have worked on this: @zero323 @MLnick @hhbyyh @felixcheung

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75219 has finished for PR 17218 at commit 21a3606.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

Also, it'd be good to be able to set minConfidence, itemsCol and predictionCol (for associationRules and transform)
I pushed my first attempt but I think will require a bit more discussion. If enable this here should we do the same for the rest of Python models?

True, we should do it for all models. And you're right that it's more involved than I was thinking. Specifically, rather than calling setParams from _create_model, I'd want us to call _copyValues from fit() in order to eliminate duplicate code. Would you mind removing the Params from the model, and we can work on adding them in more carefully for the next release? Thanks a lot!

I dug up the existing JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-10931

Side note: Should we add it to https://spark.apache.org/contributing.html (PEP8 recommends only the closing quote to be placed in a separate line).

I would say yes...except I see it is inconsistent elsewhere in Spark. I guess I won't push for it anymore.

"pyspark.ml.classification",
"pyspark.ml.clustering",
"pyspark.ml.evaluation",
"pyspark.ml.feature",
"pyspark.ml.fpm",
"pyspark.ml.linalg.__init__",
"pyspark.ml.recommendation",
"pyspark.ml.regression",
"pyspark.ml.tuning",
"pyspark.ml.tests",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you're at it, switch tuning & tests to alphabetize them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I thought there is some logic in putting tests last. Should I reorder the other modules as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting...maybe? I guess it doesn't really matter, so no need to rearrange more.

"""
Sets the value of :py:attr:`minSupport`.
"""
if not (0 <= value <= 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this topic, I agree with you that not checking here could currently cause late failures in a Pipeline. However, I think the right fix for this is to add PipelineStage and transformSchema() to Python. I just made a JIRA for it: https://issues.apache.org/jira/browse/SPARK-20099

minConfidence = Param(
Params._dummy(),
"minConfidence",
""""Minimal confidence for generating Association Rule. [0.0, 1.0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra quotes here. Does this come out formatted correctly?

HasConfidence, HasItemsCol, HasPredictionCol):
"""Model fitted by FPGrowth.

.. note:: Experimental
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put first in doc string (See examples elsewhere)

"""
Data with three columns:
* `antecedent` - Array of the same type as the input column.
* `consequent` - Single element array of the same type as the input column.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized: If we're leaving open the possibility of returning multiple elements here in the future, then let's not document that this has a single element (else it effectively becomes a guarantee in the API).

.. [LI2008] http://dx.doi.org/10.1145/1454008.1454027
.. [HAN2000] http://dx.doi.org/10.1145/335191.335372

.. note:: Experimental
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see this before, so now this is noted twice. Just put it once at the beginning of the docstring.

@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #75230 has finished for PR 17218 at commit deb2ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

typeConverter=TypeConverters.toFloat)

def setMinSupport(self, value):
"""
Sets the value of :py:attr:`minSupport`.
"""
if not (0 <= value <= 1):
raise ValueError("Support must be in range [0, 1]")
return self._set(minSupport=value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removed too much! This line should remain

@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #75237 has finished for PR 17218 at commit 66b85e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Copy link
Member Author

zero323 commented Mar 26, 2017

Jenkins retest this please.

@zero323
Copy link
Member Author

zero323 commented Mar 26, 2017

True, we should do it for all models. And you're right that it's more involved than I was thinking. Specifically, rather than calling setParams from _create_model, I'd want us to call _copyValues from fit() in order to eliminate duplicate code. Would you mind removing the Params from the model, and we can work on adding them in more carefully for the next release? Thanks a lot!

I dug up the existing JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-10931

I removed the code and I'll be following SPARK-10931. One possible challenge (here and for parameters validation) is high latency of Py4j calls. With large pipelines it can build up pretty fast.

@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #75241 has finished for PR 17218 at commit 66b85e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks a lot!

@jkbradley
Copy link
Member

@indyragandy What do you mean by "get directly"?

@asfgit asfgit closed this in 0bc8847 Mar 26, 2017
@zero323 zero323 deleted the SPARK-19281 branch April 6, 2017 10:55
asfgit pushed a commit that referenced this pull request May 25, 2017
## What changes were proposed in this pull request?
Follow-up for #17218, some minor fix for PySpark ```FPGrowth```.

## How was this patch tested?
Existing UT.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18089 from yanboliang/spark-19281.

(cherry picked from commit 913a6bf)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request May 25, 2017
## What changes were proposed in this pull request?
Follow-up for apache#17218, some minor fix for PySpark ```FPGrowth```.

## How was this patch tested?
Existing UT.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#18089 from yanboliang/spark-19281.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants