[SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth #17218
Conversation
Test build #74230 has finished for PR 17218 at commit
Test build #74354 has finished for PR 17218 at commit
Test build #74356 has finished for PR 17218 at commit
Thanks for the PR! I'll wait until this isn't "WIP" to review it thoroughly, but I'll make two comments now:
@jkbradley Thanks for the comment. I thought about ...
True, if minSupport can be shared, then that's OK. confidence won't be shared though.
@jkbradley As far as I remember some variants of ... Somewhat related: could you take a look at SPARK-19899?
Test build #74534 has finished for PR 17218 at commit
Force-pushed from bebb363 to 9074312.
Test build #74630 has finished for PR 17218 at commit
Test build #74632 has finished for PR 17218 at commit
Force-pushed from 9bde018 to 0a3798d.
Note: should be retested after #17321 is resolved.
Test build #74827 has finished for PR 17218 at commit
Jenkins retest this please.
Test build #74895 has finished for PR 17218 at commit
Test build #74898 has finished for PR 17218 at commit
@jkbradley I think this is ready for review.
Sure, I can take a look. Let me ping @MLnick too, since he marked himself as shepherd.
I'm only partly done reviewing, but I'll go ahead and send some comments. Thanks for the PR!
python/pyspark/ml/fpm.py
Outdated
""" | ||
Sets the value of :py:attr:`minSupport`. | ||
""" | ||
if not 0 <= value <= 1: |
This check happens on the Scala side; let's not replicate it here.
To be honest I don't like this approach, so I'll try to make the case for keeping this as-is.
If we depend on the Scala checks, we fail late, because validation is delayed until the point where transform is called. If that happens in the middle of a complex pipeline, it is simply expensive, so my opinion is that if we can fail early without significant overhead, we should.
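For illustration, a minimal plain-Python sketch of the fail-early pattern being argued for (hypothetical helper names; the actual PySpark setter delegates to `_set` on a `Params` object rather than a dict):

```python
# Hypothetical sketch of the fail-early pattern: validate at set time rather
# than when the (possibly long) pipeline finally executes on the JVM side.
def set_min_support(params, value):
    """Store minSupport in a plain dict, rejecting out-of-range values eagerly."""
    if not 0 <= value <= 1:
        raise ValueError("Support must be in range [0, 1]")
    params["minSupport"] = value
    return params

params = set_min_support({}, 0.3)   # accepted immediately
try:
    set_min_support({}, 1.5)        # rejected now, not mid-pipeline
except ValueError as e:
    error = str(e)
```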
python/pyspark/ml/fpm.py
Outdated
class HasConfidence(Params):
    """
    Mixin for param confidence: [0.0, 1.0].
Omit the range here too.
python/pyspark/ml/fpm.py
Outdated
""" | ||
Sets the value of :py:attr:`minConfidence`. | ||
""" | ||
if not 0 <= value <= 1: |
ditto
python/pyspark/ml/fpm.py
Outdated
minConfidence = Param(
    Params._dummy(),
    "minConfidence",
    "Minimal confidence for generating Association Rule. [0.0, 1.0]",
Match Scala doc: "Note that minConfidence has no effect during fitting."
python/pyspark/ml/fpm.py
Outdated
""" | ||
|
||
itemsCol = Param(Params._dummy(), "itemsCol", | ||
"items column name.", typeConverter=TypeConverters.toString) |
Remove the period "." from the end of the doc string here.
python/pyspark/ml/fpm.py
Outdated
itemsCol = Param(Params._dummy(), "itemsCol",
                 "items column name.", typeConverter=TypeConverters.toString)

def __init__(self):
No need for this. The default will be set in FPGrowth.
    return self.getOrDefault(self.itemsCol)


class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):
Mark Experimental
Also, it'd be good to be able to set minConfidence, itemsCol and predictionCol (for associationRules and transform)
I pushed my first attempt, but I think it will require a bit more discussion. If we enable this here, should we do the same for the rest of the Python models?
python/pyspark/ml/fpm.py
Outdated
    return self._call_java("associationRules")


class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,
Mark Experimental
python/pyspark/ml/fpm.py
Outdated
@property
@since("2.2.0")
def freqItemsets(self):
    """DataFrame with two columns:
Python style: put triple-quotes on a line by themselves (here and elsewhere below)
Done.
Side note: should we add this to https://spark.apache.org/contributing.html? (PEP8 recommends that only the closing quotes be placed on a separate line.)
Issue this PR brought up: ...
Test build #75219 has finished for PR 17218 at commit
Thanks for the updates!
> Also, it'd be good to be able to set minConfidence, itemsCol and predictionCol (for associationRules and transform)

> I pushed my first attempt, but I think it will require a bit more discussion. If we enable this here, should we do the same for the rest of the Python models?
True, we should do it for all models. And you're right that it's more involved than I was thinking. Specifically, rather than calling setParams from _create_model, I'd want us to call _copyValues from fit() in order to eliminate duplicate code. Would you mind removing the Params from the model, and we can work on adding them in more carefully for the next release? Thanks a lot!
I dug up the existing JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-10931
> Side note: Should we add it to https://spark.apache.org/contributing.html (PEP8 recommends only the closing quote to be placed in a separate line).
I would say yes...except I see it is inconsistent elsewhere in Spark. I guess I won't push for it anymore.
"pyspark.ml.classification",
"pyspark.ml.clustering",
"pyspark.ml.evaluation",
"pyspark.ml.feature",
"pyspark.ml.fpm",
"pyspark.ml.linalg.__init__",
"pyspark.ml.recommendation",
"pyspark.ml.regression",
"pyspark.ml.tuning",
"pyspark.ml.tests",
As long as you're at it, switch tuning & tests to alphabetize them
Sure thing. I thought there was some logic in putting tests last. Should I reorder the other modules as well?
Interesting...maybe? I guess it doesn't really matter, so no need to rearrange more.
python/pyspark/ml/fpm.py
Outdated
""" | ||
Sets the value of :py:attr:`minSupport`. | ||
""" | ||
if not (0 <= value <= 1): |
On this topic, I agree with you that not checking here could currently cause late failures in a Pipeline. However, I think the right fix for this is to add PipelineStage and transformSchema() to Python. I just made a JIRA for it: https://issues.apache.org/jira/browse/SPARK-20099
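For illustration, a minimal sketch of what such an up-front schema pass could look like (plain Python, all names hypothetical; SPARK-20099 tracks the real design of `PipelineStage`/`transformSchema()` for Python):

```python
# Hypothetical sketch of an up-front schema check: each stage declares the
# columns it needs and the columns it adds, so a misconfigured pipeline
# fails before any expensive computation starts.
def transform_schema(stages, input_columns):
    """stages: list of (name, required_cols, added_cols); returns final column set."""
    schema = set(input_columns)
    for name, required, added in stages:
        missing = set(required) - schema
        if missing:
            raise ValueError("Stage %r is missing columns: %s" % (name, sorted(missing)))
        schema |= set(added)
    return schema

stages = [
    ("tokenizer", ["text"], ["items"]),
    ("fpgrowth", ["items"], ["prediction"]),
]
final = transform_schema(stages, ["text"])  # passes: column flow is consistent
```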
python/pyspark/ml/fpm.py
Outdated
minConfidence = Param(
    Params._dummy(),
    "minConfidence",
    """"Minimal confidence for generating Association Rule. [0.0, 1.0]
Extra quotes here. Does this come out formatted correctly?
python/pyspark/ml/fpm.py
Outdated
                    HasConfidence, HasItemsCol, HasPredictionCol):
    """Model fitted by FPGrowth.

    .. note:: Experimental
Put first in doc string (See examples elsewhere)
python/pyspark/ml/fpm.py
Outdated
""" | ||
Data with three columns: | ||
* `antecedent` - Array of the same type as the input column. | ||
* `consequent` - Single element array of the same type as the input column. |
I just realized: If we're leaving open the possibility of returning multiple elements here in the future, then let's not document that this has a single element (else it effectively becomes a guarantee in the API).
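For background, the confidence of a rule antecedent => consequent is support(antecedent and consequent) / support(antecedent). A tiny pure-Python sketch of that computation over a toy dataset (illustrative only, not Spark's implementation):

```python
# Illustrative computation of rule confidence over a toy transaction list;
# Spark derives the same quantity from the fitted model's frequent itemsets.
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [["r", "z"], ["z"], ["r", "z"], ["r"]]
conf = confidence(transactions, ["r"], ["z"])  # 2 of the 3 'r' baskets contain 'z'
```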
python/pyspark/ml/fpm.py
Outdated
.. [LI2008] http://dx.doi.org/10.1145/1454008.1454027
.. [HAN2000] http://dx.doi.org/10.1145/335191.335372

.. note:: Experimental
I didn't see this before, so now this is noted twice. Just put it once at the beginning of the docstring.
Test build #75230 has finished for PR 17218 at commit
python/pyspark/ml/fpm.py
Outdated
                     typeConverter=TypeConverters.toFloat)

def setMinSupport(self, value):
    """
    Sets the value of :py:attr:`minSupport`.
    """
    if not (0 <= value <= 1):
        raise ValueError("Support must be in range [0, 1]")
    return self._set(minSupport=value)
This removed too much! This line should remain
Test build #75237 has finished for PR 17218 at commit
Jenkins retest this please.
I removed the code and I'll be following SPARK-10931. One possible challenge (here and for parameter validation) is the high latency of Py4J calls. With large pipelines it can build up pretty fast.
Test build #75241 has finished for PR 17218 at commit
LGTM
@indyragandy What do you mean by "get directly"?
## What changes were proposed in this pull request?
Follow-up for #17218, some minor fix for PySpark `FPGrowth`.
## How was this patch tested?
Existing UT.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18089 from yanboliang/spark-19281.
(cherry picked from commit 913a6bf)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
What changes were proposed in this pull request?
Adds `HasSupport` and `HasConfidence` `Params`, the `pyspark.ml.fpm` module, and `FPGrowth` / `FPGrowthModel` wrappers.
How was this patch tested?
Unit tests.
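For intuition, the frequent-itemset mining that the new wrappers expose can be sketched with brute-force counting in plain Python (illustrative only; Spark's `FPGrowth` implements the FP-tree algorithm and runs distributed over a DataFrame of item arrays):

```python
from itertools import combinations

# Brute-force frequent-itemset counting, for intuition only. Spark's FPGrowth
# computes the same result with the FP-tree algorithm, without enumerating
# every candidate itemset.
def freq_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets whose support meets the threshold."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        t = sorted(set(t))
        for size in range(1, len(t) + 1):
            for combo in combinations(t, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c for s, c in counts.items() if c / n >= min_support}

transactions = [["r", "z", "h"], ["z", "y", "x"], ["z"], ["r", "x"]]
frequent = freq_itemsets(transactions, min_support=0.5)
```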