[SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth #17218
Conversation
Test build #74230 has finished for PR 17218 at commit
Test build #74354 has finished for PR 17218 at commit
Test build #74356 has finished for PR 17218 at commit
Thanks for the PR! I'll wait until this isn't "WIP" to review it thoroughly, but I'll make two comments now:
@jkbradley Thanks for the comment. I thought about ...
True, if minSupport can be shared, then that's OK. confidence won't be shared though.
@jkbradley As far as I remember some variants of ... Somewhat related: could you take a look at SPARK-19899?
Test build #74534 has finished for PR 17218 at commit
Force-pushed from bebb363 to 9074312.
Test build #74630 has finished for PR 17218 at commit
Test build #74632 has finished for PR 17218 at commit
Force-pushed from 9bde018 to 0a3798d.
Note: should be retested after #17321 is resolved.
Test build #74827 has finished for PR 17218 at commit
Jenkins retest this please.
Test build #74895 has finished for PR 17218 at commit
Test build #74898 has finished for PR 17218 at commit
@jkbradley I think this is ready for review.
Sure, I can take a look. Let me ping @MLnick too, since he marked himself as shepherd.
I'm only partly done reviewing, but I'll go ahead and send some comments. Thanks for the PR!
python/pyspark/ml/fpm.py
Outdated
""" | ||
Sets the value of :py:attr:`minSupport`. | ||
""" | ||
if not 0 <= value <= 1: |
This check happens on the Scala side; let's not replicate it here.
To be honest I don't like this approach, so I'll try to make the case for keeping this as-is.
If we depend on the Scala checks, we fail late, because validation is delayed until the point where transform is called. If that happens in the middle of a complex pipeline, it is simply expensive, so my opinion is that if we can fail early without significant overhead, we should.
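For illustration, a minimal plain-Python sketch of the fail-early pattern being argued for (hypothetical helper names; the actual PySpark setter delegates to `_set` on a `Params` object rather than a dict):

```python
# Hypothetical sketch of the fail-early pattern: validate at set time rather
# than when the (possibly long) pipeline finally executes on the JVM side.
def set_min_support(params, value):
    """Store minSupport in a plain dict, rejecting out-of-range values eagerly."""
    if not 0 <= value <= 1:
        raise ValueError("Support must be in range [0, 1]")
    params["minSupport"] = value
    return params

params = set_min_support({}, 0.3)   # accepted immediately
try:
    set_min_support({}, 1.5)        # rejected now, not mid-pipeline
except ValueError as e:
    error = str(e)
```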
python/pyspark/ml/fpm.py
Outdated
class HasConfidence(Params):
    """
    Mixin for param confidence: [0.0, 1.0].
Omit the range here too.
python/pyspark/ml/fpm.py
Outdated
""" | ||
Sets the value of :py:attr:`minConfidence`. | ||
""" | ||
if not 0 <= value <= 1: |
ditto
python/pyspark/ml/fpm.py
Outdated
minConfidence = Param(
    Params._dummy(),
    "minConfidence",
    "Minimal confidence for generating Association Rule. [0.0, 1.0]",
Match Scala doc: "Note that minConfidence has no effect during fitting."
python/pyspark/ml/fpm.py
Outdated
""" | ||
|
||
itemsCol = Param(Params._dummy(), "itemsCol", | ||
"items column name.", typeConverter=TypeConverters.toString) |
Remove the period "." from the end of the doc string here.
python/pyspark/ml/fpm.py
Outdated
itemsCol = Param(Params._dummy(), "itemsCol",
                 "items column name.", typeConverter=TypeConverters.toString)

def __init__(self):
No need for this. The default will be set in FPGrowth.
    return self.getOrDefault(self.itemsCol)


class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):
Mark Experimental
Also, it'd be good to be able to set minConfidence, itemsCol and predictionCol (for associationRules and transform)
I pushed my first attempt, but I think it will require a bit more discussion. If we enable this here, should we do the same for the rest of the Python models?
python/pyspark/ml/fpm.py
Outdated
    return self._call_java("associationRules")


class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,
Mark Experimental
python/pyspark/ml/fpm.py
Outdated
@property
@since("2.2.0")
def freqItemsets(self):
    """DataFrame with two columns:
Python style: put triple-quotes on a line by themselves (here and elsewhere below)
Done.
Side note: should we add this to https://spark.apache.org/contributing.html? (PEP8 recommends that only the closing quotes be placed on a separate line.)
Issue this PR brought up: ...
Test build #75219 has finished for PR 17218 at commit
Thanks for the updates!
> Also, it'd be good to be able to set minConfidence, itemsCol and predictionCol (for associationRules and transform)

> I pushed my first attempt, but I think it will require a bit more discussion. If we enable this here, should we do the same for the rest of the Python models?
True, we should do it for all models. And you're right that it's more involved than I was thinking. Specifically, rather than calling setParams from _create_model, I'd want us to call _copyValues from fit() in order to eliminate duplicate code. Would you mind removing the Params from the model, and we can work on adding them in more carefully for the next release? Thanks a lot!
I dug up the existing JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-10931
> Side note: Should we add it to https://spark.apache.org/contributing.html (PEP8 recommends only the closing quote to be placed in a separate line).
I would say yes...except I see it is inconsistent elsewhere in Spark. I guess I won't push for it anymore.
"pyspark.ml.classification",
"pyspark.ml.clustering",
"pyspark.ml.evaluation",
"pyspark.ml.feature",
"pyspark.ml.fpm",
"pyspark.ml.linalg.__init__",
"pyspark.ml.recommendation",
"pyspark.ml.regression",
"pyspark.ml.tuning",
"pyspark.ml.tests",
As long as you're at it, switch tuning & tests to alphabetize them
Sure thing. I thought there was some logic in putting tests last. Should I reorder the other modules as well?
Interesting...maybe? I guess it doesn't really matter, so no need to rearrange more.
python/pyspark/ml/fpm.py
Outdated
""" | ||
Sets the value of :py:attr:`minSupport`. | ||
""" | ||
if not (0 <= value <= 1): |
On this topic, I agree with you that not checking here could currently cause late failures in a Pipeline. However, I think the right fix for this is to add PipelineStage and transformSchema() to Python. I just made a JIRA for it: https://issues.apache.org/jira/browse/SPARK-20099
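For illustration, a minimal sketch of what such an up-front schema pass could look like (plain Python, all names hypothetical; SPARK-20099 tracks the real design of `PipelineStage`/`transformSchema()` for Python):

```python
# Hypothetical sketch of an up-front schema check: each stage declares the
# columns it needs and the columns it adds, so a misconfigured pipeline
# fails before any expensive computation starts.
def transform_schema(stages, input_columns):
    """stages: list of (name, required_cols, added_cols); returns final column set."""
    schema = set(input_columns)
    for name, required, added in stages:
        missing = set(required) - schema
        if missing:
            raise ValueError("Stage %r is missing columns: %s" % (name, sorted(missing)))
        schema |= set(added)
    return schema

stages = [
    ("tokenizer", ["text"], ["items"]),
    ("fpgrowth", ["items"], ["prediction"]),
]
final = transform_schema(stages, ["text"])  # passes: column flow is consistent
```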
python/pyspark/ml/fpm.py
Outdated
minConfidence = Param(
    Params._dummy(),
    "minConfidence",
    """"Minimal confidence for generating Association Rule. [0.0, 1.0]
Extra quotes here. Does this come out formatted correctly?
python/pyspark/ml/fpm.py
Outdated
                    HasConfidence, HasItemsCol, HasPredictionCol):
    """Model fitted by FPGrowth.

    .. note:: Experimental
Put first in doc string (See examples elsewhere)
python/pyspark/ml/fpm.py
Outdated
""" | ||
Data with three columns: | ||
* `antecedent` - Array of the same type as the input column. | ||
* `consequent` - Single element array of the same type as the input column. |
I just realized: If we're leaving open the possibility of returning multiple elements here in the future, then let's not document that this has a single element (else it effectively becomes a guarantee in the API).
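For background, the confidence of a rule antecedent => consequent is support(antecedent and consequent) / support(antecedent). A tiny pure-Python sketch of that computation over a toy dataset (illustrative only, not Spark's implementation):

```python
# Illustrative computation of rule confidence over a toy transaction list;
# Spark derives the same quantity from the fitted model's frequent itemsets.
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [["r", "z"], ["z"], ["r", "z"], ["r"]]
conf = confidence(transactions, ["r"], ["z"])  # 2 of the 3 'r' baskets contain 'z'
```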
python/pyspark/ml/fpm.py
Outdated
.. [LI2008] http://dx.doi.org/10.1145/1454008.1454027
.. [HAN2000] http://dx.doi.org/10.1145/335191.335372

.. note:: Experimental
I didn't see this before, so now this is noted twice. Just put it once at the beginning of the docstring.
Test build #75230 has finished for PR 17218 at commit
python/pyspark/ml/fpm.py
Outdated
                     typeConverter=TypeConverters.toFloat)

def setMinSupport(self, value):
    """
    Sets the value of :py:attr:`minSupport`.
    """
    if not (0 <= value <= 1):
        raise ValueError("Support must be in range [0, 1]")
    return self._set(minSupport=value)
This removed too much! This line should remain
Test build #75237 has finished for PR 17218 at commit
Jenkins retest this please.
I removed the code and I'll be following SPARK-10931. One possible challenge (here and for parameter validation) is the high latency of Py4J calls. With large pipelines it can build up pretty fast.
Test build #75241 has finished for PR 17218 at commit
LGTM
@indyragandy What do you mean by "get directly"?
## What changes were proposed in this pull request?
Follow-up for #17218, some minor fix for PySpark `FPGrowth`.
## How was this patch tested?
Existing UT.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #18089 from yanboliang/spark-19281.
(cherry picked from commit 913a6bf)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
What changes were proposed in this pull request?
Adds `HasSupport` and `HasConfidence` `Params`, the `pyspark.ml.fpm` module, and `FPGrowth` / `FPGrowthModel` wrappers.
How was this patch tested?
Unit tests.
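For intuition, the frequent-itemset mining that the new wrappers expose can be sketched with brute-force counting in plain Python (illustrative only; Spark's `FPGrowth` implements the FP-tree algorithm and runs distributed over a DataFrame of item arrays):

```python
from itertools import combinations

# Brute-force frequent-itemset counting, for intuition only. Spark's FPGrowth
# computes the same result with the FP-tree algorithm, without enumerating
# every candidate itemset.
def freq_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets whose support meets the threshold."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        t = sorted(set(t))
        for size in range(1, len(t) + 1):
            for combo in combinations(t, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c for s, c in counts.items() if c / n >= min_support}

transactions = [["r", "z", "h"], ["z", "y", "x"], ["z"], ["r", "x"]]
frequent = freq_itemsets(transactions, min_support=0.5)
```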