[SPARK-19825][R][ML] spark.ml R API for FPGrowth #17170
Conversation
Test build #73956 has finished for PR 17170 at commit
Test build #73957 has finished for PR 17170 at commit
Test build #73959 has finished for PR 17170 at commit
Test build #73961 has finished for PR 17170 at commit
Test build #73962 has finished for PR 17170 at commit
Test build #73963 has finished for PR 17170 at commit
Test build #73966 has finished for PR 17170 at commit
Test build #74021 has finished for PR 17170 at commit
Test build #74022 has finished for PR 17170 at commit
Test build #74030 has finished for PR 17170 at commit
Thanks, I think a couple of big changes are in order. Let me know when it is ready for review again.
R/pkg/DESCRIPTION
Outdated
@@ -54,5 +55,5 @@ Collate:
'types.R'
'utils.R'
'window.R'
RoxygenNote: 5.0.1
RoxygenNote: 6.0.1
let's revert this - the new roxygen2 seems to have some new features we are not ready for yet
R/pkg/R/generics.R
Outdated
#' @rdname spark.fpGrowth
#' @export
setGeneric("freqItemsets", function(object) { standardGeneric("freqItemsets") })
we seem to follow the pattern `spark.something` - see LDA. Do you think it makes sense here too?
R/pkg/R/mllib_fpm.R
Outdated
#' }
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.3, minConfidence = 0.8,
should it have `numPartitions`?
Done.
R/pkg/R/mllib_fpm.R
Outdated
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.3, minConfidence = 0.8,
                   featuresCol = "features", predictionCol = "prediction") {
instead of `features`, should it take a formula?
we generally avoid allowing `predictionCol` to be set, too
what about here - thoughts?
To be honest I am not sure. If you think that setting `predictionCol` should be disabled, I am fine with that, but I don't see how formulas could be useful here. `FPGrowth` doesn't really conform to the conventions used in other ML algorithms: it doesn't use vectors, and fixed-size buckets are unlikely to happen.
I believe the `predictionCol` param only allows you to change the name of the column; the prediction column is always still going to be there, no?
R/pkg/R/mllib_fpm.R
Outdated
}
jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit",
                    data@sdf, minSupport, minConfidence,
you may want to call `as.numeric` on `minSupport` and `minConfidence`, in case someone passes in an integer and `callJStatic` fails to match the wrapper method
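The coercion suggested here could look roughly like the following sketch (the helper name is hypothetical and for illustration only, not the exact PR code):

```r
# Hypothetical helper: coerce numeric-like arguments before calling into the
# JVM, so an integer literal such as 1L still matches the Double-typed
# signature of the Scala wrapper method.
coerceFPGrowthParams <- function(minSupport, minConfidence) {
  list(minSupport = as.numeric(minSupport),
       minConfidence = as.numeric(minConfidence))
}
```

In R, `as.numeric(1L)` yields a double, which is what the JVM-side method matching expects.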
Done.
R/pkg/R/mllib_fpm.R
Outdated
#' @note FPGrowthModel since 2.2.0
setClass("FPGrowthModel", slots = list(jobj = "jobj"))
#' FPGrowth Model
could you use the long form name (e.g. look at LDA) and drop the word "Model", which we avoid using?
Do you mean `spark.FPGrowth`? I can, but as far as I can tell all classes use the `Model` suffix (`GeneralizedLinearRegressionModel`, `GaussianMixtureModel`, `LDAModel`, and so on) and none uses the `spark.` prefix.
Or do you mean `representation` instead of `slots`? I believe `representation` is no longer recommended.
I mean this:
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_clustering.R#L467
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_clustering.R#L316
which may or may not include the word "model".
R/pkg/R/mllib_fpm.R
Outdated
#' @examples
#' \dontrun{
#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d"))
#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features")
instead of duplicating `createDataFrame`, set `itemsets <- createDataFrame(data.frame(features = c("a,b", "a,b,c", "c,d")))`.
By the way, do we have real data to use instead?
Yes, we do. Adjusted.
R/pkg/R/mllib_fpm.R
Outdated
#'
#' # Show frequent itemsets
#' frequent_itemsets <- freqItemsets(model)
#' showDF(frequent_itemsets)
collapse this to `head(freqItemsets(model))`
freq = c(2, 2, 3, 3, 4)
)
expect_equivalent(expected_itemsets, collect(freqItemsets(model)))
don't repeat `freqItemsets(model)`; use `itemsets` from above
import org.apache.hadoop.fs.Path
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
do we need these?
We can skip `import org.json4s._` if we won't do any parsing, but `import org.json4s.jackson.JsonMethods._` provides both `render` and `compact`, which are used to create the JSON metadata.
Test build #74111 has finished for PR 17170 at commit
Test build #74113 has finished for PR 17170 at commit
Test build #74114 has finished for PR 17170 at commit
Test build #74115 has finished for PR 17170 at commit
Test build #74119 has finished for PR 17170 at commit
Test build #74123 has finished for PR 17170 at commit
Test build #74132 has finished for PR 17170 at commit
Test build #74145 has finished for PR 17170 at commit
@felixcheung I think I addressed all the issues excluding
Sure - it's a single column called
I think that ALS sets a precedent for using
To clarify with your example: with ALS we have userCol and ratingCol - these match the API names in spark.ml, and I think we need to do the same here. What don't you like about
generally good, a few last comments
R/pkg/R/mllib_fpm.R
Outdated
#' @note FPGrowthModel since 2.2.0
setClass("FPGrowthModel", slots = list(jobj = "jobj"))
#' FPGrowth
I think we discussed this - let's make the title `FP-Growth` or `Frequent Pattern Mining` (https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html)
was #17170 (comment)
R/pkg/R/mllib_fpm.R
Outdated
#'
#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in
#' Li et al., PFP: Parallel FP-Growth for Query
#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>.
can you check if this generates the doc properly? `<\url{http://dx.doi.org/10.1145/1454008.1454027}>` - generally it should be `\href{http://...}{Text}`
It does render the link as expected, but linking ML docs is indeed a better choice.
R/pkg/R/mllib_fpm.R
Outdated
#' PFP distributes computation in such a way that each worker executes an
#' independent group of mining tasks. The FP-Growth algorithm is described in
#' Han et al., Mining frequent patterns without
#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>.
ditto here for the url. In fact, I'm not sure we need to include all the links here; instead, link to
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
Sounds good. I'll link the docs.
R/pkg/R/mllib_fpm.R
Outdated
#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5,
#'                                 itemsCol = "baskets", numPartitions = 10)
#' }
#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning}
we don't generally use this tag. Do you want to move it to `@seealso`, or just link to it in the description above?
I'll remove it completely and just link to the docs.
R/pkg/R/mllib_fpm.R
Outdated
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.3, minConfidence = 0.8,
                   itemsCol = "items", numPartitions = -1) {
`numPartitions` by default is not set in Scala - let's default this to NULL here instead, but do not `as.integer` if the value is NULL; something like
numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions)
before passing to the JVM side
.setMinConfidence(minConfidence)
.setItemsCol(itemsCol)
if (numPartitions != null && numPartitions > 0) {
given the earlier suggestion, we should also check numPartitions > 0 in R before passing it here
If you feel it is necessary. Personally, I wanted to treat any non-strictly-positive number as null.
class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter {
  override protected def saveImpl(path: String): Unit = {
    val modelPath = new Path(path, "model").toString
    val rMetadataPath = new Path(path, "rMetadata").toString
anything else we could add as metadata that is not in the model already?
I don't think so. The model captures all the parameters.
Test build #75007 has finished for PR 17170 at commit
Test build #75008 has finished for PR 17170 at commit
R/pkg/R/mllib_fpm.R
Outdated
stop("minConfidence should be a number [0, 1].")
}
numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions)
as in 6522916#r107011745, we should check numPartitions too? How about changing it to:
if (!is.null(numPartitions)) {
  numPartitions <- as.integer(numPartitions)
  stopifnot(numPartitions > 0)
}
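Put together, the NULL default and the positivity check discussed in this thread might read as follows (a sketch under the assumptions above; the helper name is hypothetical, not the PR's actual code):

```r
# Hypothetical sketch combining the two suggestions: NULL means "unset"
# (so the Scala-side default applies), and any explicit value must be a
# strictly positive integer.
normalizeNumPartitions <- function(numPartitions) {
  if (is.null(numPartitions)) {
    NULL                          # leave unset; JVM side keeps its default
  } else {
    numPartitions <- as.integer(numPartitions)
    stopifnot(numPartitions > 0)  # reject zero or negative partition counts
    numPartitions
  }
}
```

This keeps the R-side validation in one place, so the `numPartitions > 0` check never runs on a NULL value.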
also, would you be updating the R vignettes, ML programming guide, and examples?
Let's make it a separate task. For the ML guide we have to wait for #17130 anyway.
R/pkg/R/mllib_fpm.R
Outdated
# Get association rules.
#' @return A DataFrame with association rules.
let's document the list of columns like in Python: https://github.com/apache/spark/pull/17218/files#diff-b6dbf16870bd2cca9b4140df8aebd681R121
for reference, see https://github.com/apache/spark/blob/master/R/pkg/R/mllib_clustering.R#L249
"1,3" | ||
))), "split(items, ',') as items") | ||
|
||
model <- spark.fpGrowth(data, minSupport = 0.3, minConfidence = 0.8, numPartitions = 1) |
we need to add a test for when numPartitions is not set...
.setMinConfidence(minConfidence)
.setItemsCol(itemsCol)
if (numPartitions != null && numPartitions > 0) {
and this comment #17170 (comment)
and #17170 (comment)
Test build #75275 has finished for PR 17170 at commit
Test build #75276 has finished for PR 17170 at commit
@felixcheung Looks like some issue with structured streaming: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75276/
Jenkins, retest this please |
@zero323 how about #17170 (comment)?
Test build #75293 has finished for PR 17170 at commit
Test build #75320 has finished for PR 17170 at commit
R/pkg/R/mllib_fpm.R
Outdated
@@ -99,7 +99,10 @@ setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
# Get frequent itemsets.
#' @param object a fitted FPGrowth model.
#' @return A DataFrame with frequent itemsets.
#' @return A \code{DataFrame} with frequent itemsets.
Actually, sorry, we need to change `DataFrame` to `SparkDataFrame` in R.
Test build #75366 has finished for PR 17170 at commit
merged to master. |
Of course. Do we have / need a JIRA ticket for that?
For this it's optional, but I opened one for tracking purposes: https://issues.apache.org/jira/browse/SPARK-20208
What changes were proposed in this pull request?
Adds SparkR API for FPGrowth (SPARK-19825):
- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- `org.apache.spark.ml.r.FPGrowthWrapper`
How was this patch tested?
Feature-specific unit tests.
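For context, here is a minimal end-to-end usage sketch of the API added by this PR. It assumes a running Spark session; the generic names follow the `spark.*` pattern settled on in the review above, and the exact signatures may differ in the released SparkR version:

```r
library(SparkR)
sparkR.session()

# Build a SparkDataFrame with an array<string> "items" column, following the
# example data used in this thread.
df <- selectExpr(
  createDataFrame(data.frame(items = c("a,b", "a,b,c", "c,d"))),
  "split(items, ',') as items")

# Fit the FP-Growth model; numPartitions is left unset, so the Scala-side
# default applies.
model <- spark.fpGrowth(df, minSupport = 0.3, minConfidence = 0.8)

# Inspect the results: frequent itemsets with their frequencies, and
# association rules with antecedent, consequent, and confidence columns.
head(spark.freqItemsets(model))
head(spark.associationRules(model))
```

The model can also be persisted with `write.ml(model, path)` and reloaded with `read.ml(path)`, which is what the `FPGrowthWrapperWriter` discussed above supports.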