[SPARK-19825][R][ML] spark.ml R API for FPGrowth #17170
Conversation
Test build #73956 has finished for PR 17170 at commit
Test build #73957 has finished for PR 17170 at commit
Test build #73959 has finished for PR 17170 at commit
Test build #73961 has finished for PR 17170 at commit
Test build #73962 has finished for PR 17170 at commit
Test build #73963 has finished for PR 17170 at commit
Test build #73966 has finished for PR 17170 at commit
Test build #74021 has finished for PR 17170 at commit
Test build #74022 has finished for PR 17170 at commit
Test build #74030 has finished for PR 17170 at commit
Thanks, I think a couple of big changes are in order. Let me know when it is ready for review again.
R/pkg/DESCRIPTION
Outdated
@@ -54,5 +55,5 @@ Collate:
'types.R'
'utils.R'
'window.R'
RoxygenNote: 5.0.1
RoxygenNote: 6.0.1
let's revert this - the new roxygen2 seems to have some new features we are not ready for yet
R/pkg/R/generics.R
Outdated
#' @rdname spark.fpGrowth
#' @export
setGeneric("freqItemsets", function(object) { standardGeneric("freqItemsets") })
we seem to follow the pattern `spark.something` - see LDA. Do you think it makes sense here too?
R/pkg/R/mllib_fpm.R
Outdated
#' }
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.3, minConfidence = 0.8,
should it have `numPartitions`?
Done.
R/pkg/R/mllib_fpm.R
Outdated
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.3, minConfidence = 0.8,
                   featuresCol = "features", predictionCol = "prediction") {
instead of `features`, should it take a formula?
we generally avoid allowing `predictionCol` to be set, too
what about here - thoughts?
To be honest I am not sure. If you think that setting `predictionCol` should be disabled, I am fine with that, but I don't see how formulas could be useful here. `FPGrowth` doesn't really conform to the conventions used in other ML algorithms: it doesn't use vectors, and fixed-size buckets are unlikely to happen.
I believe the `predictionCol` param only allows you to change the name of the column; the prediction column is always still going to be there, no?
R/pkg/R/mllib_fpm.R
Outdated
}
jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit",
                    data@sdf, minSupport, minConfidence,
you may want to call `as.numeric` on `minSupport` and `minConfidence`, in case someone passes in an integer and `callJStatic` fails to match the wrapper method
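The coercion suggested here could look roughly like the following sketch (the helper name is hypothetical and for illustration only, not the exact PR code):

```r
# Hypothetical helper: coerce numeric-like arguments before calling into the
# JVM, so an integer literal such as 1L still matches the Double-typed
# signature of the Scala wrapper method.
coerceFPGrowthParams <- function(minSupport, minConfidence) {
  list(minSupport = as.numeric(minSupport),
       minConfidence = as.numeric(minConfidence))
}
```

In R, `as.numeric(1L)` yields a double, which is what the JVM-side method matching expects.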
Done.
R/pkg/R/mllib_fpm.R
Outdated
#' @note FPGrowthModel since 2.2.0
setClass("FPGrowthModel", slots = list(jobj = "jobj"))
#' FPGrowth Model
could you use the long form name (e.g. look at LDA) and drop the word "Model", which we avoid using?
Do you mean `spark.FPGrowth`? I can, but as far as I can tell all classes use the `Model` suffix (`GeneralizedLinearRegressionModel`, `GaussianMixtureModel`, `LDAModel`, and so on) and none uses the `spark.` prefix.
Or do you mean `representation` instead of `slots`? I believe `representation` is no longer recommended.
I mean this:
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_clustering.R#L467
https://github.com/apache/spark/blob/master/R/pkg/R/mllib_clustering.R#L316
which may or may not include the word "model".
R/pkg/R/mllib_fpm.R
Outdated
#' @examples
#' \dontrun{
#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d"))
#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features")
instead of duplicating `createDataFrame`, set `itemsets <- createDataFrame(data.frame(features = c("a,b", "a,b,c", "c,d")))`.
By the way, do we have real data to use instead?
Yes, we do. Adjusted.
R/pkg/R/mllib_fpm.R
Outdated
#'
#' # Show frequent itemsets
#' frequent_itemsets <- freqItemsets(model)
#' showDF(frequent_itemsets)
collapse this to `head(freqItemsets(model))`
freq = c(2, 2, 3, 3, 4)
)
expect_equivalent(expected_itemsets, collect(freqItemsets(model)))
don't repeat `freqItemsets(model)`; use `itemsets` from above
import org.apache.hadoop.fs.Path
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
do we need these?
We can skip `import org.json4s._` if we won't do any parsing, but `import org.json4s.jackson.JsonMethods._` provides both `render` and `compact`, which are used to create the JSON metadata.
Test build #74111 has finished for PR 17170 at commit
Test build #74113 has finished for PR 17170 at commit
Test build #74114 has finished for PR 17170 at commit
Test build #74115 has finished for PR 17170 at commit
Test build #74119 has finished for PR 17170 at commit
Test build #74123 has finished for PR 17170 at commit
Test build #74132 has finished for PR 17170 at commit
Test build #74145 has finished for PR 17170 at commit
@felixcheung I think I addressed all the issues excluding
Sure - it's a single column called
I think that ALS sets a precedent for using
To clarify with your example: with ALS we have userCol and ratingCol - these match the API names in spark.ml, and I think we need to do the same here. What don't you like about
generally good, a few last comments
R/pkg/R/mllib_fpm.R
Outdated
#' @note FPGrowthModel since 2.2.0
setClass("FPGrowthModel", slots = list(jobj = "jobj"))
#' FPGrowth
I think we discussed this - let's make the title `FP-Growth` or `Frequent Pattern Mining` (https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html)
was #17170 (comment)
R/pkg/R/mllib_fpm.R
Outdated
#'
#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in
#' Li et al., PFP: Parallel FP-Growth for Query
#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>.
can you check if this generates the doc properly? `<\url{http://dx.doi.org/10.1145/1454008.1454027}>` - generally it should be `\href{http://...}{Text}`
It does render the link as expected, but linking ML docs is indeed a better choice.
R/pkg/R/mllib_fpm.R
Outdated
#' PFP distributes computation in such a way that each worker executes an
#' independent group of mining tasks. The FP-Growth algorithm is described in
#' Han et al., Mining frequent patterns without
#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>.
ditto here for the url. In fact, I'm not sure we need to include all the links here; instead, link to
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
Sounds good. I'll link the docs.
R/pkg/R/mllib_fpm.R
Outdated
#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5,
#'                                 itemsCol = "baskets", numPartitions = 10)
#' }
#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning}
we don't generally use this tag. Do you want to move it to `@seealso`, or just link to it in the description above?
I'll remove it completely and just link to the docs.
R/pkg/R/mllib_fpm.R
Outdated
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.3, minConfidence = 0.8,
                   itemsCol = "items", numPartitions = -1) {
`numPartitions` by default is not set in Scala - let's default this to NULL here instead, but do not `as.integer` if the value is NULL; something like
numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions)
before passing to the JVM side
.setMinConfidence(minConfidence)
.setItemsCol(itemsCol)
if (numPartitions != null && numPartitions > 0) {
given the earlier suggestion, we should also check numPartitions > 0 in R before passing it here
If you feel it is necessary. Personally, I wanted to treat any non-strictly-positive number as null.
class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter {
  override protected def saveImpl(path: String): Unit = {
    val modelPath = new Path(path, "model").toString
    val rMetadataPath = new Path(path, "rMetadata").toString
anything else we could add as metadata that is not in the model already?
I don't think so. The model captures all the parameters.
Test build #75007 has finished for PR 17170 at commit
Test build #75008 has finished for PR 17170 at commit
R/pkg/R/mllib_fpm.R
Outdated
stop("minConfidence should be a number [0, 1].")
}
numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions)
as in 6522916#r107011745, we should check numPartitions too? How about changing it to:
if (!is.null(numPartitions)) {
  numPartitions <- as.integer(numPartitions)
  stopifnot(numPartitions > 0)
}
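Put together, the NULL default and the positivity check discussed in this thread might read as follows (a sketch under the assumptions above; the helper name is hypothetical, not the PR's actual code):

```r
# Hypothetical sketch combining the two suggestions: NULL means "unset"
# (so the Scala-side default applies), and any explicit value must be a
# strictly positive integer.
normalizeNumPartitions <- function(numPartitions) {
  if (is.null(numPartitions)) {
    NULL                          # leave unset; JVM side keeps its default
  } else {
    numPartitions <- as.integer(numPartitions)
    stopifnot(numPartitions > 0)  # reject zero or negative partition counts
    numPartitions
  }
}
```

This keeps the R-side validation in one place, so the `numPartitions > 0` check never runs on a NULL value.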
also, would you be updating the R vignettes, ML programming guide, and examples?
Let's make it a separate task. For the ML guide we have to wait for #17130 anyway.
R/pkg/R/mllib_fpm.R
Outdated
# Get association rules.
#' @return A DataFrame with association rules.
let's document the list of columns like in Python: https://github.com/apache/spark/pull/17218/files#diff-b6dbf16870bd2cca9b4140df8aebd681R121
for reference, see https://github.com/apache/spark/blob/master/R/pkg/R/mllib_clustering.R#L249
"1,3" | ||
))), "split(items, ',') as items") | ||
|
||
model <- spark.fpGrowth(data, minSupport = 0.3, minConfidence = 0.8, numPartitions = 1) |
we need to add a test for when numPartitions is not set...
.setMinConfidence(minConfidence)
.setItemsCol(itemsCol)
if (numPartitions != null && numPartitions > 0) {
and this comment #17170 (comment)
and #17170 (comment)
Test build #75275 has finished for PR 17170 at commit
Test build #75276 has finished for PR 17170 at commit
@felixcheung Looks like some issue with structured streaming: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75276/
Jenkins, retest this please |
@zero323 how about #17170 (comment)?
Test build #75293 has finished for PR 17170 at commit
Test build #75320 has finished for PR 17170 at commit
R/pkg/R/mllib_fpm.R
Outdated
@@ -99,7 +99,10 @@ setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
# Get frequent itemsets.
#' @param object a fitted FPGrowth model.
#' @return A DataFrame with frequent itemsets.
#' @return A \code{DataFrame} with frequent itemsets.
Actually, sorry, we need to change `DataFrame` to `SparkDataFrame` in R.
Test build #75366 has finished for PR 17170 at commit
merged to master. |
Of course. Do we have / need a JIRA ticket for that?
For this it's optional, but I opened one for tracking purposes: https://issues.apache.org/jira/browse/SPARK-20208
What changes were proposed in this pull request?
Adds SparkR API for FPGrowth (SPARK-19825):
- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- `org.apache.spark.ml.r.FPGrowthWrapper`
How was this patch tested?
Feature-specific unit tests.
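For context, here is a minimal end-to-end usage sketch of the API added by this PR. It assumes a running Spark session; the generic names follow the `spark.*` pattern settled on in the review above, and the exact signatures may differ in the released SparkR version:

```r
library(SparkR)
sparkR.session()

# Build a SparkDataFrame with an array<string> "items" column, following the
# example data used in this thread.
df <- selectExpr(
  createDataFrame(data.frame(items = c("a,b", "a,b,c", "c,d"))),
  "split(items, ',') as items")

# Fit the FP-Growth model; numPartitions is left unset, so the Scala-side
# default applies.
model <- spark.fpGrowth(df, minSupport = 0.3, minConfidence = 0.8)

# Inspect the results: frequent itemsets with their frequencies, and
# association rules with antecedent, consequent, and confidence columns.
head(spark.freqItemsets(model))
head(spark.associationRules(model))
```

The model can also be persisted with `write.ml(model, path)` and reloaded with `read.ml(path)`, which is what the `FPGrowthWrapperWriter` discussed above supports.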