
[SPARK-19825][R][ML] spark.ml R API for FPGrowth #17170

Closed
wants to merge 5 commits

Conversation

zero323
Member

@zero323 zero323 commented Mar 6, 2017

What changes were proposed in this pull request?

Adds SparkR API for FPGrowth: SPARK-19825:

  • spark.fpGrowth for model training.
  • freqItemsets and associationRules methods with new corresponding generics.
  • Scala helper: org.apache.spark.ml.r.FPGrowthWrapper.
  • Unit tests.

How was this patch tested?

Feature specific unit tests.
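For context, a usage sketch of the proposed API (illustrative only: it needs an active SparkR session, the sample data is made up, and freqItemsets/associationRules are the generic names proposed in this PR):

```r
# Illustrative sketch of the API proposed in this PR; requires SparkR >= 2.2.0
# with an active Spark session.
library(SparkR)
sparkR.session()

# Each row holds one transaction as an array column named "items".
df <- selectExpr(
  createDataFrame(data.frame(items = c("a,b", "a,b,c", "c,d"))),
  "split(items, ',') as items"
)

model <- spark.fpGrowth(df, minSupport = 0.3, minConfidence = 0.8)

head(freqItemsets(model))       # frequent itemsets with their support counts
head(associationRules(model))   # antecedent/consequent rules
```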

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73956 has finished for PR 17170 at commit 641fe70.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper]
  • class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73957 has finished for PR 17170 at commit b53963a.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 changed the title [SPARK-19825][R][ML] spark.ml R API for FPGrowth [SPARK-19825][WIP][R][ML] spark.ml R API for FPGrowth Mar 6, 2017
@SparkQA

SparkQA commented Mar 6, 2017

Test build #73959 has finished for PR 17170 at commit 021cd9b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73961 has finished for PR 17170 at commit b198dfa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73962 has finished for PR 17170 at commit 03789d6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73963 has finished for PR 17170 at commit d50f917.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73966 has finished for PR 17170 at commit 6554384.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper]
  • class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74021 has finished for PR 17170 at commit 0b34d03.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper]
  • class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74022 has finished for PR 17170 at commit 86f96a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper]
  • class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74030 has finished for PR 17170 at commit 6c0aea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 changed the title [SPARK-19825][WIP][R][ML] spark.ml R API for FPGrowth [SPARK-19825][R][ML] spark.ml R API for FPGrowth Mar 6, 2017
Member

@felixcheung felixcheung left a comment

thanks, I think a couple of big changes are in order. Let me know when it is ready for review again

@@ -54,5 +55,5 @@ Collate:
'types.R'
'utils.R'
'window.R'
RoxygenNote: 5.0.1
RoxygenNote: 6.0.1
Member

let's revert this - new roxygen2 seems to have some new features we are not ready for yet


#' @rdname spark.fpGrowth
#' @export
setGeneric("freqItemsets", function(object) { standardGeneric("freqItemsets") })
Member

we seem to follow the pattern spark.something - see LDA. Do you think it makes sense here too?

#' }
#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
function(data, minSupport = 0.3, minConfidence = 0.8,
Member

should it have numPartitions?

Member Author

Done.

#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
function(data, minSupport = 0.3, minConfidence = 0.8,
featuresCol = "features", predictionCol = "prediction") {
Member

instead of features it should take a formula?

Member

we generally avoid allow setting predictionCol too

Member

how about here - thoughts?

Member Author

To be honest I am not sure. If you think that setting predictionCol should be disabled I am fine with that but I don't see how formulas could be useful here. FPGrowth doesn't really conform to the conventions used in other ML algorithms. It doesn't use vectors and fixed size buckets are unlikely to happen.

Member

I believe the predictionCol param only allows you to change the name of the column - prediction is always still going to be there, no?

}

jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit",
data@sdf, minSupport, minConfidence,
Member

you may want to as.numeric on minSupport, minConfidence in case someone is passing in an integer and callJStatic would fail to match the wrapper method
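A minimal standalone sketch of why the coercion matters (variable values are hypothetical): an R integer literal such as 1L is not a double, so the JVM-side method expecting Double would not be matched without as.numeric.

```r
# A user might pass an integer literal such as 1L; the JVM wrapper expects
# doubles, so coerce before the callJStatic dispatch.
minConfidence <- 1L
stopifnot(is.integer(minConfidence))   # would fail to match a Double signature

minConfidence <- as.numeric(minConfidence)
stopifnot(is.double(minConfidence))    # now matches the wrapper method
```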

Member Author

Done.

#' @note FPGrowthModel since 2.2.0
setClass("FPGrowthModel", slots = list(jobj = "jobj"))

#' FPGrowth Model
Member

could you use the long form name (eg. look at LDA) and drop the word "Model" which we avoid using

Member Author

Do you mean spark.FPGrowth? I can, but as far as I can tell all classes use the Model suffix (GeneralizedLinearRegressionModel, GaussianMixtureModel, LDAModel, and so on) and none uses a spark prefix.

Or do you mean representation instead of slots? I believe that representation is no longer recommended.


#' @examples
#' \dontrun{
#' itemsets <- data.frame(features = c("a,b", "a,b,c", "c,d"))
#' data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as features")
Member

instead of duplicating createDataFrame, set itemsets <- createDataFrame(data.frame(features = c("a,b", "a,b,c", "c,d")))

btw, do we have real data to use instead?

Member Author

Yes, we do. Adjusted.

#'
#' # Show frequent itemsets
#' frequent_itemsets <- freqItemsets(model)
#' showDF(frequent_itemsets)
Member

collapse this to head(freqItemsets(model))

freq = c(2, 2, 3, 3, 4)
)

expect_equivalent(expected_itemsets, collect(freqItemsets(model)))
Member

don't repeat freqItemsets(model) - use itemsets from above

import org.apache.hadoop.fs.Path
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
Member

do we need these?

Member Author

We can skip import org.json4s._ if we won't do any parsing, but import org.json4s.jackson.JsonMethods._ provides both render and compact, which are used to create the JSON metadata.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74111 has finished for PR 17170 at commit 1014902.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74113 has finished for PR 17170 at commit eb39222.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74114 has finished for PR 17170 at commit bf26f79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74115 has finished for PR 17170 at commit 956a36a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74119 has finished for PR 17170 at commit 6be7f13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74123 has finished for PR 17170 at commit 1949da3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74132 has finished for PR 17170 at commit 71f23ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74145 has finished for PR 17170 at commit 3db1413.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 8, 2017

@felixcheung I think I addressed all the issues except inputCol and predictionCol. In general:

  • Using formula as an input doesn't make sense in my opinion.
  • Personally I would allow users to set column names. Both features and prediction are a bit vague in the context of the algorithm.

@felixcheung
Member

Sure - it's a single column called features, so I'm fine with it as a parameter.
I'm not sure about inputCol though; it's a nomenclature in R that differs from the rest of the MLlib features.

@zero323
Member Author

zero323 commented Mar 9, 2017

I think that ALS sets a precedent for using somethingCol, but I don't like the 'features' part here. Maybe basketsCol - what do you think?

@felixcheung
Member

To clarify with your example: with ALS we have userCol and ratingCol - these match the API names in spark.ml, and I think we need to do the same here.

What don't you like about featuresCol, which is also in the Scala API? featuresCol is really a standard trait (HasFeaturesCol) used in different ML models across Spark.

Member

@felixcheung felixcheung left a comment

generally good, a few last comments

#' @note FPGrowthModel since 2.2.0
setClass("FPGrowthModel", slots = list(jobj = "jobj"))

#' FPGrowth
Member

I think we discussed this - let's make it FP-Growth or Frequent Pattern Mining (https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html) as the title


#'
#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in
#' Li et al., PFP: Parallel FP-Growth for Query
#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>.
Member

can you check if this generates the doc properly
<\url{http://dx.doi.org/10.1145/1454008.1454027}>
generally it should be
\href{http://...}{Text}
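For illustration, the suggested \href form applied to the citation already in the diff (a sketch of the roxygen comment, not the final doc text):

```r
#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is
#' described in \href{http://dx.doi.org/10.1145/1454008.1454027}{Li et al.,
#' PFP: Parallel FP-Growth for Query Recommendation}.
```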

Member Author

It does render the link as expected, but linking ML docs is indeed a better choice.

#' PFP distributes computation in such a way that each worker executes an
#' independent group of mining tasks. The FP-Growth algorithm is described in
#' Han et al., Mining frequent patterns without
#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>.
Member

ditto here for url.
In fact, I'm not sure we need to include all the links here but instead link to
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html

Member Author

Sounds good. I'll link the docs.

#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5,
#' itemsCol = "baskets", numPartitions = 10)
#' }
#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning}
Member

we don't generally use this tag. Do you want to move it to @seealso, or just link to it in the description above?

Member Author

I'll remove it completely and just link to the docs.

#' @note spark.fpGrowth since 2.2.0
setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
function(data, minSupport = 0.3, minConfidence = 0.8,
itemsCol = "items", numPartitions = -1) {
Member

@felixcheung felixcheung Mar 20, 2017

numPartitions by default is not set in Scala - let's default this to NULL instead here

but do not call as.integer if the value is NULL - something like
numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions)
before passing it to the JVM side

.setMinConfidence(minConfidence)
.setItemsCol(itemsCol)

if (numPartitions != null && numPartitions > 0) {
Member

given the earlier suggestion, we should also check numPartitions > 0 in R before passing it here

Member Author

If you feel it is necessary. Personally I wanted to treat any number that is not strictly positive as null.

class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter {
override protected def saveImpl(path: String): Unit = {
val modelPath = new Path(path, "model").toString
val rMetadataPath = new Path(path, "rMetadata").toString
Member

anything else we could add as metadata that is not in the model already?

Member Author

I don't think so. Model captures all the parameters.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #75007 has finished for PR 17170 at commit 999aa7a.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper]
  • class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter

@SparkQA

SparkQA commented Mar 21, 2017

Test build #75008 has finished for PR 17170 at commit 6522916.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper]
  • class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter

stop("minConfidence should be a number [0, 1].")
}

numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions)
Member

as in this comment (6522916#r107011745), we should check numPartitions too?
How about changing it to

if (!is.null(numPartitions)) {
  numPartitions <- as.integer(numPartitions)
  stopifnot(numPartitions > 0)
}
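Wrapped as a hypothetical standalone helper (the name validateNumPartitions is illustrative, not from the PR), the two suggestions combine to: NULL passes through untouched, anything else must coerce to a strictly positive integer.

```r
# Hypothetical helper combining both review suggestions: keep NULL as-is,
# otherwise coerce and reject non-positive values before the JVM call.
validateNumPartitions <- function(numPartitions) {
  if (!is.null(numPartitions)) {
    numPartitions <- as.integer(numPartitions)
    stopifnot(numPartitions > 0)
  }
  numPartitions
}

validateNumPartitions(NULL)   # NULL: parameter left unset on the Scala side
validateNumPartitions(10)     # 10L
```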

@felixcheung
Member

also would you be updating the R vignettes, ML programming guide and example?

@zero323
Member Author

zero323 commented Mar 22, 2017

Let's make it a separate task. For the ML guide we have to wait for #17130 anyway.


# Get association rules.

#' @return A DataFrame with association rules.

"1,3"
))), "split(items, ',') as items")

model <- spark.fpGrowth(data, minSupport = 0.3, minConfidence = 0.8, numPartitions = 1)
Member

we need to add a test when numPartitions is not set...

.setMinConfidence(minConfidence)
.setItemsCol(itemsCol)

if (numPartitions != null && numPartitions > 0) {
Member

and this comment #17170 (comment)


@SparkQA

SparkQA commented Mar 27, 2017

Test build #75275 has finished for PR 17170 at commit 2f49f98.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 27, 2017

Test build #75276 has finished for PR 17170 at commit 8f0e578.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 27, 2017

@felixcheung Looks like some issue with the structured streaming tests: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75276/

@felixcheung
Member

Jenkins, retest this please

@felixcheung
Member

@zero323 how about #17170 (comment)

@SparkQA

SparkQA commented Mar 28, 2017

Test build #75293 has finished for PR 17170 at commit 8f0e578.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 28, 2017

Test build #75320 has finished for PR 17170 at commit 797d68d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -99,7 +99,10 @@ setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"),
# Get frequent itemsets.

#' @param object a fitted FPGrowth model.
#' @return A DataFrame with frequent itemsets.
#' @return A \code{DataFrame} with frequent itemsets.
Member

Actually, sorry - we need to change DataFrame to SparkDataFrame in R

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75366 has finished for PR 17170 at commit 64c07aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

merged to master.
@zero323 could you follow up with vignettes and programming guide update please - we need them for the 2.2.0 release.

@asfgit asfgit closed this in b34f766 Apr 4, 2017
@zero323
Member Author

zero323 commented Apr 4, 2017

Of course. Do we have / need a JIRA ticket for that?

@felixcheung
Member

felixcheung commented Apr 4, 2017

for this, it's optional, but I opened one for tracking purposes: https://issues.apache.org/jira/browse/SPARK-20208

@zero323 zero323 deleted the SPARK-19825 branch April 6, 2017 10:54