[SPARK-1655][MLLIB] WIP Add option for distributed naive bayes model. #2491

staple · 2014-09-22T18:09:41Z

Adds an option to store a naive bayes model distributively. The default behavior, in which the whole model is stored on the driver node, remains unchanged. NaiveBayes.train’s new distMode parameter can be used to request that a model be distributed. (This is in the spirit of RowMatrix.computeSVD's mode parameter.)

staple · 2014-09-22T18:10:10Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

@@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable {
  def trainNaiveBayes(
      data: JavaRDD[LabeledPoint],
      lambda: Double): java.util.List[java.lang.Object] = {
-    val model = NaiveBayes.train(data.rdd, lambda)
+    // val model = NaiveBayes.train(data.rdd, lambda, "local")


I disabled the python interface in this PR for now. Let’s figure out the scala implementation first.

staple · 2014-09-22T18:11:06Z

Hi - Does this seem like a reasonable approach for SPARK-1655?

SparkQA · 2014-09-22T18:14:34Z

QA tests have started for PR 2491 at commit 4594761.

This patch merges cleanly.

SparkQA · 2014-09-22T19:12:13Z

QA tests have finished for PR 2491 at commit 4594761.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class LabelAggregate(label: Double, numDocuments: Long, sumFeatures: BDV[Double])

staple · 2014-09-22T19:18:59Z

Hi - The QA tests failed in python because I disabled the naive bayes python api in order to focus on approval of the scala implementation first. (I mentioned this in a comment above as well.)

staple · 2014-10-03T17:04:51Z

@mengxr - I’m sure you have a lot on your plate right now, but I wanted to check in on this PR. Overall, how do you feel about this approach for SPARK-1655 (distributed naive bayes model)? I’m happy to implement it differently if you’d prefer.

mengxr · 2014-10-03T19:07:31Z

@staple Sorry for the late response, and thank you for working on this JIRA! As a best practice, before you start working on a JIRA, please first ask on the JIRA page whether someone else is working on it or plans to, to avoid duplicate effort. Discussing the design before sending out a PR is also encouraged. I just assigned you to the JIRA.

The algorithm looks good to me. Some general comments:

- The data passed to NB is usually sparse. I'm not sure whether grouping the conditional probabilities helps performance.
- In the implementation of predict, the output RDD[Double] doesn't have the same partitioner as the input data. Though the ordering doesn't change, it is still hard to inspect the result. I suggest adding predictValues, which takes RDD[(K, Vector)] and outputs RDD[(K, Double)], so users can put either an id or a label in the key, and we can preserve the input partitioner.
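For readers following along, here is a minimal sketch of the keyed-prediction idea suggested above. It is not the code from this PR: `KeyedPredictionSketch`, `Model`, and `demoModel` are hypothetical names, and plain Scala collections stand in for RDDs so the example runs without Spark. The scoring rule assumes a multinomial model with log priors `pi` and log conditional probabilities `theta`.

```scala
// Hypothetical sketch, not the PR's implementation: plain collections stand in for RDDs.
object KeyedPredictionSketch {
  // pi(c) is the log prior of class c; theta(c)(j) is the log conditional
  // probability of feature j given class c (multinomial naive Bayes, assumed).
  final case class Model(
      labels: Array[Double],
      pi: Array[Double],
      theta: Array[Array[Double]]) {

    // Returns the label whose log score pi(c) + theta(c) . x is largest.
    def predict(x: Array[Double]): Double = {
      val scores = pi.indices.map { c =>
        pi(c) + theta(c).zip(x).map { case (t, v) => t * v }.sum
      }
      val best = pi.indices.reduceLeft((a, b) => if (scores(b) > scores(a)) b else a)
      labels(best)
    }
  }

  // Keyed prediction: the key (an id or the true label) travels with each
  // prediction, so results can be matched back to their inputs.
  def predictValues[K](model: Model, data: Seq[(K, Array[Double])]): Seq[(K, Double)] =
    data.map { case (k, x) => (k, model.predict(x)) }

  // Toy two-class, two-feature model used only for illustration.
  val demoModel: Model = Model(
    labels = Array(0.0, 1.0),
    pi = Array(math.log(0.5), math.log(0.5)),
    theta = Array(
      Array(math.log(0.8), math.log(0.2)),
      Array(math.log(0.2), math.log(0.8))))

  def main(args: Array[String]): Unit = {
    val docs = Seq("doc-a" -> Array(3.0, 0.0), "doc-b" -> Array(0.0, 2.0))
    predictValues(demoModel, docs).foreach(println)
  }
}
```

On an actual pair RDD, the analogous transformation would be `mapValues`, which keeps the parent RDD's partitioner because the keys are left untouched. A plain `map` over the pairs would discard it.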

staple · 2014-10-03T20:28:59Z

@mengxr Sorry about that, in the future I’ll follow the best practice you’ve outlined.

Here are the take-aways from my perspective:

- Investigate use of sparse storage for the conditional distribution. I believe the existing implementation in master uses dense conditional distribution matrices, but sparse is obviously possible.
- Remove grouping of conditional probabilities, as it adds complexity and you mentioned you aren’t sure if it will help performance.
- Add support for predictValues with consistent partitioning.

I’ll look into all these. Thanks for your feedback!

mengxr · 2014-10-03T21:29:59Z

@staple The conditional distribution matrix may not be sparse. That is why we use a dense format to store it. Maybe we can apply hard thresholding to make it sparse, but this should be in a separate PR. Let's focus on the second and the third in this PR.

staple · 2014-10-03T21:50:54Z

@mengxr Sorry I misunderstood your comment on that first point. I'll just do the 2nd and 3rd.

staple · 2014-10-04T21:37:05Z

@mengxr Ok, updated to address your suggestions.

SparkQA · 2014-10-04T21:39:37Z

QA tests have started for PR 2491 at commit e535d8b.

This patch merges cleanly.

SparkQA · 2014-10-04T22:37:56Z

QA tests have finished for PR 2491 at commit e535d8b.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class NaiveBayesModel extends ClassificationModel with Serializable

staple · 2014-10-04T22:45:02Z

Again, python tests failed because the python interface is disabled in order to focus on the scala implementation first.

davies · 2014-10-07T18:26:28Z

It's better to have WIP in the title if this is still a work in progress.

staple · 2014-10-07T18:45:18Z

@davies Sure, I changed the title.

srowen · 2015-03-02T23:12:16Z

@staple is this still an active PR? just trying to figure out if it's stale and can be closed.

staple · 2015-03-03T02:24:05Z

@srowen I think this stalled because I was anticipating some additional feedback on the scala implementation before adding python compatibility. But looking things over I think I should just go ahead and add the rest of the implementation to move from WIP to formal PR. And I will have time to do that in the near future, so let's keep this PR open for now please. Thanks for the ping!

srowen · 2015-07-28T15:11:24Z

If this still hasn't progressed ~5 months later, do you mind closing this PR?

[SPARK-1655][MLLIB] Add option for distributed naive bayes model.

4594761

staple reviewed Sep 22, 2014

Remove model batching, add predictValues.

e535d8b

staple changed the title ~~[SPARK-1655][MLLIB] Add option for distributed naive bayes model.~~ [SPARK-1655][MLLIB] WIP Add option for distributed naive bayes model. Oct 7, 2014

asfgit closed this in 0d9ab01 Sep 15, 2015
