Skip to content

[SPARK-1655][MLLIB] WIP Add option for distributed naive bayes model. #2491

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

staple
Copy link
Contributor

@staple staple commented Sep 22, 2014

Adds an option to store a naive bayes model distributively. The default behavior, in which the whole model is stored on the driver node, remains unchanged. NaiveBayes.train’s new distMode parameter can be used to request that a model be distributed. (This is in the spirit of RowMatrix.computeSVD's mode parameter.)

@@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable {
def trainNaiveBayes(
data: JavaRDD[LabeledPoint],
lambda: Double): java.util.List[java.lang.Object] = {
val model = NaiveBayes.train(data.rdd, lambda)
// val model = NaiveBayes.train(data.rdd, lambda, "local")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled the python interface in this PR for now. Let’s figure out the scala implementation first.

@staple
Copy link
Contributor Author

staple commented Sep 22, 2014

Hi - Does this seem like a reasonable approach for SPARK-1655?

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have started for PR 2491 at commit 4594761.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2491 at commit 4594761.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LabelAggregate(label: Double, numDocuments: Long, sumFeatures: BDV[Double])

@staple
Copy link
Contributor Author

staple commented Sep 22, 2014

Hi - The QA tests failed in python because I disabled the naive bayes python api in order to focus on approval of the scala implementation first. (I mentioned this in a comment above as well.)

@staple
Copy link
Contributor Author

staple commented Oct 3, 2014

@mengxr - I’m sure you have a lot on your plate right now, but I wanted to check in on this PR. Overall, how do you feel about this approach for SPARK-1655 (distributed naive bayes model)? I’m happy to implement it differently if you’d prefer.

@mengxr
Copy link
Contributor

mengxr commented Oct 3, 2014

@staple Sorry for late response and thank you for working on this JIRA! For the best practice, before you start working on a JIRA, please first ask on the JIRA page and see whether someone else is working or plans to work on the same JIRA, to avoid duplicate effort. Discussing the design before sending out PR is also encouraged. I just assigned you to the JIRA.

The algorithm looks good to me. Some general comments:

  1. The data to NB is usually sparse. I'm not sure whether grouping the conditional probabilities helps performance.
  2. In the implementation of predict, the output RDD[Double] doesn't have the same partitioner as the input data. Though the ordering doesn't change, it is still hard to inspect the result. I suggested adding predictValues, which takes RDD[(K, Vector)] and output RDD[(K, Double)], so user can put either id or label in the key, and we can preserve the input partitioner.

@staple
Copy link
Contributor Author

staple commented Oct 3, 2014

@mengxr Sorry about that, in the future I’ll follow the best practice you’ve outlined.

Here are the take-aways from my perspective:

  • Investigate use of sparse storage for the conditional distribution. I believe the existing implementation in master uses dense conditional distribution matrices, but sparse is obviously possible.
  • Remove grouping of conditional probabilities, as it adds complexity and you mentioned you aren’t sure if it will help performance.
  • Add support for predictValues with consistent partitioning.

I’ll look into all these. Thanks for your feedback!

@mengxr
Copy link
Contributor

mengxr commented Oct 3, 2014

@staple The conditional distribution matrix may not be sparse. That is why we use dense format to store it. Maybe we can do a hard thresholding to make it parse, but this should be in a separate PR. Let's focus on the second and the third in this PR.

@staple
Copy link
Contributor Author

staple commented Oct 3, 2014

@mengxr Sorry I misunderstood your comment on that first point. I'll just do the 2nd and 3rd.

@staple
Copy link
Contributor Author

staple commented Oct 4, 2014

@mengxr Ok, updated to address your suggestions.

@SparkQA
Copy link

SparkQA commented Oct 4, 2014

QA tests have started for PR 2491 at commit e535d8b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 4, 2014

QA tests have finished for PR 2491 at commit e535d8b.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class NaiveBayesModel extends ClassificationModel with Serializable

@staple
Copy link
Contributor Author

staple commented Oct 4, 2014

Again, python tests failed because the python interface is disabled in order to focus on the scala implementation first.

@davies
Copy link
Contributor

davies commented Oct 7, 2014

It's better to have WIP in title if you still work in progress.

@staple staple changed the title [SPARK-1655][MLLIB] Add option for distributed naive bayes model. [SPARK-1655][MLLIB] WIP Add option for distributed naive bayes model. Oct 7, 2014
@staple
Copy link
Contributor Author

staple commented Oct 7, 2014

@davies, sure changed the title

@srowen
Copy link
Member

srowen commented Mar 2, 2015

@staple is this still an active PR? just trying to figure out if it's stale and can be closed.

@staple
Copy link
Contributor Author

staple commented Mar 3, 2015

@srowen I think this stalled because I was anticipating some additional feedback on the scala implementation before adding python compatibility. But looking things over I think I should just go ahead and add the rest of the implementation to move from WIP to formal PR. And I will have time to do that in the near future, so let's keep this PR open for now please. Thanks for the ping!

@srowen
Copy link
Member

srowen commented Jul 28, 2015

If this still hasn't progressed ~5 months later, do you mind closing this PR?

@asfgit asfgit closed this in 0d9ab01 Sep 15, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants