-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-1655][MLLIB] WIP Add option for distributed naive bayes model. #2491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable { | |||
def trainNaiveBayes( | |||
data: JavaRDD[LabeledPoint], | |||
lambda: Double): java.util.List[java.lang.Object] = { | |||
val model = NaiveBayes.train(data.rdd, lambda) | |||
// val model = NaiveBayes.train(data.rdd, lambda, "local") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disabled the python interface in this PR for now. Let’s figure out the scala implementation first.
Hi - Does this seem like a reasonable approach for SPARK-1655? |
QA tests have started for PR 2491 at commit
|
QA tests have finished for PR 2491 at commit
|
Hi - The QA tests failed in python because I disabled the naive bayes python api in order to focus on approval of the scala implementation first. (I mentioned this in a comment above as well.) |
@mengxr - I’m sure you have a lot on your plate right now, but I wanted to check in on this PR. Overall, how do you feel about this approach for SPARK-1655 (distributed naive bayes model)? I’m happy to implement it differently if you’d prefer. |
@staple Sorry for late response and thank you for working on this JIRA! For the best practice, before you start working on a JIRA, please first ask on the JIRA page and see whether someone else is working or plans to work on the same JIRA, to avoid duplicate effort. Discussing the design before sending out PR is also encouraged. I just assigned you to the JIRA. The algorithm looks good to me. Some general comments:
|
@mengxr Sorry about that, in the future I’ll follow the best practice you’ve outlined. Here are the take-aways from my perspective:
I’ll look into all these. Thanks for your feedback! |
@staple The conditional distribution matrix may not be sparse. That is why we use dense format to store it. Maybe we can do a hard thresholding to make it parse, but this should be in a separate PR. Let's focus on the second and the third in this PR. |
@mengxr Sorry I misunderstood your comment on that first point. I'll just do the 2nd and 3rd. |
@mengxr Ok, updated to address your suggestions. |
QA tests have started for PR 2491 at commit
|
QA tests have finished for PR 2491 at commit
|
Again, python tests failed because the python interface is disabled in order to focus on the scala implementation first. |
It's better to have |
@davies, sure changed the title |
@staple is this still an active PR? just trying to figure out if it's stale and can be closed. |
@srowen I think this stalled because I was anticipating some additional feedback on the scala implementation before adding python compatibility. But looking things over I think I should just go ahead and add the rest of the implementation to move from WIP to formal PR. And I will have time to do that in the near future, so let's keep this PR open for now please. Thanks for the ping! |
If this still hasn't progressed ~5 months later, do you mind closing this PR? |
Adds an option to store a naive bayes model distributively. The default behavior, in which the whole model is stored on the driver node, remains unchanged. NaiveBayes.train’s new distMode parameter can be used to request that a model be distributed. (This is in the spirit of RowMatrix.computeSVD's mode parameter.)