Commit d01a6d8

leahmcguire authored and jkbradley committed
[SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib

Added optional model type parameter for NaiveBayes training. Can be either Multinomial or Bernoulli.

When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.

Default for model is original Multinomial fit and predict.

Added additional testing for Bernoulli and Multinomial models.

Author: leahmcguire <lmcguire@salesforce.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Leah McGuire <lmcguire@salesforce.com>

Closes apache#4087 from leahmcguire/master and squashes the following commits:

f3c8994 [leahmcguire] changed checks on model type to requires
acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
2224b15 [Leah McGuire] Merge pull request palantir#2 from jkbradley/leahmcguire-master
9ad89ca [Joseph K. Bradley] removed old code
6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added. Updated tests. Also updated ModelType enum-like type.
852a727 [leahmcguire] merged with upstream master
a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NavieBayes
18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
bea62af [leahmcguire] put back in constructor for NaiveBayes
01baad7 [leahmcguire] made fixes from code review
fb0a5c7 [leahmcguire] removed typo
e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
2d0c1ba [leahmcguire] fixed typo in NaiveBayes
c298e78 [leahmcguire] fixed scala style errors
b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
900b586 [leahmcguire] fixed model call so that uses type argument
ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
e016569 [leahmcguire] updated test suite with model type fix
85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
dc65374 [leahmcguire] integrated model type fix
7622b0c [leahmcguire] added comments and fixed style as per rb
b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
76e5b0f [leahmcguire] removed unnecessary sort from test
0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
ce73c63 [leahmcguire] added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
1 parent a05835b commit d01a6d8

4 files changed: +322 -91 lines changed
docs/mllib-naive-bayes.md

Lines changed: 10 additions & 7 deletions
@@ -13,12 +13,15 @@ compute the conditional probability distribution of label given an observation
 and use it for prediction.
 
 MLlib supports [multinomial naive
-Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes),
-which is typically used for [document
-classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
+Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
+and [Bernoulli naive Bayes] (http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
+These models are typically used for [document classification]
+(http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
 Within that context, each observation is a document and each
-feature represents a term whose value is the frequency of the term.
-Feature values must be nonnegative to represent term frequencies.
+feature represents a term whose value is the frequency of the term (in multinomial naive Bayes) or
+a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes).
+Feature values must be nonnegative. The model type is selected with an optional parameter
+"Multinomial" or "Bernoulli" with "Multinomial" as the default.
 [Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
 setting the parameter $\lambda$ (default to $1.0$). For document classification, the input feature
 vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
@@ -32,7 +35,7 @@ sparsity. Since the training data is only used once, it is not necessary to cach
 [NaiveBayes](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
 multinomial naive Bayes. It takes an RDD of
 [LabeledPoint](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional
-smoothing parameter `lambda` as input, and output a
+smoothing parameter `lambda` as input, an optional model type parameter (default is Multinomial), and outputs a
 [NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
 can be used for evaluation and prediction.
 
@@ -51,7 +54,7 @@ val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
 val training = splits(0)
 val test = splits(1)
 
-val model = NaiveBayes.train(training, lambda = 1.0)
+val model = NaiveBayes.train(training, lambda = 1.0, model = "Multinomial")
 
 val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
 val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
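
For readers who want to see how the two model types in this change differ, here is a minimal, Spark-free sketch in plain Scala. It is not the MLlib implementation: `NaiveBayesSketch`, `Model`, `fit`, and `predict` are hypothetical names, and the smoothing formulas are assumed from the textbook description linked in the commit message (additive smoothing with $\lambda$, and a Bernoulli denominator of class document count plus $2\lambda$). The main difference is at prediction time: the multinomial score only uses terms that occur, weighted by frequency, while the Bernoulli score also penalizes absent terms.

```scala
// Hypothetical, Spark-free sketch; none of these names are part of MLlib.
object NaiveBayesSketch {
  final case class Model(
      labels: Array[Double],        // distinct class labels, sorted
      pi: Array[Double],            // log class priors
      theta: Array[Array[Double]],  // log P(term | class), one row per label
      modelType: String)

  // Fit on (label, features) pairs with additive smoothing `lambda`.
  def fit(data: Seq[(Double, Array[Double])], lambda: Double, modelType: String): Model = {
    require(modelType == "Multinomial" || modelType == "Bernoulli",
      s"Unknown model type: $modelType")
    require(data.nonEmpty, "training data must not be empty")
    val numFeatures = data.head._2.length
    val byLabel = data.groupBy(_._1).toSeq.sortBy(_._1)
    val numDocs = data.size.toDouble
    val numLabels = byLabel.size
    val labels = byLabel.map(_._1).toArray
    val pi = byLabel.map { case (_, docs) =>
      math.log(docs.size + lambda) - math.log(numDocs + numLabels * lambda)
    }.toArray
    val theta = byLabel.map { case (_, docs) =>
      // Per-feature sums: term frequencies (multinomial) or document counts (Bernoulli, 0/1 input).
      val sums = Array.tabulate(numFeatures)(j => docs.map(_._2(j)).sum)
      val logDenom = modelType match {
        case "Multinomial" => math.log(sums.sum + numFeatures * lambda)
        case _             => math.log(docs.size + 2.0 * lambda) // assumed Bernoulli smoothing
      }
      sums.map(s => math.log(s + lambda) - logDenom)
    }.toArray
    Model(labels, pi, theta, modelType)
  }

  // Return the label with the highest log posterior score.
  def predict(model: Model, x: Array[Double]): Double = {
    val scores = model.labels.indices.map { c =>
      val logLikelihood = model.modelType match {
        case "Multinomial" =>
          x.indices.map(j => x(j) * model.theta(c)(j)).sum
        case _ =>
          // Present terms add log p; absent terms add log(1 - p).
          x.indices.map { j =>
            val logP = model.theta(c)(j)
            if (x(j) != 0.0) logP else math.log1p(-math.exp(logP))
          }.sum
      }
      model.pi(c) + logLikelihood
    }
    model.labels(scores.indexOf(scores.max))
  }
}
```

As a usage illustration, fitting with `NaiveBayesSketch.fit(docs, lambda = 1.0, modelType = "Bernoulli")` on a few 0/1 feature vectors and then calling `NaiveBayesSketch.predict(model, Array(1.0, 0.0, 0.0))` returns the label whose training documents most often contain the first term and lack the others. With `lambda = 0.0` an unseen term yields a log probability of negative infinity, which is one reason the documentation defaults $\lambda$ to $1.0$.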
