A simple email classification model using the Naive Bayes algorithm
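Suppose that you have a dataset $D = \{x_1, \dots, x_N\}$. Each $x_i$ is a separate email for this assignment. Each of the N data points $x_i$ lies in the pattern space $X$, where $x = (f_1, \dots, f_n) \in X$ and $f_1, \dots, f_n$ are called features.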
You extract features from each data point. Features in an email may include the list of "words" (tokens) in the message
body. The goal in Bayesian classification is to compute two probabilities, P(Spam|x) and P(NonSpam|x), for each
email. The classifier then labels each email as "spam" or "not spam" by choosing the hypothesis with the higher probability.
Naive Bayes assumes that features for x are independent given its class.
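As a rough illustration of the feature-extraction step, the sketch below turns a message body into a set of word tokens. The function name `extract_features` and the tokenization rule (lowercase runs of letters, digits, and apostrophes) are assumptions made for this example, not something the assignment specifies.

```python
import re

def extract_features(body):
    """Return the set of distinct lowercase word tokens in an email body."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    return set(re.findall(r"[a-z0-9']+", body.lower()))

features = extract_features("Win money now! Win a FREE prize.")
print(sorted(features))  # ['a', 'free', 'money', 'now', 'prize', 'win']
```

Using a set records only whether a word occurs in the email, not how often, which matches treating each feature as an event that either occurs or does not.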
P(Spam|x) is difficult to compute in general.
Expand with the definition of conditional probability:

$$P(\mathrm{Spam} \mid x) = \frac{P(x \cap \mathrm{Spam})}{P(x)}$$

Look at the denominator P(x). P(x) equals the probability of a particular email given the universe of all
possible emails. This is very difficult to calculate. But it is just a number between 0 and 1 since it is a
probability. It just "normalizes" $P(x \cap \mathrm{Spam})$. Now look at the numerator $P(x \cap \mathrm{Spam})$.
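First expand x into its features:

$$x = (f_1, f_2, \dots, f_n)$$

Each feature is an event that can occur or not (i.e. the word is in an email or not). So

$$P(x \cap \mathrm{Spam}) = P(f_1 \cap f_2 \cap \cdots \cap f_n \cap \mathrm{Spam})$$

Apply the multiplication theorem (HW2, 1.c) to the second term to give

$$P(f_1 \cap \cdots \cap f_n \cap \mathrm{Spam}) = P(f_1 \mid f_2 \cap \cdots \cap f_n \cap \mathrm{Spam}) \, P(f_2 \mid f_3 \cap \cdots \cap f_n \cap \mathrm{Spam}) \cdots P(f_n \mid \mathrm{Spam}) \, P(\mathrm{Spam})$$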
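But now you are still stuck computing a big product of complicated conditional probabilities. Naive Bayes classification assumes that the features $f_1, \dots, f_n$ of x are conditionally independent given the class. This means that

$$P(f_i \mid f_j \cap \mathrm{Spam}) = P(f_i \mid \mathrm{Spam})$$

if $i \neq j$. In other words, the probability that you observe one feature (i.e. word) is independent of observing another word, given that the email is spam. This is a naive assumption and weakens your model. But you can
now simplify the above to

$$P(\mathrm{Spam} \mid x) = \frac{P(\mathrm{Spam}) \prod_{i=1}^{n} P(f_i \mid \mathrm{Spam})}{P(x)}$$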
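To make the simplified numerator concrete, here is a small sketch that multiplies a class prior by per-word conditional probabilities. All numbers are made up for illustration, and the names `joint_score`, `p_spam`, and `p_word_given_spam` are hypothetical.

```python
import math

# Hypothetical, made-up probabilities for illustration only.
p_spam = 0.4
p_word_given_spam = {"free": 0.30, "prize": 0.20, "meeting": 0.01}

def joint_score(words, prior, p_word_given_class):
    """Return P(H) * prod_i P(f_i | H) for the words present in an email."""
    return prior * math.prod(p_word_given_class[w] for w in words)

# Numerator for an email whose features are the words "free" and "prize".
score = joint_score(["free", "prize"], p_spam, p_word_given_spam)
print(score)  # approximately 0.4 * 0.30 * 0.20 = 0.024
```

The result is the unnormalized numerator only; it is not yet divided by P(x).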
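You can ignore the P(x) normalizing term since you only care which probability is larger and it is the same
in both cases. This leads to the naive Bayesian rule (called the maximum a posteriori (MAP) estimator)
between the two hypotheses (e.g. {Spam, NonSpam}):

$$H_{\mathrm{MAP}} = \arg\max_{H \in \{\mathrm{Spam}, \mathrm{NonSpam}\}} P(H) \prod_{i=1}^{n} P(f_i \mid H)$$

where $f_1, \dots, f_n$ are the features of x.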
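The MAP rule could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the assignment's reference solution: the probabilities are estimated as simple frequencies from a tiny hypothetical training set, add-one (Laplace) smoothing is added so that an unseen word does not force the product to zero, and the comparison is done on sums of logarithms, which picks the same winner because the logarithm is monotonic. The names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter

def train(emails):
    """emails: list of (set_of_words, label) pairs. Returns the raw counts."""
    class_counts = Counter(label for _, label in emails)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in emails:
        word_counts[label].update(words)  # each word counted once per email
    return class_counts, word_counts

def classify(words, class_counts, word_counts):
    """Return the label H maximizing log P(H) + sum_i log P(f_i | H)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log prior
        for w in words:
            # Assumption: add-one smoothing over the two outcomes (present/absent).
            p = (word_counts[label][w] + 1) / (n_label + 2)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, not drawn from the Enron data.
training = [
    ({"free", "prize", "now"}, "Spam"),
    ({"free", "money"}, "Spam"),
    ({"meeting", "tomorrow"}, "NonSpam"),
    ({"project", "meeting", "notes"}, "NonSpam"),
]
counts = train(training)
print(classify({"free", "prize"}, *counts))     # Spam
print(classify({"meeting", "notes"}, *counts))  # NonSpam
```

Working in log space also avoids numerical underflow when an email contains many words, since a product of many small probabilities quickly rounds to zero in floating point.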
I used the curated "pre-processed" files from the Enron accounting scandal. Each archive contains two folders: "spam" and "ham". Each folder contains thousands of emails, each stored in its own file.
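Given that folder layout, loading one extracted archive might look like the sketch below. The function name `load_emails`, the mapping of "ham" to the NonSpam label, and the Latin-1 decoding are assumptions for illustration; the archive path in the usage comment is a placeholder.

```python
from pathlib import Path

def load_emails(archive_dir):
    """Return a list of (text, label) pairs from the 'spam' and 'ham' folders."""
    emails = []
    for folder, label in (("spam", "Spam"), ("ham", "NonSpam")):
        for path in sorted(Path(archive_dir, folder).glob("*")):
            if path.is_file():
                # Assumption: Latin-1 decoding, which accepts any byte sequence.
                emails.append((path.read_text(encoding="latin-1"), label))
    return emails

# Usage (placeholder path to one extracted archive):
# emails = load_emails("path/to/archive")
```

Each returned text can then be tokenized into features and paired with its label for training.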
