email_spam_classification_ML

A simple email classification model using the Naive Bayes algorithm.

Naive (conditionally independent) classification

Suppose that you have a dataset $D = \{x^{(1)}, \ldots, x^{(N)}\}$. Each $x^{(i)}$ is a separate email for this assignment. Each of the $N$ data points $x = (x_1, x_2, \ldots, x_d)$ lives in a pattern space $X$, where the $x_k$ are called features. You extract features from each data point. Features in an email may include the list of "words" (tokens) in the message body. The goal in Bayesian classification is to compute two probabilities, $P(\mathrm{Spam} \mid x)$ and $P(\mathrm{NonSpam} \mid x)$, for each email. It classifies each email as "spam" or "not spam" by choosing the hypothesis with the higher probability. Naive Bayes assumes that the features of $x$ are independent given its class. $P(\mathrm{Spam} \mid x)$ is difficult to compute in general. Expand it with the definition of conditional probability:

$$P(\mathrm{Spam} \mid x) = \frac{P(\mathrm{Spam},\, x)}{P(x)} = \frac{P(\mathrm{Spam})\, P(x \mid \mathrm{Spam})}{P(x)}$$
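The README doesn't show its feature extractor, so here is a minimal sketch of that step, assuming the features are simply the lowercased word tokens of the message body:

```python
import re

def tokenize(body: str) -> list[str]:
    """Lowercase the message body and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", body.lower())

# e.g. tokenize("Win a FREE prize now!") -> ['win', 'a', 'free', 'prize', 'now']
```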

Look at the denominator $P(x)$. $P(x)$ is the probability of a particular email among the universe of all possible emails. This is very difficult to calculate, but it is just a number between 0 and 1, since it is a probability; it merely "normalizes" $P(\mathrm{Spam} \mid x)$. Now look at the numerator $P(\mathrm{Spam})\, P(x \mid \mathrm{Spam})$. First expand $x$ into its features $x = (x_1, x_2, \ldots, x_d)$. Each feature is an event that can occur or not (i.e. the word is in an email or not). So

$$P(\mathrm{Spam})\, P(x \mid \mathrm{Spam}) = P(\mathrm{Spam})\, P(x_1, x_2, \ldots, x_d \mid \mathrm{Spam})$$

Apply the multiplication theorem (HW2, 1.c) to the second term to give

$$P(\mathrm{Spam})\, P(x_1, \ldots, x_d \mid \mathrm{Spam}) = P(\mathrm{Spam})\, P(x_1 \mid \mathrm{Spam})\, P(x_2 \mid x_1, \mathrm{Spam}) \cdots P(x_d \mid x_1, \ldots, x_{d-1}, \mathrm{Spam})$$

But now you are still stuck computing a big product of complicated conditional probabilities. Naive Bayes classification makes an assumption that features are conditionally independent. This means that

$$P(x_k \mid x_j,\, \mathrm{Spam}) = P(x_k \mid \mathrm{Spam})$$

if $j \neq k$. This means that the probability of observing one feature (i.e. word) is independent of observing another word, given that the email is spam. This is a naive assumption and weakens your model. But you can now simplify the above to

$$P(\mathrm{Spam},\, x) = P(\mathrm{Spam}) \prod_{k=1}^{d} P(x_k \mid \mathrm{Spam})$$
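Using this product in practice requires estimates of each $P(x_k \mid \mathrm{Spam})$ from the training emails. The README doesn't spell out the estimator, so the sketch below assumes a standard per-word document-frequency estimate; the add-one (Laplace) smoothing is also an assumption, noted in the comments:

```python
from collections import Counter

def estimate_conditionals(class_emails: list[list[str]], vocab: set[str]) -> dict[str, float]:
    """Estimate P(word | class) as the fraction of this class's training
    emails that contain the word. `class_emails` holds the token lists of
    the spam (or ham) training emails only."""
    n = len(class_emails)
    doc_freq = Counter(word for tokens in class_emails for word in set(tokens))
    # Add-one (Laplace) smoothing is an assumption -- the README does not
    # say how words unseen in a class are handled, but without smoothing a
    # single unseen word would zero out the whole product.
    return {word: (doc_freq[word] + 1) / (n + 2) for word in vocab}
```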

You can ignore the $P(x)$ normalizing term, since you only care which probability is larger and it is the same in both cases. This leads to the naive Bayes rule (called the maximum a posteriori (MAP) estimator) between the two hypotheses $H$ (e.g. $H \in \{\mathrm{Spam},\, \mathrm{NonSpam}\}$):

$$\hat{H} = \operatorname*{arg\,max}_{H \in \{\mathrm{Spam},\, \mathrm{NonSpam}\}} P(H) \prod_{k=1}^{d} P(x_k \mid H)$$
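In code this rule is usually evaluated in log space, since a product of thousands of small probabilities underflows floating point. Here is a sketch of the MAP decision under that standard transformation; the function and argument names are illustrative, not taken from this repository:

```python
import math

def classify(tokens: list[str], priors: dict[str, float],
             conditionals: dict[str, dict[str, float]]) -> str:
    """Pick the hypothesis H maximizing log P(H) + sum_k log P(x_k | H)."""
    best_h, best_score = None, -math.inf
    for h, prior in priors.items():      # e.g. {"spam": 0.3, "ham": 0.7}
        score = math.log(prior)
        for word in set(tokens):
            if word in conditionals[h]:  # skip words outside the vocabulary
                score += math.log(conditionals[h][word])
        if score > best_score:
            best_h, best_score = h, score
    return best_h
```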

Datasets

I have used the curated "pre-processed" files of emails made public during the Enron accounting scandal. Each archive contains two folders: "spam" and "ham". Each folder contains thousands of emails, each stored in its own file.
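A minimal loader for that layout might look like the following; the `enron1/spam` and `enron1/ham` paths and the latin-1 encoding are assumptions, so adjust them to wherever the archives are extracted:

```python
from pathlib import Path

def load_folder(folder: Path) -> list[str]:
    """Read every email file in a spam/ or ham/ folder into one string each."""
    return [f.read_text(encoding="latin-1", errors="ignore")
            for f in sorted(folder.iterdir()) if f.is_file()]

# Hypothetical layout -- adjust the paths to match the extracted archive:
spam_emails = load_folder(Path("enron1/spam"))
ham_emails = load_folder(Path("enron1/ham"))
```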

