email_spam_classification_ML

A simple email classification model using the Naive Bayes algorithm.

Naive (conditionally independent) classification

Suppose that you have a dataset $\{x^{(i)}\}_{i=1}^{N}$. Each $x^{(i)}$ is a separate email. Each of the $N$ data points $x^{(i)} = (f_1, \dots, f_n)$ lies in the pattern space $X$, where $f_1, \dots, f_n$ are called features. You extract features from each data point. Features in an email may include the list of "words" (tokens) in the message body.

The goal in Bayesian classification is to compute two probabilities, $P(\text{Spam} \mid x)$ and $P(\text{NonSpam} \mid x)$, for each email. The classifier labels each email as "spam" or "not spam" by choosing the hypothesis with the higher probability. Naive Bayes assumes that the features of $x$ are independent given its class. $P(\text{Spam} \mid x)$ is difficult to compute in general. Expand it with the definition of conditional probability:

$$P(\text{Spam} \mid x) = \frac{P(\text{Spam} \cap x)}{P(x)}$$

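Look at the denominator $P(x)$. It is the probability of a particular email given the universe of all possible emails. This is very difficult to calculate, but it is just a number between 0 and 1 since it is a probability. It merely "normalizes" $P(\text{Spam} \cap x)$. Now look at the numerator $P(\text{Spam} \cap x)$. First expand $x$ into its features $f_1, \dots, f_n$. Each feature is an event that can occur or not (i.e. the word is in an email or not). So

$$\frac{P(\text{Spam} \cap x)}{P(x)} \propto P(\text{Spam} \cap x) = P(\text{Spam} \cap f_1 \cap \dots \cap f_n) = P(\text{Spam}) \, P(f_1 \cap \dots \cap f_n \mid \text{Spam})$$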

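Apply the multiplication theorem (the chain rule for conditional probabilities) to the second term, $P(f_1 \cap \dots \cap f_n \mid \text{Spam})$, to give

$$P(f_1 \cap \dots \cap f_n \mid \text{Spam}) = P(f_1 \mid \text{Spam}) \, P(f_2 \mid \text{Spam} \cap f_1) \cdots P(f_n \mid \text{Spam} \cap f_1 \cap \dots \cap f_{n-1})$$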

But now you are still stuck computing a big product of complicated conditional probabilities. Naive Bayes classification assumes that features are conditionally independent. This means that

$$P(f_i \mid \text{Spam} \cap f_k) = P(f_i \mid \text{Spam})$$

if $i \neq k$. In other words, the probability that you observe one feature (i.e. a word) is independent of observing another word, given that the email is spam. This is a naive assumption and weakens your model. But you can now simplify the above to

$$P(\text{Spam} \cap f_1 \cap \dots \cap f_n) = P(\text{Spam}) \prod_{k=1}^{n} P(f_k \mid \text{Spam})$$

where the index $k$ starts from 1 and runs over the $n$ features.
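As a small illustration of the simplified product, the sketch below evaluates a prior times a product of per-word conditional probabilities. The function name `naive_joint_log_score` and all numbers are made up for this example and are not taken from the repository code. It sums logarithms rather than multiplying directly, a common way to avoid numerical underflow when an email has many words.

```python
import math

def naive_joint_log_score(prior, feature_likelihoods):
    """Return log(P(H) * prod_k P(f_k | H)), computed as a sum of logs."""
    return math.log(prior) + sum(math.log(p) for p in feature_likelihoods)

# Made-up values, for illustration only.
p_spam = 0.4                           # P(Spam)
p_words_given_spam = [0.05, 0.2, 0.01] # P(f_k | Spam) for three words

log_score = naive_joint_log_score(p_spam, p_words_given_spam)

# The same quantity computed as a plain product, for comparison.
direct = p_spam * math.prod(p_words_given_spam)
```

Exponentiating `log_score` recovers the direct product up to floating-point error.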

You can ignore the $P(x)$ normalizing term since you only care which probability is larger and it is the same in both cases. This leads to the naive Bayesian rule (called the maximum a posteriori (MAP) estimator) between the two hypotheses $H \in \{\text{Spam}, \text{NonSpam}\}$:

$$H_{\text{MAP}} = \arg\max_{H \in \{\text{Spam}, \text{NonSpam}\}} P(H) \prod_{k=1}^{n} P(f_k \mid H)$$
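The rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code in `email_clssification.py`: the names `train` and `classify` are hypothetical, the product runs only over words present in the email, probabilities are estimated from the fraction of training emails containing each word, and add-one (Laplace) smoothing is added so that an unseen word does not force a probability of zero.

```python
import math
from collections import Counter

def train(labeled_emails):
    """Count, per class, how many emails contain each word.

    labeled_emails: iterable of (label, list_of_tokens) pairs.
    """
    doc_counts = Counter()   # number of emails per class
    word_doc_counts = {}     # class -> Counter of emails containing each word
    vocabulary = set()
    for label, tokens in labeled_emails:
        doc_counts[label] += 1
        present = set(tokens)  # feature: the word is in the email or not
        word_doc_counts.setdefault(label, Counter()).update(present)
        vocabulary |= present
    return doc_counts, word_doc_counts, vocabulary

def classify(tokens, model):
    """Return the hypothesis H maximizing log P(H) + sum_k log P(f_k | H)."""
    doc_counts, word_doc_counts, vocabulary = model
    total = sum(doc_counts.values())
    present = set(tokens) & vocabulary  # words never seen in training are skipped
    scores = {}
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total)  # log prior, log P(H)
        for word in present:
            # Add-one smoothing (an assumption of this sketch).
            p = (word_doc_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny made-up training set, for illustration only.
training = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money"]),
    ("ham", ["meeting", "tomorrow"]),
    ("ham", ["project", "meeting", "notes"]),
]
model = train(training)
label_a = classify(["free", "money", "now"], model)
label_b = classify(["meeting", "notes"], model)
```

Because $P(x)$ is dropped, the scores are not probabilities; only their ordering matters.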

Datasets

I used the curated "pre-processed" files of the Enron email dataset, which originates from the Enron accounting scandal. Each archive contains two folders: "spam" and "ham". Each folder contains thousands of emails, each stored in its own file.
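Given that layout, reading one extracted archive might look like the sketch below. The helper names `load_emails` and `tokenize` are hypothetical, the whitespace tokenizer is a deliberately crude stand-in, and the `latin-1` decoding is an assumption chosen because it never fails on arbitrary bytes. The repository's own loading code may differ.

```python
from pathlib import Path

def load_emails(archive_dir):
    """Yield (label, text) for each file in <archive_dir>/spam and <archive_dir>/ham."""
    for label in ("spam", "ham"):
        folder = Path(archive_dir) / label
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # latin-1 maps every byte to a character, so decoding cannot fail
                # (an assumption about the files, not a documented encoding).
                yield label, path.read_text(encoding="latin-1")

def tokenize(text):
    """Split a message body into lowercase whitespace-separated tokens."""
    return text.lower().split()
```

Calling `load_emails` on the directory of an extracted archive yields one labeled message per file, which `tokenize` turns into the word features described above.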


houssam2293/email_spam_classification_ML
