# Newsgroups Naive Bayes

This project seeks to build a multinomial naive Bayes model for text classification. Rather than relying on the pre-built sklearn.naive_bayes.MultinomialNB module for the bulk of the work, I will construct a .ipynb notebook that mimics that classifier's behaviour. The predictions require two main calculations: the label priors for all classes in the dataset, and the per-class word probabilities. After some manipulation, these two data structures combine to form values proportional to the posterior probabilities for each class. The final step in classification is to choose the most probable posterior hypothesis; that class becomes the prediction.
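A minimal sketch of that calculation, assuming a document-term count matrix `X` and integer labels `y` (the function names and the smoothing constant `alpha` are illustrative, not taken from the notebook):

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate class log priors and per-class word log-probabilities.

    X : (n_docs, n_words) array of word counts
    y : (n_docs,) array of integer class labels
    alpha : smoothing constant that avoids zero probabilities
    """
    classes = np.unique(y)
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))

    # Smoothed word counts per class -> conditional log-probabilities
    log_likelihoods = []
    for c in classes:
        counts = X[y == c].sum(axis=0) + alpha
        log_likelihoods.append(np.log(counts / counts.sum()))
    log_likelihoods = np.vstack(log_likelihoods)   # shape (n_classes, n_words)

    return classes, log_priors, log_likelihoods

def predict(X, classes, log_priors, log_likelihoods):
    # Log-posterior up to a constant = log prior + sum of word log-probabilities
    joint = X @ log_likelihoods.T + log_priors
    return classes[np.argmax(joint, axis=1)]
```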

The first attempt at classification of my test dataset yielded an error rate of 20.92%, which is not terrible but certainly leaves room for improvement. Since my application of naive Bayes already uses log smoothing and log-scaled frequencies to compensate for zero probabilities and word burstiness, the next reasonable step was to downweight word probabilities by their inverse document frequencies. Admittedly, this didn't yield as large an improvement as expected, but it still dropped my error rate to 18.09%. As a follow-up, it might also be useful to explore completely removing words from the dataset vocabulary that are "particularly misleading" (i.e., appear at similar frequencies across multiple classes). Making the list of words used in classification more distinctive to each label should further decrease my error rate.
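One way the log-frequency and IDF downweighting described above might look, assuming a raw document-term count matrix `X` (the exact formula and constants in the notebook may differ):

```python
import numpy as np

def transform_counts(X):
    """Dampen word burstiness and downweight words common to many documents.

    X : (n_docs, n_words) array of raw word counts
    """
    n_docs = X.shape[0]
    tf = np.log1p(X)                          # log-scaled frequencies curb burstiness
    doc_freq = (X > 0).sum(axis=0)            # number of documents containing each word
    idf = np.log(n_docs / (1.0 + doc_freq))   # frequent, uninformative words get small weights
    return tf * idf
```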