
A project that uses the NLTK toolkit to classify spam comments on YouTube


ThomasWongHY/NLP_Project


NLP_Project

Steps to run the code

  1. Change the path to your own file path


  1. Explore the dataset

According to the results, one comment appears twice in the dataset. Also, the spam and non-spam classes each make up about half of the dataset.
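The two checks described above can be sketched with pandas. This is a minimal illustration, not the project's code: the rows are made up, and the column names `CONTENT` and `CLASS` are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset; the column names
# CONTENT and CLASS (1 = spam, 0 = non-spam) are assumptions.
df = pd.DataFrame({
    "CONTENT": [
        "Check out my channel",
        "I love this song",
        "Check out my channel",      # same text as the first row
        "Subscribe for free gifts",
        "Best video ever",
        "Great lyrics",
    ],
    "CLASS": [1, 0, 1, 1, 0, 0],
})

# Number of comments whose text already appeared earlier in the data
n_duplicates = int(df["CONTENT"].duplicated().sum())

# Share of each class in the whole dataset
class_share = df["CLASS"].value_counts(normalize=True)

print("duplicated comments:", n_duplicates)
print(class_share)
```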

  1. Preprocess the data with the NLTK toolkit

In data preprocessing, we first keep only alphabetic characters and convert all letters to lowercase. Then we tokenize the text, remove English stop words, and apply the Porter stemmer, all with the NLTK toolkit.
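A minimal sketch of these preprocessing steps follows. To keep it runnable without downloading NLTK corpora, it makes two assumptions: a small hand-written stop-word set stands in for NLTK's English list (`nltk.corpus.stopwords.words("english")`), and whitespace splitting stands in for NLTK's tokenizer. The function name `preprocess` is hypothetical.

```python
import re

from nltk.stem import PorterStemmer

# Stand-in stop-word list (assumption); the project uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "to", "is", "my", "out", "this", "i", "you"}

stemmer = PorterStemmer()


def preprocess(comment: str) -> list[str]:
    """Return the stemmed, stop-word-free tokens of one comment."""
    # Keep alphabetic characters only; everything else becomes a space
    letters_only = re.sub(r"[^A-Za-z]+", " ", comment)
    # Lowercase, then tokenize on whitespace (stand-in for an NLTK tokenizer)
    tokens = letters_only.lower().split()
    # Drop stop words, then reduce each remaining token to its Porter stem
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [stemmer.stem(token) for token in kept]


print(preprocess("Check out my new VIDEOS!!! 123"))
```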


Although stemming and lemmatization both aim to reduce inflectional forms and derivationally related forms of a word to a common base form, they are slightly different. Stemming typically refers to a basic heuristic that strips affixes from words in the hope of reaching this goal most of the time. Lemmatization normally refers to doing this properly, using a vocabulary and morphological analysis of words, generally with the goal of removing only inflectional endings and returning the base or dictionary form of a word, which is called the lemma. Applying both stemming and lemmatization is time-consuming and improves accuracy only slightly. Therefore, we use only stemming in data preprocessing: it does not require the part of speech to be specified, which makes it more general.
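As a small illustration of the stemming heuristic, the sketch below runs NLTK's `PorterStemmer` on several related word forms. The word list is chosen for illustration only; no part-of-speech information is supplied.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected and derived forms of one word, chosen for illustration
words = ["connect", "connected", "connecting", "connection", "connections"]

# The stemmer works on the surface form alone, with no part-of-speech tag
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```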

  1. Visualize the most frequently used words

The larger a word appears in the word cloud, the more frequently it occurs in comments of the spam class. Therefore, when we write new comments to test the trained model, we can draw on the words that appear in the word cloud.
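The word sizes in a word cloud come from word frequencies. A minimal sketch of that counting step, using made-up preprocessed spam comments rather than the real data:

```python
from collections import Counter

# Hypothetical preprocessed spam comments (already tokenized and stemmed)
spam_tokens = [
    ["check", "channel"],
    ["subscrib", "channel"],
    ["free", "gift", "check"],
]

# Count how often each token occurs across all spam comments
counts = Counter(token for comment in spam_tokens for token in comment)

# The most frequent tokens are the ones a word cloud would draw largest
top_words = counts.most_common(2)
print(top_words)
```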

  1. Train the Naive Bayes classifier model

In model training, we first shuffle the dataset and split it into 75% for training and 25% for testing, which gives 262 training samples and 88 testing samples. Next, we fit a Naive Bayes classifier on the training data and cross-validate the model on the training data.


The mean cross-validation accuracy is above 90%, which means the model classifies comments as spam or non-spam quite accurately on the training data.
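The training step can be sketched with scikit-learn as follows. Several choices here are assumptions rather than details from the project: the bag-of-words `CountVectorizer`, the `MultinomialNB` variant of Naive Bayes, the 5-fold cross-validation, the `random_state`, and the toy corpus. The last lines only check the split arithmetic for 350 rows, the total implied by 262 + 88.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the preprocessed comments (1 = spam)
spam = ["check out my channel", "subscribe to my channel please",
        "free gift card click link", "visit my website for free money"]
ham = ["i love this song", "this video is so good",
       "best music ever", "great voice and great lyrics"]
texts = (spam + ham) * 5             # 40 comments in total
labels = ([1] * 4 + [0] * 4) * 5

# Shuffle and split: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, shuffle=True, random_state=42)

# Bag-of-words features feeding a multinomial Naive Bayes classifier (assumed)
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Cross-validate on the training data only, then fit on all of it
scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)
print("mean cross-validation accuracy:", scores.mean())

# Split sizes for 350 rows with the same 75/25 ratio
rows_train, rows_test = train_test_split(list(range(350)), test_size=0.25,
                                         random_state=42)
print(len(rows_train), len(rows_test))
```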

  1. Test the Naive Bayes classifier model

The accuracy on the testing data is similar to that on the training data; both are around 93%. According to the confusion matrix, there are 42 true positives, 1 false positive, 6 false negatives, and 39 true negatives.
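As a worked check, the usual metrics can be derived from the four confusion-matrix counts reported above. Precision and recall are not reported in the original results; they are shown here only as quantities that follow from the same counts. The counts give an accuracy of about 92%, close to the figure quoted above.

```python
# Confusion-matrix counts reported for the 88 test comments
tp, fp, fn, tn = 42, 1, 6, 39

total = tp + fp + fn + tn

# Fraction of all test comments classified correctly
accuracy = (tp + tn) / total
# Fraction of predicted positives that are actually positive
precision = tp / (tp + fp)
# Fraction of actual positives that were predicted positive
recall = tp / (tp + fn)

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
```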

  1. Classify with new data

In the final classification step, we create some new data, consisting of 2 spam comments and 4 non-spam comments, and pass them to the trained classifier.


The results show that the classifier correctly identifies the spam comments. The output also lists each input comment alongside its predicted class.
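A minimal sketch of this last step is shown below. The training texts, the new comments, and the `CountVectorizer` plus `MultinomialNB` pipeline are all assumptions made for illustration; they are not the project's actual data or necessarily its exact model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training comments (1 = spam, 0 = non-spam)
train_texts = [
    "check out my channel", "subscribe to my channel please",
    "free gift card click link", "visit my website for free money",
    "i love this song", "this video is so good",
    "best music ever", "great voice and great lyrics",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Assumed model: bag-of-words features with multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Hypothetical new comments: 2 intended as spam, 4 as non-spam
new_comments = [
    "please subscribe to my channel",
    "click the link for a free gift",
    "this song is so good",
    "i love her voice",
    "best lyrics ever",
    "great music video",
]

predictions = model.predict(new_comments)

# Show each input comment next to its predicted class
for comment, label in zip(new_comments, predictions):
    print(f"{label}  {comment}")
```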
