NLP_Project

Steps to run the code

Change the path to your own file path

Explore the dataset

The results show that one comment's content appears twice in the dataset. The spam and non-spam classes each make up about half of the comments.
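The two checks described above (duplicate content and class balance) can be sketched with the standard library alone. This is a minimal illustration, not the project's script: the column names `CONTENT` and `CLASS`, the label encoding (1 = spam, 0 = non-spam), and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for the CSV file. The column names and the
# label encoding (1 = spam, 0 = non-spam) are assumptions for this sketch.
SAMPLE_CSV = """CONTENT,CLASS
Check out my channel,1
Great song,0
Check out my channel,1
Love this video,0
"""

def explore(rows):
    """Return (class counts, comment texts that occur more than once)."""
    rows = list(rows)
    class_counts = Counter(row["CLASS"] for row in rows)
    content_counts = Counter(row["CONTENT"] for row in rows)
    duplicates = [text for text, n in content_counts.items() if n > 1]
    return class_counts, duplicates

class_counts, duplicates = explore(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print("Class counts:", dict(class_counts))
print("Duplicated comments:", duplicates)
```

To run the same checks on the real file, replace the `io.StringIO(SAMPLE_CSV)` argument with an open file handle for your own dataset path.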

Preprocess the data with the NLTK toolkit

In data preprocessing, we first keep only alphabetic characters and convert them to lowercase. Then we tokenize the text, remove English stop words, and apply the Porter stemmer, all using the NLTK toolkit.

Although stemming and lemmatization both aim to reduce inflectional and derivationally related forms of a word to a common base form, they differ slightly. Stemming typically refers to a crude heuristic that chops off affixes in the hope of achieving this goal most of the time. Lemmatization typically refers to doing this properly, using a vocabulary and morphological analysis of words, with the goal of removing only inflectional endings and returning the base or dictionary form of a word, known as the lemma. Applying both stemming and lemmatization is time-consuming and improves accuracy only slightly. Therefore, we use only stemming in data preprocessing: it does not require the part of speech to be specified, which makes it more general.
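The preprocessing steps can be sketched as follows. This is an illustration under stated assumptions rather than the project's exact code: it uses NLTK's `PorterStemmer`, but substitutes a small hand-written stop-word set and whitespace tokenization for NLTK's stop-word list and tokenizer, so the sketch runs without downloading any NLTK data. The function name `preprocess` is hypothetical.

```python
import re

from nltk.stem import PorterStemmer

# Illustrative stand-in for NLTK's English stop-word list (assumption:
# the project uses the full NLTK list, which needs a one-time download).
STOP_WORDS = {"the", "is", "a", "this", "my", "and", "to", "out"}

stemmer = PorterStemmer()

def preprocess(comment):
    """Keep letters only, lowercase, tokenize, drop stop words, then stem."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", comment).lower()
    tokens = letters_only.split()  # simple whitespace tokenization
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]

tokens = preprocess("Watching this video!!! Check out my channel 123")
print(tokens)
```

Note that the stemmer needs no part-of-speech tag: "watching" is reduced to "watch" purely by suffix rules, which is the property the paragraph above relies on.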

Visualize the most frequently used words

The larger a word appears in the word cloud, the more frequently it occurs in the comments of the spam class. Therefore, when we write new comments to test the trained model's performance, we can choose words that appear in the word cloud.
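A word cloud sizes each word by how often it occurs, so the underlying data is just a frequency table. The sketch below computes that table with `collections.Counter`; the token lists are made-up examples, not comments from the dataset, and the plotting step itself is left out.

```python
from collections import Counter

# Hypothetical preprocessed spam comments (already tokenized and stemmed).
spam_tokens = [
    ["check", "channel"],
    ["subscrib", "channel"],
    ["check", "channel", "video"],
]

# Count every token across all spam comments.
frequencies = Counter(token for doc in spam_tokens for token in doc)

# The most common words are the ones a word cloud would draw largest.
top_words = frequencies.most_common(2)
print(top_words)  # [('channel', 3), ('check', 2)]
```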

Train the Naive Bayes classifier model

In model training, we first shuffle the dataset and split it into 75% for training and 25% for testing, which gives 262 training samples and 88 testing samples. Next, we fit a Naive Bayes classifier on the training data and cross-validate the model on the training data.
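The shuffle-and-split step can be sketched in plain Python. The total of 350 samples is inferred from the reported 262 + 88; rounding the test size up is an assumption that reproduces those numbers, and the helper name `shuffle_split` and the seed are hypothetical.

```python
import math
import random

def shuffle_split(items, test_fraction=0.25, seed=42):
    """Shuffle a copy of the items, then split off the test portion."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_test = math.ceil(len(items) * test_fraction)  # round test size up
    return items[n_test:], items[:n_test]

# Sample indices stand in for the comments themselves.
train, test = shuffle_split(range(350))
print(len(train), len(test))  # 262 88
```

Splitting indices rather than rows keeps the sketch independent of how the comments and labels are stored.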

The mean accuracy is above 90%, which means the model classifies comments as spam or non-spam quite accurately on the training data.

Test the Naive Bayes classifier model

The mean accuracy on the testing data is similar to that on the training data in this scenario; both are around 93%. According to the confusion matrix shown in the graph, there are 42 true positives, 1 false positive, 6 false negatives, and 39 true negatives.
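The usual summary metrics follow directly from the four confusion-matrix counts reported above. The short calculation below uses only those counts; the four counts sum to the 88 testing samples, and the accuracy works out to 81/88, roughly 92%.

```python
# Confusion-matrix counts reported for the testing data.
tp, fp, fn, tn = 42, 1, 6, 39

total = tp + fp + fn + tn           # all testing samples
accuracy = (tp + tn) / total        # share of correct predictions
precision = tp / (tp + fp)          # predicted spam that really is spam
recall = tp / (tp + fn)             # actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The single false positive gives high precision, while the six false negatives pull recall down, so the classifier misses some spam more often than it flags legitimate comments.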

Classify with new data

For the final classification, we create some new data consisting of 2 spam comments and 4 non-spam comments and pass them to the trained classifier.

The results show that the classifier classifies the spam comments perfectly. The output also shows each input comment and its predicted class.
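To illustrate what happens when new comments are passed to a trained Naive Bayes classifier, here is a from-scratch sketch of a multinomial Naive Bayes model with add-one smoothing. It is not the project's implementation, which may well rely on a library classifier; the function names `train_nb` and `predict_nb`, the training data, and the new comments are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class document counts, word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for word in tokens:
            if word in vocab:  # skip words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical preprocessed training comments and their labels.
train_docs = [
    ["check", "channel"],
    ["subscrib", "channel", "free"],
    ["love", "song"],
    ["great", "video", "love"],
]
train_labels = ["spam", "spam", "non-spam", "non-spam"]
model = train_nb(train_docs, train_labels)

# Hypothetical new comments, already preprocessed the same way.
new_comments = [["check", "free", "channel"], ["love", "great", "song"]]
predictions = [predict_nb(model, tokens) for tokens in new_comments]
for tokens, label in zip(new_comments, predictions):
    print(" ".join(tokens), "->", label)
```

New comments must go through the same preprocessing as the training data; otherwise their words will not match the vocabulary the model learned.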
