Performance evaluation of sentiment classification in movie reviews
Given the large volume of online review data now available (Amazon, IMDB, etc.), sentiment analysis is becoming increasingly important. In this project, sentiment classification of movie reviews is evaluated using ensemble methods.
The dataset (aclImdb) can also be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
The training dataset in the aclImdb folder has two sub-directories: pos/ for positive texts and neg/ for negative ones. Use only these two directories. The first task is to combine both into a single CSV file, "imdb_tr.csv", which is the output of this preprocessing step. The CSV file has three columns: "row_number", "text", and "polarity". The "text" column contains the review texts from the aclImdb database, and the "polarity" column holds the sentiment labels: 1 for positive and 0 for negative. In addition, common English stopwords should be removed from the review texts. An English stopwords list ('stopwords.en') is given in the code for reference.
Vectorization methods: Unigram, Bigram
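The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names match those listed in this README, but their signatures, the `*.txt` file pattern, and the whitespace-based tokenization are assumptions.

```python
import csv
from pathlib import Path


def remove_stopwords(sentence, stopwords):
    """Return the sentence with every stopword token dropped.

    Assumption: tokens are split on whitespace, so punctuation attached
    to a word (e.g. "the,") is not matched against the stopword list.
    """
    return " ".join(w for w in sentence.split() if w.lower() not in stopwords)


def imdb_data_preprocess(inpath, outpath="imdb_tr.csv", stopwords=frozenset()):
    """Merge the pos/ and neg/ reviews under inpath into a single CSV file."""
    row_number = 0
    with open(outpath, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["row_number", "text", "polarity"])
        # 1 marks a positive review, 0 a negative one.
        for folder, polarity in (("pos", 1), ("neg", 0)):
            for review in sorted(Path(inpath, folder).glob("*.txt")):
                text = review.read_text(encoding="utf-8")
                cleaned = remove_stopwords(text, stopwords)
                writer.writerow([row_number, cleaned, polarity])
                row_number += 1
    return row_number  # number of reviews written
```

A call might look like `imdb_data_preprocess("aclImdb/train", "imdb_tr.csv", stopwords)`, where `stopwords` is a set of words loaded from 'stopwords.en'.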
Feature Extraction: TF-IDF
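As a rough illustration of these options, the sketch below builds unigram and bigram count features with scikit-learn and then reweights the unigram counts with TF-IDF. The toy sentences are made up, and treating "bigram" as unigrams plus bigrams (`ngram_range=(1, 2)`) is an assumption; the project may instead use bigrams only (`ngram_range=(2, 2)`).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents, for illustration only.
docs = [
    "great movie great cast",
    "terrible movie",
    "great plot terrible cast",
]

# Unigram: every single word is a feature.
unigram = CountVectorizer(ngram_range=(1, 1))
X_uni = unigram.fit_transform(docs)

# Bigram (assumed here to mean unigrams plus adjacent word pairs).
bigram = CountVectorizer(ngram_range=(1, 2))
X_bi = bigram.fit_transform(docs)

# TF-IDF: reweight raw counts so that terms appearing in many
# documents contribute less than rarer, more distinctive terms.
tfidf = TfidfTransformer()
X_uni_tfidf = tfidf.fit_transform(X_uni)

print(X_uni.shape, X_bi.shape, X_uni_tfidf.shape)
```

The fitted vectorizer should be reused with `transform` (not `fit_transform`) on the test data so that both sets share the same vocabulary.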
In this project, we will train ensemble methods and evaluate the optimized combination:
http://scikit-learn.org/stable/modules/ensemble.html
imdb_data_preprocess : Explores the neg and pos folders from aclImdb/train and creates an imdb_tr.csv file in the required format
remove_stopwords : Takes a sentence and the stopwords as inputs and returns the sentence without any stopwords
unigram_process : Takes the data to be fit as the input and returns a vectorizer of the unigram as output
bigram_process : Takes the data to be fit as the input and returns a vectorizer of the bigram as output
tfidf_process : Takes the data to be fit as the input and returns a vectorizer of the tfidf as output
retrieve_data : Takes a CSV file as the input and returns the corresponding arrays of labels and data as output
random_forest_classifier : Applies Random Forest on the training data and returns the predicted labels
extra_tree_classifier : Applies Extra Trees on the training data and returns the predicted labels
bagging_decision_tree : Applies Bagged Decision Tree on the training data and returns the predicted labels
ada_boost_classifier : Applies AdaBoost on the training data and returns the predicted labels
gradient_boost_classifier : Applies Gradient Boosting on the training data and returns the predicted labels
accuracy : Finds the accuracy in percentage given the training and test labels
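The classifier and accuracy functions listed above could be approximated with a loop like the one below. This is a hedged sketch rather than the project's implementation: the toy reviews, the hyperparameters, and the `models` and `scores` names are all assumptions, and `BaggingClassifier` is used with its default base estimator, which is a decision tree.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Toy data, for illustration only (1 = positive, 0 = negative).
train_texts = [
    "great movie wonderful cast", "loved this film great story",
    "wonderful acting great plot", "terrible movie awful cast",
    "hated this film boring story", "awful acting terrible plot",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great wonderful film", "boring awful movie"]
test_labels = [1, 0]

# Fit the vectorizer on the training texts only, then reuse it on the test texts.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Hyperparameters here are placeholders, not tuned values.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "bagged_decision_tree": BaggingClassifier(n_estimators=10, random_state=0),
    "ada_boost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient_boost": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predicted = model.predict(X_test)
    # Accuracy as a percentage of correctly predicted test labels.
    scores[name] = 100.0 * accuracy_score(test_labels, predicted)

for name, score in scores.items():
    print(f"{name}: {score:.1f}%")
```

With the real data, `train_texts` and `train_labels` would come from imdb_tr.csv (for example via retrieve_data) instead of the inline lists.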
OS: Linux Mint
Language: Python 3
Libraries: scikit-learn, pandas
Run python sentimentalAnalysis.py
Check Result in ScreenShot folder
Supervised Ensemble Machine Learning Aided Performance Evaluation of Sentiment Classification
Sheikh Shah Mohammad Motiur Rahman, Md. Habibur Rahman, Kaushik Sarker, Md. Samadur Rahman, Nazmul Ahsan, M. Mesbahuddin Sarker
2nd International Conference on Data Mining, Communications and Information Technology (DMCIT 2018), Shanghai, China