Skip to content

Comparing 5 sentiment analysis models on various datasets of different domains.

Notifications You must be signed in to change notification settings

Shr2020/Sentiment-Analysis-Models-Comparative-Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

53 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Comparative Analysis of Models for Sentiment Analysis

1.) This project aims to compare 5 models and test the reliability of hybrid sentiment analysis models on various datasets of different domains.

2.) Determining whether it is possible to produce hybrid models that outperform single models with different domains and types of datasets.

Datasets

Two datasets are used for comparison:

1.) IMDb Movie review dataset: 50000 reviews

2.) Amazon Product review dataset: ~100000 reviews

These datasets are divided into two labels for sentiment: Positive and Negative.

Models

We have used five models for comparison:

1.) LSTM

2.) CNN

3.) QRNN

4.) CNN-LSTM

5.) LSTM-CNN

LSTM is a network model that solves the problems of gradient explosion and vanishing gradient in RNN. It is widely used in sentiment analysis and text analysis, as it has its own memory and therefore has the advantages of analyzing relationships among time series data through its memory function. It can also make relatively accurate forecasting.

LSTM has a memory cell at the top which helps to carry the information from a particular time instance to the next time instance in an efficient manner. Therefore it can able to remember a lot of information from previous states. This information can be added or removed from the memory cell using valves. In the LSTM network, the total input is the input data from the current time instance and the output of hidden layer from the previous time instance. These are then passed through activation functions and valves in the network and then output is produced.

CNN can also be used for text analysis. CNN's can process local features such as n-grams in text. We can use CNNs for sentiment analysis by using the word to int vectors.The word vectors act like Images as some points in space. By using word-to-Int vectors of fixed size and stacking them on top of each other, we get an “image”. Therefore we can use CNN on text similar to how we use image classification. For NLP, we typically use filters that slide over word embeddings — matrix rows.

Text data needs to be analysed in the same way as people read the text: word by word, from the beginning to the end, taking context into account as one go. This conventional one-word-at-a-time approach is called a "recurrent neural network". Unlike the conventional RNN, in the "quasi-recurrent neural network," or QRNN most of the computation happens all at once parallelly which can make it up to 16 times faster than the old approach. QRNN approach can also help with the interpretability of neurons, because each neuron's activation doesn't depend on the past history of any other neurons. The neurons thus can have (although not guaranteed) independent and well-defined meanings and these meanings are more likely to be simpler and more human-interpretable.

Using Combinations of CNN and LSTM can mean that we used the best of both worlds. CNN is able to learn the features that contribute significantly to the sentiment of the sentence using overlapping sequences resulting in a matrix of word vectors. This matrix can then be fed to LSTM. The LSTM layer can then focus on more important data rather than memorizing the whole sentence. This can lead to an increase in the efficiency of the model.

Preprocessing

The first step is to pre-process the data. The preprocessing steps include:

1.) Removing stop words, punctuations, html parsers etc.

2.) Converting to lower case.

3.) Stemming/Lemmatization.

The files for preprocessing in the repo are:

1.)amazon_reviews_preprocess.ipynb

2.)imdb_preprocess.ipynb

These files will:

1.) create a file preprocessed_data.pkl (IMDb dataset) and preprocessed_data_ammazon.pkl (Amazon dataset) after text cleaning in cache/sentiment_analysis/ directory.

2.) Create dictionary for both the datasets(word_dict_qrnn.pkl for IMDb and word_dict_amazon.pkl for Amazon dataset).

3.) Perform word to int mapping for both the datasets:

IMDb: train_qrnn.csv, test_qrnn.csv

Amazon: train_amazon.csv, test_amazon.csv

4.) Dictionary and Word to Int Mappings can be found in /data/pytorch directory

After these files are created we can proceed to training.

Model Training and Testing

Training files for all models in repo

(First run the preprocessing step, then run the files from the marked cell)

Experimental Setup:

  1. Used HPC Cluster. GPU: RTX8000
  2. Train and test data is mixed and separated to 80%(train), 10%(val) and 10%(test)
  3. Ran the training for 10 Epochs for all runs (to maintain uniformity)

First we trained and tested the models on both the datasets using the 5 models specified above with some default parameter.

Model parameters for IMDb and Amazon Dataset:

LSTM QRNN CNN CNN-LSTM LSTM-CNN
Optimizer Adam Adam Adam Adam Adam
LR 0.001 0.001 0.001 0.001 0.001
Dropout 0 0.3 0.5 0.5 0.5
Hidden size/Num Layers(qrnn) 100 2 - 100 100
Embed_size 32 32 32 32 32
Filter size/num_filters(qrnn) - 64 100 100 100

We then used Hyperband optimizations to get the optimum values of parameters. We tested the optimized models and compared the results.

Model Parameters for IMDb Dataset

LSTM QRNN CNN CNN-LSTM LSTM-CNN
Optimizer RMSProp RMSprop Adam Adam RMSprop
LR 0.0038 0.0003 0.0004 0.0091 0.145
Dropout 0.3 0.189 0.3764 0.367 0.381
Hidden size/Num Layers(qrnn) 215 3 - 117 56
Embed_size 112 93 54 31 119
Filter size/num_filters(qrnn) - 57 115 103 172

Model Parameters for Amazon Dataset

LSTM QRNN CNN CNN-LSTM LSTM-CNN
Optimizer Adam Adam RMSprop RMSprop Adam
LR 0.8 0.0001 0.00301 4.69 0.354
Dropout 0.14 0.24 0.5373 0.535 0.148
Hidden size/Num Layers(qrnn) 247 4 159 188
Embed_size 55 52 73 45 40
Filter size/num_filters(qrnn) - 176 56 93 92

## Results

Results are plotted in the file:

  1. Amazon_graphs-Hyperband.ipynb (Amazon dataset, with Hyperband Optimization)
  2. Amazon_graphs.ipynb (Amazon dataset, without Hyperband Optimization)
  3. IMDB_graphs-FINAL.ipynb (IMDb dataset, without Hyperband Optimization)
  4. IMDB_graphs-Hyperband.ipynb (IMDb dataset, with Hyperband Optimization)
  5. Hyperband vs Non-Hyperband AMAZON.ipynb (Amazon dataset used)
  6. Hyperband vs Non-Hyperband IMDB.ipynb (IMDb dataset used)

IMDB Dataset:

alt text

Amazon Dataset

alt text

IMDb Dataset: With Hyperband Optimization

alt text

Amazon Dataset: With Hyperband Optimization

alt text

Comparing the performance of Two approaches (Hyperband and Non-Hyperband)

IMDb Dataset

alt text

Amazon dataset

alt text

Observations:

  • The Accuracy for all models is the same more or less. CNN-LSTM performed the best looking at the graphs for the IMDb dataset. For the Amazon Dataset, LSTM performed the best.
  • LSTM-CNN suffers from overfitting. Hence its Accuracy is the least for both the Datasets.
  • QRNN takes less time than half time than LSTM as expected.
  • CNN and CNN-LSTM were the most time consuming when run without Hyperband optimisation.
  • The Accuracy for all the models increases or remains the same after Hyperband Optimization.
  • The train time for CNN and CNN-LSTM decreases with Hyperband Optimization. This may be because the hyperband chooses the optimisers and the right optimisers can reduce the training time significantly.

Performace Observations:

  • With hyper parameter optimization, although LSTM has higher accuracy (88.331) compared to CNN-LSTM accuracy (88.216), training time of LSTM model was almost double compared to CNN-LSTM
  • Considering the tradeoff between accuracy and training time,CNN-LSTM performs better on amazon data set.
  • We also observed that the effectiveness of the algorithms depends largely on the characteristics and quality of the datasets
  • We observed overfitting for most of the models with imdb data-sets due to the small size of data
  • As the accuracy was almost same for all models with or without hyper-parameter optimization, we think that our input feature vectors should get better before feeding to Deep Learning Classifier.
  • Hence, using robust preprocessing techniques using deep learning like BERT might help us to reduce overfitting and also to increase the accuracy