Multi-class sentiment analysis problem to classify texts into five emotion categories: joy, sadness, anger, fear, neutral. A fun weekend project to go through different text classification techniques. This includes dataset preparation, traditional machine learning with scikit-learn, LSTM neural networks and transfer learning using BERT (tensorflow keras).
Summary Table
| Dataset | Year | Content | Size | Emotion categories | Balanced |
|---|---|---|---|---|---|
| dailydialog | 2017 | dialogues | 102k sentences | neutral, joy, surprise, sadness, anger, disgust, fear | No |
| emotion-stimulus | 2015 | dialogues | 2.5k sentences | sadness, joy, anger, fear, surprise, disgust | No |
| isear | 1990 | emotional situations | 7.5k sentences | joy, fear, anger, sadness, disgust, shame, guilt | Yes |
links: dailydialog, emotion-stimulus, isear
Dataset was combined from dailydialog, isear, and emotion-stimulus to create a balanced dataset with 5 labels: joy, sad, anger, fear, and neutral. The texts mainly consist of short messages and dialog utterances.
- Data preprocessing: noise and punctuation removal, tokenization, stemming
- Text Representation: TF-IDF
- Classifiers: Naive Bayes, Random Forrest, Logistic Regrassion, SVM
| Approach | F1-Score |
|---|---|
| Naive Bayes | 0.6702 |
| Random Forrest | 0.6372 |
| Logistic Regression | 0.6935 |
| SVM | 0.7271 |
- Data preprocessing: noise and punctuation removal, tokenization
- Word Embeddings: pretrained 300 dimensional word2vec (link)
- Deep Network: LSTM, biLSTM, CNN
| Approach | F1-Score |
|---|---|
| LSTM + w2v_wiki | 0.7395 |
| biLSTM + w2v_wiki | 0.7414 |
| CNN + w2v_wiki | 0.7580 |
Finetuning BERT for text classification
| Approach | F1-Score |
|---|---|
| finetuned BERT | 0.8320 |