Emotion-Classification-Comparison

A comparison of machine learning models and vectorization techniques for classification on an emotion dataset.

Preparation

1. Change directory to preparation/.
2. Unzip data.zip. You will get data/.
3. Run prepare.ipynb to get dataset-emotion.p.
4. Run word-vectior.ipynb to get vector-emotion.p.
5. Run prepare-vocab.ipynb to get dataset-dictionary.p and dataset-dictionary-reverse.p.

On how to prepare your own dataset

The dataset folder must be named 'data'.
Split the documents by class, and create one sub-folder per class:

data
|-- positive
	|-- data.txt
|-- negative
	|-- data.txt

Then repeat steps 3-5 of the Preparation section above.
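The folder layout above can be read back with a few lines of Python. This is only a sketch, not the code in prepare.ipynb: the function name `load_dataset` is hypothetical, and it assumes that every line of a text file is one document and that the sub-folder name is the class label.

```python
import os

def load_dataset(root="data"):
    """Read every text file under root/<class>/ and return (texts, labels).

    Assumption: each non-empty line of a file is one document.
    """
    texts, labels = [], []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue  # skip stray files next to the class folders
        for name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, name)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if line:
                        texts.append(line)
                        labels.append(label)
    return texts, labels
```

Calling `load_dataset("data")` on the example tree would return the lines of both data.txt files, labelled 'negative' and 'positive'.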

Vectorization Techniques

Bag of Words / Unigram
TFIDF
Timestamp based on dictionary position
SVD / LSA
Word Vector
Hashing Vectorization
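To make two of these techniques concrete, here is a small pure-Python sketch. It is an illustration under assumptions, not the notebooks' code: "timestamp based on dictionary position" is read here as replacing each word by its integer position in the vocabulary and padding to a fixed length, and the helper names are hypothetical.

```python
from collections import Counter

def build_dictionary(texts):
    """Map each distinct word to an integer position; 0 is kept for padding."""
    dictionary = {}
    for text in texts:
        for word in text.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary) + 1
    return dictionary

def bag_of_words(text, dictionary):
    """Unigram counts laid out in dictionary order (index 0 stays unused)."""
    counts = Counter(text.split())
    vector = [0] * (len(dictionary) + 1)
    for word, count in counts.items():
        if word in dictionary:
            vector[dictionary[word]] = count
    return vector

def dictionary_positions(text, dictionary, length):
    """Replace each word by its dictionary position, then pad or cut to `length`.

    Unknown words share index 0 with the padding in this sketch.
    """
    ids = [dictionary.get(word, 0) for word in text.split()][:length]
    return ids + [0] * (length - len(ids))

texts = ["i feel happy", "i feel so so sad"]
dictionary = build_dictionary(texts)
print(bag_of_words(texts[1], dictionary))             # [0, 1, 1, 0, 2, 1]
print(dictionary_positions(texts[1], dictionary, 6))  # [1, 2, 4, 4, 5, 0]
```

The bag-of-words vector discards word order, while the position sequence keeps it, which is why the latter is the form fed to sequence models.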

Models

Ensemble

Feature engineering (stop-word counts, special-character counts, character SVD, Multinomial output) ensembled with LGB
Oracle

Light Gradient Boosting

TFIDF
TFIDF-SVD 50 Components
Dictionary timestamp average sequences
NCE-Vector

eXtreme Gradient Boosting

BOW
TFIDF
TFIDF-SVD 50 Components
Dictionary timestamp 50 sequences
Dictionary timestamp average sequences

Support Vector Machine

BOW
TFIDF
Hashing

Naive Bayes

Multinomial BOW
Multinomial TFIDF
Multinomial Hashing

NB-SVM

BOW
TFIDF
Hashing
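For readers unfamiliar with NB-SVM: in the usual formulation (Wang and Manning, 2012), each feature is scaled by a Naive Bayes log-count ratio before a linear classifier such as an SVM is fitted on the scaled features. The sketch below shows only that ratio for the binary case, in pure Python. Whether the notebooks follow exactly this formulation is not stated in this README, and the function names and the smoothing value `alpha` are assumptions; with six emotion classes, a one-vs-rest scheme would compute one ratio per class.

```python
import math

def log_count_ratio(vectors, labels, alpha=1.0):
    """r = log((p / |p|_1) / (q / |q|_1)) with smoothed class-wise feature counts."""
    size = len(vectors[0])
    p = [alpha] * size  # smoothed counts over documents labelled 1
    q = [alpha] * size  # smoothed counts over documents labelled 0
    for vector, label in zip(vectors, labels):
        target = p if label == 1 else q
        for i, value in enumerate(vector):
            target[i] += value
    p_total, q_total = sum(p), sum(q)
    return [math.log((p[i] / p_total) / (q[i] / q_total)) for i in range(size)]

def scale(vector, ratio):
    """Element-wise product of a feature vector with the log-count ratio."""
    return [value * r for value, r in zip(vector, ratio)]

# Two toy BOW vectors: feature 0 only occurs in the positive document.
vectors = [[2, 0], [0, 2]]
labels = [1, 0]
ratio = log_count_ratio(vectors, labels)
print(ratio)
print(scale(vectors[0], ratio))
```

Features that are more frequent in the positive class get a positive ratio, the others a negative one; the scaled vectors are what the SVM would then be trained on.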

Deep Learning

Bidirectional LSTM RNN on Word Vector
CNN-LSTM RNN on Word Vector
CNN on Word Vector
Feedforward average Word Vector
LSTM RNN dictionary timestamp average sequences
LSTM RNN on Word Vector
LSTM RNN Hinge on Word Vector
LSTM RNN Huber on Word Vector
LSTM RNN Stack Hinge + Huber + Cross Entropy on Word Vector
Self-optimized using Bayesian Feedforward average Word Vector
LSTM Attention RNN on Word Vector
LSTM Seq-to-Seq Attention RNN on Word Vector
LSTM Seq-to-Seq RNN on Word Vector
Layer-Norm LSTM RNN on Word Vector
Neural Turing Machine on Word Vector
Only Attention Neural Network on Word Vector
Multi Attention Neural Network on Word Vector
K-max Conv1D on Word Vector

Assumptions

All deep learning models use the word vectors generated by word.vector.ipynb.

All notebooks clean the text with a regex during pre-processing: re.sub('[^A-Za-z0-9 ]+', '', string).

All models use early stopping to prevent overfitting.

The BOW and TFIDF features are assumed to be generated the same way in every notebook.

All models are trained on 80% of the dataset and validated on the remaining 20%.

Some comparisons are not consistent; for example, for the neural-network-based models I did not calculate recall and F1.
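The cleaning and splitting assumptions above can be sketched as follows. The regex is the one quoted in this README; everything else (the function names, shuffling before the split, the fixed seed) is an assumption made for illustration and may differ from what the notebooks do.

```python
import random
import re

def clean(string):
    """Drop every character that is not a letter, a digit, or a space."""
    return re.sub('[^A-Za-z0-9 ]+', '', string)

def train_validation_split(texts, labels, validation_ratio=0.2, seed=0):
    """Shuffle the (text, label) pairs once, then hold out the last share."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # assumption: the split is random
    cut = len(pairs) - int(len(pairs) * validation_ratio)
    return pairs[:cut], pairs[cut:]

print(clean("I'm so happy, really!"))  # Im so happy really

texts = ["document %d" % i for i in range(10)]
labels = ["joy"] * 5 + ["sadness"] * 5
train, validation = train_validation_split(texts, labels)
print(len(train), len(validation))  # 8 2
```

Note that the regex removes punctuation but keeps case, so any lower-casing would have to happen in a separate step.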

Results

Oracle

Oracle classifier accuracy=0.9312446022454659 #cv1
Oracle classifier accuracy=0.9294453507340946 #cv2
Oracle classifier accuracy=0.9295859123842426 #cv3
Oracle classifier accuracy=0.9295859123842426 #cv4
Oracle classifier accuracy=0.9305438929008422 #cv5
Oracle classifier accuracy=0.9279287924953816 #cv6
Oracle classifier accuracy=0.9291523715841751 #cv7
Oracle classifier accuracy=0.9278516243581746 #cv8
Oracle classifier accuracy=0.9277316569892989 #cv9
Oracle classifier accuracy=0.9281395460434761 #cv10
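The ten accuracies above can be summarised with the standard library. The snippet only aggregates the numbers already listed; it does not re-run anything.

```python
from statistics import mean, stdev

# Oracle accuracies for #cv1 to #cv10, copied from the list above.
cv_accuracy = [
    0.9312446022454659, 0.9294453507340946, 0.9295859123842426,
    0.9295859123842426, 0.9305438929008422, 0.9279287924953816,
    0.9291523715841751, 0.9278516243581746, 0.9277316569892989,
    0.9281395460434761,
]

print("mean accuracy:", round(mean(cv_accuracy), 4))
print("sample standard deviation:", round(stdev(cv_accuracy), 4))
```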

BOW / Unigram

Naive Bayes

accuracy validation set:  0.859072479067
             precision    recall  f1-score   support

      anger       0.90      0.84      0.87     11464
       fear       0.84      0.81      0.82      9455
        joy       0.85      0.93      0.89     28246
       love       0.82      0.61      0.70      6920
    sadness       0.87      0.94      0.91     24263
   surprise       0.84      0.34      0.49      3014

avg / total       0.86      0.86      0.85     83362

SVM Kernel based

accuracy validation set:  0.898586886111
             precision    recall  f1-score   support

      anger       0.91      0.88      0.90     11422
       fear       0.84      0.87      0.86      9495
        joy       0.90      0.94      0.92     28138
       love       0.84      0.74      0.79      6970
    sadness       0.93      0.94      0.94     24380
   surprise       0.85      0.65      0.73      2957

avg / total       0.90      0.90      0.90     83362

accuracy validation set:  0.895132074566
             precision    recall  f1-score   support

      anger       0.88      0.92      0.90     11421
       fear       0.83      0.84      0.84      9505
        joy       0.93      0.91      0.92     28132
       love       0.76      0.79      0.78      6801
    sadness       0.95      0.94      0.94     24481
   surprise       0.70      0.72      0.71      3022

avg / total       0.90      0.90      0.90     83362

TFIDF

Naive Bayes

accuracy validation set:  0.734855209808
             precision    recall  f1-score   support

      anger       0.93      0.54      0.69     11336
       fear       0.91      0.37      0.53      9603
        joy       0.68      0.98      0.80     28062
       love       0.96      0.16      0.27      7085
    sadness       0.74      0.94      0.83     24278
   surprise       0.94      0.04      0.08      2998

avg / total       0.79      0.73      0.69     83362

SVM Kernel based

accuracy validation set:  0.850915285142
             precision    recall  f1-score   support

      anger       0.93      0.75      0.83     11542
       fear       0.88      0.73      0.79      9610
        joy       0.79      0.97      0.87     28110
       love       0.92      0.55      0.69      6883
    sadness       0.88      0.94      0.91     24230
   surprise       0.91      0.46      0.61      2987

avg / total       0.86      0.85      0.84     83362

accuracy validation set:  0.885415417097
             precision    recall  f1-score   support

      anger       0.88      0.90      0.89     11414
       fear       0.81      0.83      0.82      9584
        joy       0.92      0.91      0.91     28269
       love       0.74      0.77      0.75      6878
    sadness       0.95      0.93      0.94     24196
   surprise       0.66      0.67      0.67      3021

avg / total       0.89      0.89      0.89     83362

accuracy validation set:  0.902245627504
             precision    recall  f1-score   support

      anger       0.91      0.91      0.91     11550
       fear       0.84      0.89      0.86      9455
        joy       0.92      0.92      0.92     28299
       love       0.77      0.88      0.82      6910
    sadness       0.96      0.92      0.94     24111
   surprise       0.82      0.70      0.75      3037

avg / total       0.90      0.90      0.90     83362

Timestamp based on Dictionary position

LGB average word length

             precision    recall  f1-score   support

      anger       0.53      0.22      0.31     11587
       fear       0.50      0.18      0.27      9504
        joy       0.46      0.73      0.57     28074
       love       0.30      0.08      0.13      6949
    sadness       0.49      0.57      0.53     24293
   surprise       0.26      0.09      0.13      2955

avg / total       0.46      0.47      0.43     83362

RNN average word length

epoch: 74 , training loss:  1.56926864815 , train acc:  0.344564461275
epoch: 75 , training loss:  1.56929268564 , train acc:  0.345266404707
epoch: 76 , training loss:  1.56907495159 , train acc:  0.344702450399
epoch: 77 , training loss:  1.56922572671 , train acc:  0.344630455659
epoch: 78 , training loss:  1.56905308527 , train acc:  0.345173412831
epoch: 79 , training loss:  1.56924159172 , train acc:  0.34504442315

Self optimized Feed-forward Neural Network average word length

epoch: 1 , pass acc: 0.305403 , current acc: 0.335717
epoch: 2 , pass acc: 0.335717 , current acc: 0.343142
epoch: 3 , pass acc: 0.343142 , current acc: 0.345385
epoch: 4 , pass acc: 0.345385 , current acc: 0.345745
epoch: 5 , pass acc: 0.345745 , current acc: 0.346273
epoch: 8 , pass acc: 0.346273 , current acc: 0.347029
break epoch: 107
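The "pass acc", "current acc" and "break epoch" lines in the log above are consistent with an early-stopping loop that tracks the best accuracy seen so far and stops once it has not improved for a number of epochs. The following is a minimal sketch of that idea, not the notebook's code: `evaluate`, `patience` and the function name are hypothetical.

```python
def train_with_early_stopping(evaluate, max_epochs, patience):
    """Call evaluate(epoch) once per epoch; stop after `patience` epochs
    without a new best accuracy. Returns (best accuracy, epoch it occurred)."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        acc = evaluate(epoch)  # in practice: train one epoch, then validate
        if acc > best_acc:
            print("epoch:", epoch, ", pass acc:", best_acc, ", current acc:", acc)
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            print("break epoch:", epoch)
            break
    return best_acc, best_epoch

# Toy accuracy curve standing in for a real model.
accuracies = [0.30, 0.33, 0.34, 0.34, 0.33, 0.32, 0.31]
result = train_with_early_stopping(lambda epoch: accuracies[epoch - 1],
                                   max_epochs=len(accuracies), patience=3)
print(result)  # (0.34, 3)
```

With `patience=3` the toy run stops at epoch 6, three epochs after the best accuracy was reached at epoch 3.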

XGB average word length

             precision    recall  f1-score   support

      anger       0.48      0.22      0.31     11390
       fear       0.46      0.20      0.28      9759
        joy       0.48      0.72      0.58     27981
       love       0.26      0.08      0.12      6838
    sadness       0.49      0.58      0.53     24395
   surprise       0.21      0.07      0.11      2999

avg / total       0.45      0.47      0.44     83362

XGB 50 word length

             precision    recall  f1-score   support

      anger       0.48      0.21      0.30     11320
       fear       0.45      0.18      0.25      9658
        joy       0.47      0.72      0.57     28342
       love       0.27      0.08      0.12      6901
    sadness       0.48      0.57      0.52     24103
   surprise       0.22      0.07      0.11      3038

avg / total       0.45      0.47      0.43     83362

SVD / LSA

XGB 50 dimensions

             precision    recall  f1-score   support

      anger       0.34      0.07      0.11     11336
       fear       0.30      0.06      0.10      9694
        joy       0.46      0.73      0.56     28068
       love       0.17      0.01      0.03      6987
    sadness       0.40      0.54      0.46     24277
   surprise       0.07      0.00      0.01      3000

avg / total       0.37      0.42      0.35     83362

LGB 50 dimensions

             precision    recall  f1-score   support

      anger       0.38      0.05      0.09     11460
       fear       0.32      0.06      0.10      9545
        joy       0.44      0.73      0.55     28052
       love       0.17      0.01      0.02      7015
    sadness       0.39      0.54      0.45     24291
   surprise       0.09      0.01      0.01      2999

avg / total       0.37      0.42      0.34     83362

Word Vector

Feed-forward word vector

epoch: 220 , training loss: 0.536347100064 , training acc: 0.88623161306 , valid loss: 0.796489044223 , valid acc: 0.808128580994
epoch: 221 , training loss: 0.535965369 , training acc: 0.886359573154 , valid loss: 0.796616834379 , valid acc: 0.808147773232
epoch: 222 , training loss: 0.535587115036 , training acc: 0.886493929708 , valid loss: 0.796744917725 , valid acc: 0.808205356811
epoch: 223 , training loss: 0.535212274276 , training acc: 0.886564308321 , valid loss: 0.796873355467 , valid acc: 0.808291729718
break epoch: 223

Recurrent Neural Network LSTM

epoch: 43 , training loss: 0.293494431497 , training acc: 0.91647871451 , valid loss: 0.301500787934 , valid acc: 0.911824742285
'unwarrentedly'
time taken: 201.13964009284973
epoch: 44 , training loss: 0.292884760499 , training acc: 0.916391732865 , valid loss: 0.316479788662 , valid acc: 0.908499411174
'unwarrentedly'
time taken: 201.1136453151703
epoch: 45 , training loss: 0.292009347957 , training acc: 0.916490714602 , valid loss: 0.301596783737 , valid acc: 0.910792332308

Convolutional Neural Network

epoch: 45 , training loss: 0.406819455151 , training acc: 0.900872836373 , valid loss: 0.413255342881 , valid acc: 0.896866754467
epoch: 46 , training loss: 0.406384702676 , training acc: 0.900506910515 , valid loss: 0.412755171971 , valid acc: 0.896986806521
epoch: 47 , training loss: 0.406851520408 , training acc: 0.901097791847 , valid loss: 0.414064356509 , valid acc: 0.895942385028

CNN + RNN

epoch: 11 , training loss: 0.89633845398 , training acc: 0.742852410318 , valid loss: 1.31553278515 , valid acc: 0.62818725459
'unwarrentedly'
epoch: 12 , training loss: 0.870958457021 , training acc: 0.751430695276 , valid loss: 1.35667084581 , valid acc: 0.625294097582
'unwarrentedly'
epoch: 13 , training loss: 0.867803699844 , training acc: 0.751742633378 , valid loss: 1.37888059403 , valid acc: 0.626062403385
break epoch: 13

Bidirectional Recurrent Neural Network LSTM

epoch: 31 , training loss: 0.306716536762 , training acc: 0.915449927471 , valid loss: 0.305131908171 , valid acc: 0.911728706609
'unwarrentedly'
time taken: 633.0987968444824
epoch: 32 , training loss: 0.30510045163 , training acc: 0.915593898146 , valid loss: 0.306183017638 , valid acc: 0.911968804136
'unwarrentedly'
time taken: 633.1091139316559
epoch: 33 , training loss: 0.304126529089 , training acc: 0.915686878639 , valid loss: 0.305108016744 , valid acc: 0.911548635682

LSTM + RNN + Huber

epoch: 13 , training loss: 0.00908494556632 , training acc: 0.921331746462 , valid loss: 0.00935218644235 , valid acc: 0.916014416235
'unwarrentedly'
time taken: 192.9246437549591
epoch: 14 , training loss: 0.00894308844975 , training acc: 0.921859639933 , valid loss: 0.00925432719706 , valid acc: 0.915174080902
'unwarrentedly'
epoch: 14 , pass acc: 0.916014416235 , current acc: 0.91602642217

LSTM + RNN + Hinge

epoch: 22 , training loss: 0.0533290837786 , training acc: 0.919721068543 , valid loss: 0.0562712795169 , valid acc: 0.913313335445
epoch: 22 , pass acc: 0.913505411806 , current acc: 0.913613458785
time taken: 192.64984250068665
epoch: 23 , training loss: 0.0531139997573 , training acc: 0.920024007827 , valid loss: 0.0563346863817 , valid acc: 0.913613458785
time taken: 192.55219149589539

Hashing Vectorization

Naive Bayes

accuracy validation set:  0.578524987404
             precision    recall  f1-score   support

      anger       0.93      0.07      0.12     11449
       fear       0.96      0.03      0.05      9533
        joy       0.49      1.00      0.66     28047
       love       1.00      0.00      0.01      6967
    sadness       0.76      0.79      0.78     24408
   surprise       0.00      0.00      0.00      2958

avg / total       0.71      0.58      0.47     83362

SVM Kernel based

accuracy validation set:  0.791163839639
             precision    recall  f1-score   support

      anger       0.92      0.64      0.76     11592
       fear       0.87      0.59      0.70      9557
        joy       0.71      0.97      0.82     28068
       love       0.94      0.40      0.56      6933
    sadness       0.83      0.90      0.87     24273
   surprise       0.91      0.34      0.49      2939

avg / total       0.82      0.79      0.78     83362
