Classification comparison of machine learning models and techniques on an emotion dataset.
- Change directory to preparation/
- Unzip data.zip; you will get data/
- Run prepare.ipynb to get dataset-emotion.p
- Run word-vector.ipynb to get vector-emotion.p
- Run prepare-vocab.ipynb to get dataset-dictionary.p and dataset-dictionary-reverse.p
- The dataset folder must be named 'data'
- Split the documents by class and create a sub-folder for each class:
data
|-- positive
|   |-- data.txt
|-- negative
|   |-- data.txt
- Redo steps 3-5 above (a sketch for loading the resulting pickle files follows).
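The preparation notebooks serialize their outputs as pickle files (the .p artifacts above). A minimal loading sketch, assuming plain pickled Python objects; the paths and the commented object layouts are assumptions, not guaranteed by the notebooks:

```python
import pickle

def load_pickle(path):
    # Generic pickle loader for the artifacts produced by the preparation notebooks
    with open(path, 'rb') as fopen:
        return pickle.load(fopen)

dataset = load_pickle('preparation/dataset-emotion.p')                    # assumed: texts + labels
word_vectors = load_pickle('preparation/vector-emotion.p')                # assumed: trained word vectors
dictionary = load_pickle('preparation/dataset-dictionary.p')              # assumed: word -> id
rev_dictionary = load_pickle('preparation/dataset-dictionary-reverse.p')  # assumed: id -> word
```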
- Bag of Words (BOW) / unigram (a vectorization sketch follows this list)
- TFIDF
- Timestamp based on dictionary position
- SVD / LSA
- Word Vector
- Hashing vectorization
- Feature engineering (stop-word counts, special-character counts, character SVD, multinomial outputs) ensembled on LGB
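Most of these vectorization techniques map directly onto scikit-learn. A minimal sketch with default settings; the toy corpus and the component count are placeholders, not the notebooks' exact configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD

texts = ['i feel really happy today', 'i am so angry about this']     # toy stand-in corpus

bow = CountVectorizer().fit_transform(texts)                          # Bag of Words / unigram counts
tfidf = TfidfVectorizer().fit_transform(texts)                        # TFIDF weighting
hashed = HashingVectorizer(n_features=2 ** 10).fit_transform(texts)   # hashing vectorization

# SVD / LSA: low-rank dense projection of the TFIDF matrix.
# The feature sets below use 50 components; 2 here only because the toy corpus is tiny.
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)
```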
- Oracle
- TFIDF
- TFIDF-SVD 50 Components
- Dictionary timestamp average sequences
- NCE-Vector
- BOW
- TFIDF
- TFIDF-SVD 50 Components
- Dictionary timestamp 50 sequences
- Dictionary timestamp average sequences
- BOW
- TFIDF
- Hashing
- Multinomial BOW
- Multinomial TFIDF
- Multinomial Hashing
- BOW
- TFIDF
- Hashing
- Bidirectional LSTM RNN on Word Vector (sketched after this list)
- CNN-LSTM RNN on Word Vector
- CNN on Word Vector
- Feedforward average Word Vector
- LSTM RNN dictionary timestamp average sequences
- LSTM RNN on Word Vector
- LSTM RNN Hinge on Word Vector
- LSTM RNN Huber on Word Vector
- LSTM RNN Stack Hinge + Huber + Cross Entropy on Word Vector
- Feedforward average Word Vector, self-optimized using Bayesian optimization
- LSTM Attention RNN on Word Vector
- LSTM Seq-to-Seq Attention RNN on Word Vector
- LSTM Seq-to-Seq RNN on Word Vector
- Layer-Norm LSTM RNN on Word Vector
- Neural Turing Machine on Word Vector
- Only Attention Neural Network on Word Vector
- Multi Attention Neural Network on Word Vector
- K-max Conv1D on Word Vector
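Every deep-learning entry above feeds word-vector sequences into some recurrent, convolutional, or attention encoder. A minimal Keras-style sketch of the bidirectional LSTM variant, assuming frozen pre-trained embeddings, 128 hidden units, and six output classes; this is an illustration only, not the author's exact TensorFlow code:

```python
import numpy as np
import tensorflow as tf

num_words, embed_dim, maxlen, num_classes = 20000, 256, 50, 6  # assumed sizes; 6 = emotion labels
embedding_matrix = np.random.uniform(-1, 1, (num_words, embed_dim)).astype('float32')  # stand-in for vector-emotion.p

model = tf.keras.Sequential([
    tf.keras.Input(shape=(maxlen,), dtype='int32'),             # padded sequences of dictionary ids
    tf.keras.layers.Embedding(
        num_words, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                       # keep the pre-trained word vectors fixed
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(..., callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)]) would match the
# early-stopping note below.
```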
All deep-learning models use the Word Vector generated by word-vector.ipynb.
All notebooks apply the same text-cleaning pre-processing with a regex: re.sub('[^A-Za-z0-9 ]+', '', string).
All models use early stopping to prevent overfitting.
BOW and TFIDF features are assumed to be generated the same way across all models.
All models are trained on 80% of the dataset and validated on the remaining 20% (a sketch follows below).
Some comparisons are not fully consistent; for example, recall and F1 are not calculated for the neural-network-based models.
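A minimal sketch of the cleaning and 80/20 split described above; the regex is taken verbatim from the note, while train_test_split with test_size=0.2 and the placeholder data are assumptions:

```python
import re
from sklearn.model_selection import train_test_split

def clean(string):
    # Same regex as the notebooks: keep only alphanumerics and spaces
    return re.sub('[^A-Za-z0-9 ]+', '', string)

raw_texts = ['i feel amazing!!', 'he is so angry :(']   # placeholders for the loaded dataset
labels = ['joy', 'anger']

texts = [clean(t) for t in raw_texts]
train_X, valid_X, train_Y, valid_Y = train_test_split(texts, labels, test_size=0.2)  # 80% train / 20% validation
```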
Oracle classifier accuracy=0.9312446022454659 #cv1
Oracle classifier accuracy=0.9294453507340946 #cv2
Oracle classifier accuracy=0.9295859123842426 #cv3
Oracle classifier accuracy=0.9295859123842426 #cv4
Oracle classifier accuracy=0.9305438929008422 #cv5
Oracle classifier accuracy=0.9279287924953816 #cv6
Oracle classifier accuracy=0.9291523715841751 #cv7
Oracle classifier accuracy=0.9278516243581746 #cv8
Oracle classifier accuracy=0.9277316569892989 #cv9
Oracle classifier accuracy=0.9281395460434761 #cv10
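The classical-model results that follow (Naive Bayes, SVM, XGB, LGB) are printed in scikit-learn's classification_report format. A minimal sketch of how one such block might be produced, assuming MultinomialNB on TFIDF features with default settings rather than the notebooks' exact parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Tiny stand-in corpus; the real comparison runs on the prepared emotion dataset
texts = ['i feel so happy', 'what a joyful day', 'i am furious', 'this makes me angry',
         'i feel scared tonight', 'that was terrifying']
labels = ['joy', 'joy', 'anger', 'anger', 'fear', 'fear']

train_X, valid_X, train_Y, valid_Y = train_test_split(texts, labels, test_size=0.2)

tfidf = TfidfVectorizer().fit(train_X)
clf = MultinomialNB().fit(tfidf.transform(train_X), train_Y)
predicted = clf.predict(tfidf.transform(valid_X))

print('accuracy validation set:', accuracy_score(valid_Y, predicted))
print(classification_report(valid_Y, predicted))
```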
- Naive Bayes
accuracy validation set: 0.859072479067
precision recall f1-score support
anger 0.90 0.84 0.87 11464
fear 0.84 0.81 0.82 9455
joy 0.85 0.93 0.89 28246
love 0.82 0.61 0.70 6920
sadness 0.87 0.94 0.91 24263
surprise 0.84 0.34 0.49 3014
avg / total 0.86 0.86 0.85 83362
- SVM Kernel Based
accuracy validation set: 0.898586886111
precision recall f1-score support
anger 0.91 0.88 0.90 11422
fear 0.84 0.87 0.86 9495
joy 0.90 0.94 0.92 28138
love 0.84 0.74 0.79 6970
sadness 0.93 0.94 0.94 24380
surprise 0.85 0.65 0.73 2957
avg / total 0.90 0.90 0.90 83362
- XGB
accuracy validation set: 0.895132074566
precision recall f1-score support
anger 0.88 0.92 0.90 11421
fear 0.83 0.84 0.84 9505
joy 0.93 0.91 0.92 28132
love 0.76 0.79 0.78 6801
sadness 0.95 0.94 0.94 24481
surprise 0.70 0.72 0.71 3022
avg / total 0.90 0.90 0.90 83362
- Naive Bayes
accuracy validation set: 0.734855209808
precision recall f1-score support
anger 0.93 0.54 0.69 11336
fear 0.91 0.37 0.53 9603
joy 0.68 0.98 0.80 28062
love 0.96 0.16 0.27 7085
sadness 0.74 0.94 0.83 24278
surprise 0.94 0.04 0.08 2998
avg / total 0.79 0.73 0.69 83362
- SVM Kernel Based
accuracy validation set: 0.850915285142
precision recall f1-score support
anger 0.93 0.75 0.83 11542
fear 0.88 0.73 0.79 9610
joy 0.79 0.97 0.87 28110
love 0.92 0.55 0.69 6883
sadness 0.88 0.94 0.91 24230
surprise 0.91 0.46 0.61 2987
avg / total 0.86 0.85 0.84 83362
- XGB
accuracy validation set: 0.885415417097
precision recall f1-score support
anger 0.88 0.90 0.89 11414
fear 0.81 0.83 0.82 9584
joy 0.92 0.91 0.91 28269
love 0.74 0.77 0.75 6878
sadness 0.95 0.93 0.94 24196
surprise 0.66 0.67 0.67 3021
avg / total 0.89 0.89 0.89 83362
- LGB
accuracy validation set: 0.902245627504
precision recall f1-score support
anger 0.91 0.91 0.91 11550
fear 0.84 0.89 0.86 9455
joy 0.92 0.92 0.92 28299
love 0.77 0.88 0.82 6910
sadness 0.96 0.92 0.94 24111
surprise 0.82 0.70 0.75 3037
avg / total 0.90 0.90 0.90 83362
- LGB average word length
precision recall f1-score support
anger 0.53 0.22 0.31 11587
fear 0.50 0.18 0.27 9504
joy 0.46 0.73 0.57 28074
love 0.30 0.08 0.13 6949
sadness 0.49 0.57 0.53 24293
surprise 0.26 0.09 0.13 2955
avg / total 0.46 0.47 0.43 83362
- RNN average word length
epoch: 74 , training loss: 1.56926864815 , train acc: 0.344564461275
epoch: 75 , training loss: 1.56929268564 , train acc: 0.345266404707
epoch: 76 , training loss: 1.56907495159 , train acc: 0.344702450399
epoch: 77 , training loss: 1.56922572671 , train acc: 0.344630455659
epoch: 78 , training loss: 1.56905308527 , train acc: 0.345173412831
epoch: 79 , training loss: 1.56924159172 , train acc: 0.34504442315
- Self-optimized Feed-forward Neural Network average word length
epoch: 1 , pass acc: 0.305403 , current acc: 0.335717
epoch: 2 , pass acc: 0.335717 , current acc: 0.343142
epoch: 3 , pass acc: 0.343142 , current acc: 0.345385
epoch: 4 , pass acc: 0.345385 , current acc: 0.345745
epoch: 5 , pass acc: 0.345745 , current acc: 0.346273
epoch: 8 , pass acc: 0.346273 , current acc: 0.347029
break epoch: 107
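The 'pass acc / current acc / break epoch' lines above come from a simple early-stopping loop: validation accuracy is checked every epoch, only improvements are logged, and training breaks after a fixed number of epochs without improvement. A minimal sketch of that pattern; the patience value and the placeholder functions are assumptions:

```python
import random

def train_one_epoch():
    pass                     # placeholder: one full pass over the training set

def evaluate_validation():
    return random.random()   # placeholder: accuracy on the 20% validation split

EARLY_STOPPING, CURRENT_ACC, CURRENT_CHECKPOINT, EPOCH = 5, 0.0, 0, 0
while True:
    if CURRENT_CHECKPOINT == EARLY_STOPPING:
        print('break epoch:', EPOCH)
        break
    train_one_epoch()
    acc = evaluate_validation()
    if acc > CURRENT_ACC:
        print('epoch:', EPOCH, ', pass acc:', CURRENT_ACC, ', current acc:', acc)
        CURRENT_ACC, CURRENT_CHECKPOINT = acc, 0
    else:
        CURRENT_CHECKPOINT += 1
    EPOCH += 1
```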
- XGB average word length
precision recall f1-score support
anger 0.48 0.22 0.31 11390
fear 0.46 0.20 0.28 9759
joy 0.48 0.72 0.58 27981
love 0.26 0.08 0.12 6838
sadness 0.49 0.58 0.53 24395
surprise 0.21 0.07 0.11 2999
avg / total 0.45 0.47 0.44 83362
- XGB 50 word length
precision recall f1-score support
anger 0.48 0.21 0.30 11320
fear 0.45 0.18 0.25 9658
joy 0.47 0.72 0.57 28342
love 0.27 0.08 0.12 6901
sadness 0.48 0.57 0.52 24103
surprise 0.22 0.07 0.11 3038
avg / total 0.45 0.47 0.43 83362
- XGB 50 dimensions
precision recall f1-score support
anger 0.34 0.07 0.11 11336
fear 0.30 0.06 0.10 9694
joy 0.46 0.73 0.56 28068
love 0.17 0.01 0.03 6987
sadness 0.40 0.54 0.46 24277
surprise 0.07 0.00 0.01 3000
avg / total 0.37 0.42 0.35 83362
- LGB 50 dimensions
precision recall f1-score support
anger 0.38 0.05 0.09 11460
fear 0.32 0.06 0.10 9545
joy 0.44 0.73 0.55 28052
love 0.17 0.01 0.02 7015
sadness 0.39 0.54 0.45 24291
surprise 0.09 0.01 0.01 2999
avg / total 0.37 0.42 0.34 83362
- Feed-forward word vector
epoch: 220 , training loss: 0.536347100064 , training acc: 0.88623161306 , valid loss: 0.796489044223 , valid acc: 0.808128580994
epoch: 221 , training loss: 0.535965369 , training acc: 0.886359573154 , valid loss: 0.796616834379 , valid acc: 0.808147773232
epoch: 222 , training loss: 0.535587115036 , training acc: 0.886493929708 , valid loss: 0.796744917725 , valid acc: 0.808205356811
epoch: 223 , training loss: 0.535212274276 , training acc: 0.886564308321 , valid loss: 0.796873355467 , valid acc: 0.808291729718
break epoch: 223
- Recurrent Neural Network LSTM
epoch: 43 , training loss: 0.293494431497 , training acc: 0.91647871451 , valid loss: 0.301500787934 , valid acc: 0.911824742285
time taken: 201.13964009284973
epoch: 44 , training loss: 0.292884760499 , training acc: 0.916391732865 , valid loss: 0.316479788662 , valid acc: 0.908499411174
time taken: 201.1136453151703
epoch: 45 , training loss: 0.292009347957 , training acc: 0.916490714602 , valid loss: 0.301596783737 , valid acc: 0.910792332308
- Convolutional Neural Network
epoch: 45 , training loss: 0.406819455151 , training acc: 0.900872836373 , valid loss: 0.413255342881 , valid acc: 0.896866754467
epoch: 46 , training loss: 0.406384702676 , training acc: 0.900506910515 , valid loss: 0.412755171971 , valid acc: 0.896986806521
epoch: 47 , training loss: 0.406851520408 , training acc: 0.901097791847 , valid loss: 0.414064356509 , valid acc: 0.895942385028
- CNN + RNN
epoch: 11 , training loss: 0.89633845398 , training acc: 0.742852410318 , valid loss: 1.31553278515 , valid acc: 0.62818725459
epoch: 12 , training loss: 0.870958457021 , training acc: 0.751430695276 , valid loss: 1.35667084581 , valid acc: 0.625294097582
epoch: 13 , training loss: 0.867803699844 , training acc: 0.751742633378 , valid loss: 1.37888059403 , valid acc: 0.626062403385
break epoch: 13
- Bidirectional Recurrent Neural Network LSTM
epoch: 31 , training loss: 0.306716536762 , training acc: 0.915449927471 , valid loss: 0.305131908171 , valid acc: 0.911728706609
time taken: 633.0987968444824
epoch: 32 , training loss: 0.30510045163 , training acc: 0.915593898146 , valid loss: 0.306183017638 , valid acc: 0.911968804136
time taken: 633.1091139316559
epoch: 33 , training loss: 0.304126529089 , training acc: 0.915686878639 , valid loss: 0.305108016744 , valid acc: 0.911548635682
- LSTM + RNN + Huber
epoch: 13 , training loss: 0.00908494556632 , training acc: 0.921331746462 , valid loss: 0.00935218644235 , valid acc: 0.916014416235
time taken: 192.9246437549591
epoch: 14 , training loss: 0.00894308844975 , training acc: 0.921859639933 , valid loss: 0.00925432719706 , valid acc: 0.915174080902
epoch: 14 , pass acc: 0.916014416235 , current acc: 0.91602642217
- LSTM + RNN + Hinge
epoch: 22 , training loss: 0.0533290837786 , training acc: 0.919721068543 , valid loss: 0.0562712795169 , valid acc: 0.913313335445
epoch: 22 , pass acc: 0.913505411806 , current acc: 0.913613458785
time taken: 192.64984250068665
epoch: 23 , training loss: 0.0531139997573 , training acc: 0.920024007827 , valid loss: 0.0563346863817 , valid acc: 0.913613458785
time taken: 192.55219149589539
- Naive Bayes
accuracy validation set: 0.578524987404
precision recall f1-score support
anger 0.93 0.07 0.12 11449
fear 0.96 0.03 0.05 9533
joy 0.49 1.00 0.66 28047
love 1.00 0.00 0.01 6967
sadness 0.76 0.79 0.78 24408
surprise 0.00 0.00 0.00 2958
avg / total 0.71 0.58 0.47 83362
- SVM Kernel Based
accuracy validation set: 0.791163839639
precision recall f1-score support
anger 0.92 0.64 0.76 11592
fear 0.87 0.59 0.70 9557
joy 0.71 0.97 0.82 28068
love 0.94 0.40 0.56 6933
sadness 0.83 0.90 0.87 24273
surprise 0.91 0.34 0.49 2939
avg / total 0.82 0.79 0.78 83362