Emotion-Classification-Comparison

A comparison of machine learning models and vectorization techniques for classification on an emotion dataset.

Preparation

  1. Change directory to preparation/
  2. Unzip data.zip to get the data/ folder
  3. Run prepare.ipynb to get dataset-emotion.p
  4. Run word-vector.ipynb to get vector-emotion.p
  5. Run prepare-vocab.ipynb to get dataset-dictionary.p and dataset-dictionary-reverse.p

How to prepare your own dataset

  1. The dataset folder must be named 'data'.
  2. Split the documents by class and create one sub-folder per class:
data
|-- positive
	|-- data.txt
|-- negative
	|-- data.txt
  3. Redo steps 3-5 above.
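
The folder layout above can be read with a few lines of Python. This is only a sketch and not the code in prepare.ipynb: the function name `load_dataset` is hypothetical, and it assumes each data.txt holds one document per line, which the layout above does not state.

```python
from pathlib import Path


def load_dataset(root="data"):
    """Return (texts, labels); each label is the name of a class sub-folder.

    Assumption: every .txt file stores one document per line.
    """
    texts, labels = [], []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        for text_file in sorted(class_dir.glob("*.txt")):
            with open(text_file, encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if line:  # skip empty lines
                        texts.append(line)
                        labels.append(class_dir.name)
    return texts, labels
```

Calling `load_dataset("data")` on the example tree returns the documents from negative/ first, then positive/, because the class folders are visited in sorted order.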

Vectorization Techniques

  1. Bag of Words (BOW) / Unigram
  2. TFIDF
  3. Timestamp based on dictionary position
  4. SVD / LSA
  5. Word Vector
  6. Hashing Vectorization
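
Technique 3 is the least standard item in the list, so here is one possible reading of it next to plain Bag of Words. This is a sketch under assumptions: every word is replaced by its integer position in a dictionary, and the sequence is padded or truncated to a fixed length (for example 50, or the average document length). The helper names are hypothetical and the notebooks may do this differently.

```python
from collections import Counter


def build_dictionary(texts):
    """Map each word to an integer id, most frequent first; 0 is reserved."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def bag_of_words(text, dictionary):
    """Count vector with one slot per dictionary id (slot 0 stays unused)."""
    vector = [0] * (len(dictionary) + 1)
    for word in text.split():
        if word in dictionary:
            vector[dictionary[word]] += 1
    return vector


def position_sequence(text, dictionary, maxlen):
    """Sequence of dictionary ids, truncated or zero-padded to maxlen.

    Unknown words also map to 0 in this sketch, the same value as padding.
    """
    ids = [dictionary.get(word, 0) for word in text.split()][:maxlen]
    return ids + [0] * (maxlen - len(ids))
```

Unlike Bag of Words, the position sequence keeps word order but feeds arbitrary integer ids to the model as if they were magnitudes, which may help explain why the timestamp results below are much weaker than BOW or TFIDF.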

Models

Ensemble

  1. Feature ensemble (stop word counts, special character counts, character SVD, Multinomial output) on LGB
  2. Oracle

Light Gradient Boosting

  1. TFIDF
  2. TFIDF-SVD 50 Components
  3. Dictionary timestamp average sequences
  4. NCE-Vector

eXtreme Gradient Boosting

  1. BOW
  2. TFIDF
  3. TFIDF-SVD 50 Components
  4. Dictionary timestamp 50 sequences
  5. Dictionary timestamp average sequences

Support Vector Machine

  1. BOW
  2. TFIDF
  3. Hashing

Naive Bayes

  1. Multinomial BOW
  2. Multinomial TFIDF
  3. Multinomial Hashing

NB-SVM

  1. BOW
  2. TFIDF
  3. Hashing
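
For readers unfamiliar with NB-SVM: in its commonly described form, each feature is scaled by a Naive Bayes log-count ratio before a linear classifier is fitted on the scaled features. The sketch below computes that ratio for a binary problem in plain Python. The function names and the smoothing value `alpha=1.0` are illustrative assumptions; for the six emotion classes a one-vs-rest setup would be one option, and the notebooks may implement this differently.

```python
import math


def log_count_ratio(rows, labels, alpha=1.0):
    """Naive Bayes log-count ratio for binary labels (1 = positive, 0 = negative).

    rows is a list of equal-length count vectors, e.g. BOW features.
    """
    n_features = len(rows[0])
    p = [alpha] * n_features  # smoothed feature counts in the positive class
    q = [alpha] * n_features  # smoothed feature counts in the negative class
    for row, label in zip(rows, labels):
        target = p if label == 1 else q
        for j, value in enumerate(row):
            target[j] += value
    p_total, q_total = sum(p), sum(q)
    return [math.log((p[j] / p_total) / (q[j] / q_total))
            for j in range(n_features)]


def scale_features(row, ratio):
    """Element-wise product of a feature vector with the log-count ratio."""
    return [value * r for value, r in zip(row, ratio)]
```

The scaled vectors from `scale_features` would then be passed to the linear SVM (or logistic regression) in place of the raw counts.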

Deep Learning

  1. Bidirectional LSTM RNN on Word Vector
  2. CNN-LSTM RNN on Word Vector
  3. CNN on Word Vector
  4. Feedforward average Word Vector
  5. LSTM RNN dictionary timestamp average sequences
  6. LSTM RNN on Word Vector
  7. LSTM RNN Hinge on Word Vector
  8. LSTM RNN Huber on Word Vector
  9. LSTM RNN Stack Hinge + Huber + Cross Entropy on Word Vector
  10. Self-optimized using Bayesian Feedforward average Word Vector
  11. LSTM Attention RNN on Word Vector
  12. LSTM Seq-to-Seq Attention RNN on Word Vector
  13. LSTM Seq-to-Seq RNN on Word Vector
  14. Layer-Norm LSTM RNN on Word Vector
  15. Neural Turing Machine on Word Vector
  16. Only Attention Neural Network on Word Vector
  17. Multi Attention Neural Network on Word Vector
  18. K-max Conv1D on Word Vector

Assumptions

All deep learning models use the Word Vector generated by word-vector.ipynb.

All notebooks clean the text with a regex before vectorization: re.sub('[^A-Za-z0-9 ]+', '', string).
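
A small example of what this regex does to a sentence. The wrapper name `clean_text` is hypothetical; only the `re.sub` call comes from the notebooks.

```python
import re


def clean_text(string):
    # Delete every run of characters that are not ASCII letters, digits or spaces.
    return re.sub('[^A-Za-z0-9 ]+', '', string)


print(repr(clean_text("I'm so happy!!! :)")))  # 'Im so happy '
```

Note that punctuation is deleted rather than replaced by a space, so "don't" becomes "dont" and "well-known" becomes "wellknown". Case is left unchanged.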

All models use early stopping to prevent overfitting.
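
A minimal sketch of patience-based early stopping on validation accuracy. The stopping rule, the `patience` value and the `run_epoch` callback are assumptions for illustration; the exact criterion used in each notebook is not described here.

```python
def train_with_early_stopping(run_epoch, patience=5, max_epochs=1000):
    """Stop once validation accuracy has not improved for `patience` epochs.

    run_epoch(epoch) trains one epoch and returns the validation accuracy.
    Returns (best_accuracy, best_epoch, epoch_where_training_stopped).
    """
    best_acc, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        acc = run_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch  # new best, reset the wait
        elif epoch - best_epoch >= patience:
            return best_acc, best_epoch, epoch  # no improvement for too long
    return best_acc, best_epoch, max_epochs
```

In practice `run_epoch` would also save a checkpoint whenever a new best accuracy is reached, so the best model can be restored after stopping.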

The BOW and TFIDF features are assumed to be generated the same way in every notebook.

All models are trained on 80% of the dataset and validated on the remaining 20%.
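
The 80/20 split can be sketched in plain Python as follows. The function name, the fixed seed and the use of a simple shuffled (non-stratified) split are assumptions; the notebooks may rely on a library helper instead.

```python
import random


def train_validation_split(texts, labels, validation_ratio=0.2, seed=0):
    """Shuffle the indices once, then hold out the last 20% for validation."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_valid = round(len(indices) * validation_ratio)
    cut = len(indices) - n_valid
    train_idx, valid_idx = indices[:cut], indices[cut:]
    return ([texts[i] for i in train_idx], [labels[i] for i in train_idx],
            [texts[i] for i in valid_idx], [labels[i] for i in valid_idx])
```

Texts and labels are indexed with the same shuffled indices, so each document stays paired with its label.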

Some comparisons are not consistent; for example, recall and F1 are not calculated for the neural-network-based models.

Results

Oracle

Oracle classifier accuracy=0.9312446022454659 #cv1
Oracle classifier accuracy=0.9294453507340946 #cv2
Oracle classifier accuracy=0.9295859123842426 #cv3
Oracle classifier accuracy=0.9295859123842426 #cv4
Oracle classifier accuracy=0.9305438929008422 #cv5
Oracle classifier accuracy=0.9279287924953816 #cv6
Oracle classifier accuracy=0.9291523715841751 #cv7
Oracle classifier accuracy=0.9278516243581746 #cv8
Oracle classifier accuracy=0.9277316569892989 #cv9
Oracle classifier accuracy=0.9281395460434761 #cv10
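
The ten fold accuracies above can be summarized by their mean and sample standard deviation. The snippet only restates the numbers listed above; the variable names are illustrative.

```python
from statistics import mean, stdev

# Oracle accuracies for cv1 to cv10, copied from the list above.
fold_accuracy = [
    0.9312446022454659, 0.9294453507340946, 0.9295859123842426,
    0.9295859123842426, 0.9305438929008422, 0.9279287924953816,
    0.9291523715841751, 0.9278516243581746, 0.9277316569892989,
    0.9281395460434761,
]

summary = (mean(fold_accuracy), stdev(fold_accuracy))
print("mean accuracy: %.4f, std: %.4f" % summary)
```

The mean is roughly 0.929, and the spread across folds is small compared with the gaps between the vectorization techniques reported below.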

BOW / Unigram

  1. Naive Bayes
accuracy validation set:  0.859072479067
             precision    recall  f1-score   support

      anger       0.90      0.84      0.87     11464
       fear       0.84      0.81      0.82      9455
        joy       0.85      0.93      0.89     28246
       love       0.82      0.61      0.70      6920
    sadness       0.87      0.94      0.91     24263
   surprise       0.84      0.34      0.49      3014

avg / total       0.86      0.86      0.85     83362
  2. SVM Kernel based
accuracy validation set:  0.898586886111
             precision    recall  f1-score   support

      anger       0.91      0.88      0.90     11422
       fear       0.84      0.87      0.86      9495
        joy       0.90      0.94      0.92     28138
       love       0.84      0.74      0.79      6970
    sadness       0.93      0.94      0.94     24380
   surprise       0.85      0.65      0.73      2957

avg / total       0.90      0.90      0.90     83362
  3. XGB
accuracy validation set:  0.895132074566
             precision    recall  f1-score   support

      anger       0.88      0.92      0.90     11421
       fear       0.83      0.84      0.84      9505
        joy       0.93      0.91      0.92     28132
       love       0.76      0.79      0.78      6801
    sadness       0.95      0.94      0.94     24481
   surprise       0.70      0.72      0.71      3022

avg / total       0.90      0.90      0.90     83362
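
The "avg / total" row in these reports appears to be a support-weighted average, which hides how poorly the small classes do. As a check, the sketch below recomputes the weighted and the unweighted (macro) averages from the rounded per-class values of the BOW Naive Bayes report above. The variable names are illustrative.

```python
# (precision, recall, f1, support) per class, copied from the BOW Naive Bayes report.
report = {
    "anger":    (0.90, 0.84, 0.87, 11464),
    "fear":     (0.84, 0.81, 0.82, 9455),
    "joy":      (0.85, 0.93, 0.89, 28246),
    "love":     (0.82, 0.61, 0.70, 6920),
    "sadness":  (0.87, 0.94, 0.91, 24263),
    "surprise": (0.84, 0.34, 0.49, 3014),
}

total_support = sum(row[3] for row in report.values())


def weighted_average(column):
    """Average of one metric column, weighted by class support."""
    return sum(row[column] * row[3] for row in report.values()) / total_support


def macro_average(column):
    """Plain average of one metric column; every class counts equally."""
    return sum(row[column] for row in report.values()) / len(report)


weighted_recall = weighted_average(1)
weighted_f1 = weighted_average(2)
macro_recall = macro_average(1)
print(round(weighted_recall, 3), round(weighted_f1, 2), round(macro_recall, 3))
```

The weighted recall comes out at about 0.859, matching the reported validation accuracy, while the macro recall is about 0.745 because love and surprise are recalled far less often than joy and sadness.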

TFIDF

  1. Naive Bayes
accuracy validation set:  0.734855209808
             precision    recall  f1-score   support

      anger       0.93      0.54      0.69     11336
       fear       0.91      0.37      0.53      9603
        joy       0.68      0.98      0.80     28062
       love       0.96      0.16      0.27      7085
    sadness       0.74      0.94      0.83     24278
   surprise       0.94      0.04      0.08      2998

avg / total       0.79      0.73      0.69     83362
  2. SVM Kernel based
accuracy validation set:  0.850915285142
             precision    recall  f1-score   support

      anger       0.93      0.75      0.83     11542
       fear       0.88      0.73      0.79      9610
        joy       0.79      0.97      0.87     28110
       love       0.92      0.55      0.69      6883
    sadness       0.88      0.94      0.91     24230
   surprise       0.91      0.46      0.61      2987

avg / total       0.86      0.85      0.84     83362
  3. XGB
accuracy validation set:  0.885415417097
             precision    recall  f1-score   support

      anger       0.88      0.90      0.89     11414
       fear       0.81      0.83      0.82      9584
        joy       0.92      0.91      0.91     28269
       love       0.74      0.77      0.75      6878
    sadness       0.95      0.93      0.94     24196
   surprise       0.66      0.67      0.67      3021

avg / total       0.89      0.89      0.89     83362
  4. LGB
accuracy validation set:  0.902245627504
             precision    recall  f1-score   support

      anger       0.91      0.91      0.91     11550
       fear       0.84      0.89      0.86      9455
        joy       0.92      0.92      0.92     28299
       love       0.77      0.88      0.82      6910
    sadness       0.96      0.92      0.94     24111
   surprise       0.82      0.70      0.75      3037

avg / total       0.90      0.90      0.90     83362

Timestamp based on Dictionary position

  1. LGB average word length
             precision    recall  f1-score   support

      anger       0.53      0.22      0.31     11587
       fear       0.50      0.18      0.27      9504
        joy       0.46      0.73      0.57     28074
       love       0.30      0.08      0.13      6949
    sadness       0.49      0.57      0.53     24293
   surprise       0.26      0.09      0.13      2955

avg / total       0.46      0.47      0.43     83362
  2. RNN average word length
epoch: 74 , training loss:  1.56926864815 , train acc:  0.344564461275
epoch: 75 , training loss:  1.56929268564 , train acc:  0.345266404707
epoch: 76 , training loss:  1.56907495159 , train acc:  0.344702450399
epoch: 77 , training loss:  1.56922572671 , train acc:  0.344630455659
epoch: 78 , training loss:  1.56905308527 , train acc:  0.345173412831
epoch: 79 , training loss:  1.56924159172 , train acc:  0.34504442315
  3. Self-optimized Feed-forward Neural Network average word length
epoch: 1 , pass acc: 0.305403 , current acc: 0.335717
epoch: 2 , pass acc: 0.335717 , current acc: 0.343142
epoch: 3 , pass acc: 0.343142 , current acc: 0.345385
epoch: 4 , pass acc: 0.345385 , current acc: 0.345745
epoch: 5 , pass acc: 0.345745 , current acc: 0.346273
epoch: 8 , pass acc: 0.346273 , current acc: 0.347029
break epoch: 107
  4. XGB average word length
             precision    recall  f1-score   support

      anger       0.48      0.22      0.31     11390
       fear       0.46      0.20      0.28      9759
        joy       0.48      0.72      0.58     27981
       love       0.26      0.08      0.12      6838
    sadness       0.49      0.58      0.53     24395
   surprise       0.21      0.07      0.11      2999

avg / total       0.45      0.47      0.44     83362
  5. XGB 50 word length
             precision    recall  f1-score   support

      anger       0.48      0.21      0.30     11320
       fear       0.45      0.18      0.25      9658
        joy       0.47      0.72      0.57     28342
       love       0.27      0.08      0.12      6901
    sadness       0.48      0.57      0.52     24103
   surprise       0.22      0.07      0.11      3038

avg / total       0.45      0.47      0.43     83362

SVD / LSA

  1. XGB 50 dimensions
             precision    recall  f1-score   support

      anger       0.34      0.07      0.11     11336
       fear       0.30      0.06      0.10      9694
        joy       0.46      0.73      0.56     28068
       love       0.17      0.01      0.03      6987
    sadness       0.40      0.54      0.46     24277
   surprise       0.07      0.00      0.01      3000

avg / total       0.37      0.42      0.35     83362
  2. LGB 50 dimensions
             precision    recall  f1-score   support

      anger       0.38      0.05      0.09     11460
       fear       0.32      0.06      0.10      9545
        joy       0.44      0.73      0.55     28052
       love       0.17      0.01      0.02      7015
    sadness       0.39      0.54      0.45     24291
   surprise       0.09      0.01      0.01      2999

avg / total       0.37      0.42      0.34     83362

Word Vector

  1. Feed-forward word vector
epoch: 220 , training loss: 0.536347100064 , training acc: 0.88623161306 , valid loss: 0.796489044223 , valid acc: 0.808128580994
epoch: 221 , training loss: 0.535965369 , training acc: 0.886359573154 , valid loss: 0.796616834379 , valid acc: 0.808147773232
epoch: 222 , training loss: 0.535587115036 , training acc: 0.886493929708 , valid loss: 0.796744917725 , valid acc: 0.808205356811
epoch: 223 , training loss: 0.535212274276 , training acc: 0.886564308321 , valid loss: 0.796873355467 , valid acc: 0.808291729718
break epoch: 223
  2. Recurrent Neural Network LSTM
epoch: 43 , training loss: 0.293494431497 , training acc: 0.91647871451 , valid loss: 0.301500787934 , valid acc: 0.911824742285
'unwarrentedly'
time taken: 201.13964009284973
epoch: 44 , training loss: 0.292884760499 , training acc: 0.916391732865 , valid loss: 0.316479788662 , valid acc: 0.908499411174
'unwarrentedly'
time taken: 201.1136453151703
epoch: 45 , training loss: 0.292009347957 , training acc: 0.916490714602 , valid loss: 0.301596783737 , valid acc: 0.910792332308
  3. Convolutional Neural Network
epoch: 45 , training loss: 0.406819455151 , training acc: 0.900872836373 , valid loss: 0.413255342881 , valid acc: 0.896866754467
epoch: 46 , training loss: 0.406384702676 , training acc: 0.900506910515 , valid loss: 0.412755171971 , valid acc: 0.896986806521
epoch: 47 , training loss: 0.406851520408 , training acc: 0.901097791847 , valid loss: 0.414064356509 , valid acc: 0.895942385028
  4. CNN + RNN
epoch: 11 , training loss: 0.89633845398 , training acc: 0.742852410318 , valid loss: 1.31553278515 , valid acc: 0.62818725459
'unwarrentedly'
epoch: 12 , training loss: 0.870958457021 , training acc: 0.751430695276 , valid loss: 1.35667084581 , valid acc: 0.625294097582
'unwarrentedly'
epoch: 13 , training loss: 0.867803699844 , training acc: 0.751742633378 , valid loss: 1.37888059403 , valid acc: 0.626062403385
break epoch: 13
  5. Bidirectional Recurrent Neural Network LSTM
epoch: 31 , training loss: 0.306716536762 , training acc: 0.915449927471 , valid loss: 0.305131908171 , valid acc: 0.911728706609
'unwarrentedly'
time taken: 633.0987968444824
epoch: 32 , training loss: 0.30510045163 , training acc: 0.915593898146 , valid loss: 0.306183017638 , valid acc: 0.911968804136
'unwarrentedly'
time taken: 633.1091139316559
epoch: 33 , training loss: 0.304126529089 , training acc: 0.915686878639 , valid loss: 0.305108016744 , valid acc: 0.911548635682
  6. LSTM + RNN + Huber
epoch: 13 , training loss: 0.00908494556632 , training acc: 0.921331746462 , valid loss: 0.00935218644235 , valid acc: 0.916014416235
'unwarrentedly'
time taken: 192.9246437549591
epoch: 14 , training loss: 0.00894308844975 , training acc: 0.921859639933 , valid loss: 0.00925432719706 , valid acc: 0.915174080902
'unwarrentedly'
epoch: 14 , pass acc: 0.916014416235 , current acc: 0.91602642217

  7. LSTM + RNN + Hinge

epoch: 22 , training loss: 0.0533290837786 , training acc: 0.919721068543 , valid loss: 0.0562712795169 , valid acc: 0.913313335445
epoch: 22 , pass acc: 0.913505411806 , current acc: 0.913613458785
time taken: 192.64984250068665
epoch: 23 , training loss: 0.0531139997573 , training acc: 0.920024007827 , valid loss: 0.0563346863817 , valid acc: 0.913613458785
time taken: 192.55219149589539

Hashing Vectorization

  1. Naive Bayes
accuracy validation set:  0.578524987404
             precision    recall  f1-score   support

      anger       0.93      0.07      0.12     11449
       fear       0.96      0.03      0.05      9533
        joy       0.49      1.00      0.66     28047
       love       1.00      0.00      0.01      6967
    sadness       0.76      0.79      0.78     24408
   surprise       0.00      0.00      0.00      2958

avg / total       0.71      0.58      0.47     83362
  2. SVM Kernel based
accuracy validation set:  0.791163839639
             precision    recall  f1-score   support

      anger       0.92      0.64      0.76     11592
       fear       0.87      0.59      0.70      9557
        joy       0.71      0.97      0.82     28068
       love       0.94      0.40      0.56      6933
    sadness       0.83      0.90      0.87     24273
   surprise       0.91      0.34      0.49      2939

avg / total       0.82      0.79      0.78     83362