My attempt at the Kaggle Toxic Comment Classification Challenge
I built a model that estimates the probability of a comment belonging to each of the six toxicity classes. I used XGBoost after generating feature vectors from GloVe and Google News Word2Vec embeddings.
The model achieved an overall AUC of 0.82.
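The README does not spell out how the word vectors are combined into a single feature vector per comment; a common choice is to average the embeddings of a comment's tokens. The sketch below assumes that approach, and `comment_to_vector` / `embeddings` are illustrative names rather than identifiers from this repo.

```python
# Minimal sketch: turn a comment into a fixed-length feature vector by
# averaging pretrained word embeddings. The averaging step is an assumption,
# not necessarily what this repo does; `embeddings` is any mapping from
# token -> 300-d numpy vector (GloVe or Google News Word2Vec).
import numpy as np

def comment_to_vector(comment, embeddings, dim=300):
    """Average the embeddings of all known tokens in a comment."""
    tokens = comment.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)            # no known words -> zero vector
    return np.mean(vectors, axis=0)     # element-wise mean over tokens

# Example (placeholder data):
# X = np.vstack([comment_to_vector(c, embeddings) for c in train_comments])
```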
Resources needed:
- Download the data from the Kaggle competition page here
- Download the GloVe word vectors here; choose the 840B-token, 300d model
- Download the Google News Word2Vec vectors here (a sketch of loading both embedding files follows this list)
- To use the Keras model built in `example_to_clarify.py`, you need to download the 20 Newsgroups dataset
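Once the two embedding files are downloaded, they can be loaded roughly as below. This is a hedged sketch: the file names are the standard distribution names and are assumptions about where you saved them, not paths from this repo.

```python
# Load the pretrained embeddings. File names are assumptions (the standard
# download names), not paths used by this repository.
import numpy as np
from gensim.models import KeyedVectors

# Google News vectors ship in binary word2vec format.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# GloVe ships as plain text: one token followed by 300 floats per line.
glove = {}
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word = " ".join(parts[:-300])   # some 840B tokens contain spaces
        glove[word] = np.asarray(parts[-300:], dtype="float32")
```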
Note:
The `final_try.py` file is an implementation of the XGBoost algorithm on the same data
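As a rough illustration of that step, the sketch below trains one binary XGBoost classifier per toxicity label and reports the mean column-wise AUC (the competition metric). The random arrays only stand in for the real comment feature vectors and label columns of the Kaggle training data; nothing here is taken from `final_try.py`.

```python
# Hedged sketch: one-vs-rest XGBoost over the six toxicity labels,
# scored with mean column-wise AUC. Random data stands in for the real
# comment feature vectors and the label columns of train.csv.
import numpy as np
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(800, 300)), rng.normal(size=(200, 300))  # stand-in features
y_train = {l: rng.integers(0, 2, size=800) for l in LABELS}                # stand-in labels
y_val = {l: rng.integers(0, 2, size=200) for l in LABELS}

aucs = []
for label in LABELS:
    clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    clf.fit(X_train, y_train[label])
    probs = clf.predict_proba(X_val)[:, 1]          # P(comment has this label)
    aucs.append(roc_auc_score(y_val[label], probs))

print("mean column-wise AUC:", np.mean(aucs))
```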
To Do:
- You can definitely do much more hyperparameter optimization, especially for the LSTM model. For example, try playing around with `max_features`, `max_len`, the `Dropout` rate, the size of the `Dense` layer, etc. (see the sketch after this list)
- You can try different feature engineering and normalization techniques for the text data
- In general, try playing around with parameters like `batch_size`, `num_epochs`, and `learning_rate`
- Try different optimizers, such as `Adagrad`, `Adadelta`, or `SGD`
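For orientation, here is a minimal sketch of where those knobs sit in a Keras LSTM text classifier. The values are illustrative defaults, not the ones used in `example_to_clarify.py`, and `texts` / `labels` are placeholders for the comments and their six 0/1 toxicity labels.

```python
# Hedged sketch of a Keras LSTM classifier, showing where the hyperparameters
# listed above appear. All values and placeholder data are illustrative.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC

max_features = 20000      # vocabulary size kept by the tokenizer
max_len = 100             # comments padded/truncated to this many tokens
dropout_rate = 0.2        # the Dropout rate mentioned above
dense_size = 64           # size of the Dense layer
batch_size = 32
num_epochs = 2
learning_rate = 1e-3

texts = ["you are great", "you are awful"] * 50          # placeholder comments
labels = np.random.randint(0, 2, size=(len(texts), 6))   # placeholder 6-label targets

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

model = Sequential([
    Embedding(max_features, 128),
    LSTM(64),
    Dropout(dropout_rate),
    Dense(dense_size, activation="relu"),
    Dense(6, activation="sigmoid"),   # one sigmoid output per toxicity class
])
# Swap Adam for Adagrad, Adadelta or SGD to try the optimizers listed above.
model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss="binary_crossentropy", metrics=[AUC()])
model.fit(X, labels, batch_size=batch_size, epochs=num_epochs, validation_split=0.1)
```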