This repository contains an implementation of Yoon Kim's paper "Convolutional Neural Networks for Sentence Classification" (http://www.people.fas.harvard.edu/~yoonkim/data/sent-cnn.pdf).
We also compare the performance of the CNN with a simple bag-of-words model.
The classification model has been trained on a topic classification dataset in the following format.
TOPIC ||| TEXT
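Each line pairs a topic label with the raw sentence, separated by ` ||| `. A minimal reader for this format might look like the sketch below; the `read_dataset` name and the `to_lower` handling are illustrative assumptions, not necessarily the code used in `nn/train.py`.

```python
def read_dataset(path, to_lower=False):
    """Read a topic classification file where each line is 'TOPIC ||| TEXT'."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split only on the first separator so the text may contain '|||'-free pipes.
            topic, text = line.split(" ||| ", 1)
            if to_lower:
                text = text.lower()
            examples.append((topic, text.split()))
    return examples
```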
Use the following commands to recreate the different training scenarios. The train.py script is located in the nn folder.
python train.py \
--init random \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./random_upper
python train.py \
--init random \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./random_lower \
--to_lower
python train.py \
--init word2vec \
--pretrained_file ../word2vec_upper.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./word2vec_upper_frozen \
--freeze_embeddings
python train.py \
--init word2vec \
--pretrained_file ../word2vec_upper.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./word2vec_upper_nofreeze
python train.py \
--init word2vec \
--pretrained_file ../word2vec_lower.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./word2vec_lower_frozen \
--freeze_embeddings \
--to_lower
python train.py \
--init word2vec \
--pretrained_file ../word2vec_lower.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./word2vec_lower_nofreeze \
--to_lower
python train.py \
--init word2vec \
--pretrained_file ../word2vec_upper.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./word2vec_upper_cbow \
--cbow
python train.py \
--init word2vec \
--pretrained_file ../word2vec_lower.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32 \
--output_path ./word2vec_lower_cbow \
--to_lower --cbow
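The flags in the commands above map onto a straightforward argparse interface. The following is a rough sketch of what that interface could look like; the defaults, help strings, and the assumption that `--cbow` switches to the bag-of-words baseline are inferred from the command names and the comparison mentioned earlier, not taken verbatim from the actual train.py.

```python
import argparse

def build_arg_parser():
    # Flags mirror the commands above; defaults and types here are assumptions,
    # not necessarily the ones used in nn/train.py.
    parser = argparse.ArgumentParser(description="Train a CNN or bag-of-words sentence classifier.")
    parser.add_argument("--init", choices=["random", "word2vec"], default="random",
                        help="How to initialise the embedding layer.")
    parser.add_argument("--pretrained_file", default=None,
                        help="JSON file with pretrained word2vec vectors (used with --init word2vec).")
    parser.add_argument("--train", required=True, help="Path to the training file (TOPIC ||| TEXT).")
    parser.add_argument("--valid", required=True, help="Path to the validation file.")
    parser.add_argument("--test", required=True, help="Path to the test file.")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--output_path", default="./output",
                        help="Directory where checkpoints and predictions are written.")
    parser.add_argument("--freeze_embeddings", action="store_true",
                        help="Keep the pretrained embeddings fixed during training.")
    parser.add_argument("--to_lower", action="store_true",
                        help="Lowercase the text before building the vocabulary.")
    parser.add_argument("--cbow", action="store_true",
                        help="Use the bag-of-words baseline instead of the CNN.")
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
```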
word2vec_lower.json and word2vec_upper.json are pretrained word2vec embeddings, extracted from the word2vec vectors trained on the Google News dataset. They can be downloaded from the following links:
word2vec_lower.json - https://drive.google.com/file/d/1YpCbB0GoLf5iK_Pprr5xsM04atkxO3IT/view?usp=sharing
word2vec_upper.json - https://drive.google.com/file/d/1H_9C3guLZZIaJG_2XRNkw_hs2OCwpT9v/view?usp=sharing
These files were created using the word2vec_extract_pretrained_embeddings function available in utils.py.
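For reference, such an extraction step could be implemented roughly as below: load the GoogleNews binary with gensim, keep only the words that appear in the corpus vocabulary, and write them out as a word-to-vector JSON map. The signature, the use of gensim 4.x, and the `to_lower` option are assumptions for illustration; the actual function in utils.py may differ.

```python
import json
from gensim.models import KeyedVectors  # assumes gensim 4.x is installed

def word2vec_extract_pretrained_embeddings(bin_path, vocab, out_path, to_lower=False):
    """Dump GoogleNews word2vec vectors for the words in `vocab` to a JSON file.

    Sketch of what the helper in utils.py likely does; the real signature may differ.
    """
    kv = KeyedVectors.load_word2vec_format(bin_path, binary=True)
    embeddings = {}
    for word in vocab:
        key = word.lower() if to_lower else word
        if key in kv.key_to_index:
            embeddings[key] = kv[key].tolist()  # numpy array -> JSON-serialisable list
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)
```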