- Implementation of Word2Vec using Continuous Bag of Words (CBOW)
- Intrinsic evaluation using the analogy test set from "Efficient Estimation of Word Representations in Vector Space"
- Implementation of Word2Vec using Skip-gram
- t-SNE visualization of the embeddings of analogy pairs
- Nearest-neighbors analysis for finding similar words
- TODO: filter to the N most common words in the training corpus and mark the rest as OOV
- TODO: download a larger dataset (the GloVe paper uses Gigaword5, Wikipedia 2014, and Common Crawl)
- TODO: Train GloVe embeddings
- TODO: increase the embedding (context vector) size to 300, depending on training speed
- TODO: evaluate on the WordSim-353 word similarity task used in the GloVe paper
- TODO: Extrinsic model evaluation (NER)
- TODO: Write unit tests for model training and inference on small data
- Developed with Python 3.9, but it should work on other Python 3 versions
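For reference, the CBOW objective listed above can be sketched in a few lines of NumPy: average the context embeddings, score every vocabulary word with a softmax, and take one SGD step toward the true center word. This is a minimal illustration only, not the code in `src/` — all names here (`cbow_step`, `W_in`, `W_out`) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                    # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))       # input (context) embeddings
W_out = rng.normal(0, 0.1, (D, V))      # output (center-word) weights

def cbow_step(W_in, W_out, context_ids, center_id, lr=0.1):
    """One SGD step on the softmax cross-entropy loss; mutates W_in/W_out."""
    h = W_in[context_ids].mean(axis=0)  # average the context vectors
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                # softmax over the vocabulary
    loss = -np.log(probs[center_id])
    # Backpropagate: d(loss)/d(scores) = probs - one_hot(center)
    d_scores = probs.copy()
    d_scores[center_id] -= 1.0
    d_h = W_out @ d_scores
    W_out -= lr * np.outer(h, d_scores)
    W_in[context_ids] -= lr * d_h / len(context_ids)
    return loss
```

Repeated calls on the same (context, center) pair drive the loss down; full training in word2vec also uses negative sampling or hierarchical softmax rather than this full softmax.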
```shell
cd deep-learning-skunk-works/
export PYTHONPATH=$(pwd)
export PROJECT_ROOT=$(pwd)
pip install -r requirements.txt
```
- Set model name (e.g. cbow, skipgram, ...)
```shell
export MODEL='cbow'
```
- Launch tensorboard
```shell
tensorboard --logdir=data/$MODEL/models/
```
- Train model
```shell
python src/main.py --train --model $MODEL
```
- Evaluate model
```shell
python src/main.py --eval --model $MODEL
```
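The intrinsic evaluation and the nearest-neighbors analysis both reduce to cosine similarity in embedding space; analogies are typically solved with the 3CosAdd rule (vec(b) - vec(a) + vec(c), excluding the query words). A minimal sketch of both, assuming a plain embedding matrix plus a word-to-index dict — the function names are hypothetical, not this repo's API:

```python
import numpy as np

def nearest(emb, word_to_id, query_vec, exclude=(), k=3):
    """Return the k words whose embeddings are most cosine-similar to query_vec."""
    sims = (emb @ query_vec) / (
        np.linalg.norm(emb, axis=1) * np.linalg.norm(query_vec)
    )
    id_to_word = {i: w for w, i in word_to_id.items()}
    ranked = [id_to_word[i] for i in np.argsort(-sims)
              if id_to_word[i] not in exclude]
    return ranked[:k]

def analogy(emb, word_to_id, a, b, c):
    """Solve a : b :: c : ? with 3CosAdd, skipping the three query words."""
    v = emb[word_to_id[b]] - emb[word_to_id[a]] + emb[word_to_id[c]]
    return nearest(emb, word_to_id, v, exclude={a, b, c}, k=1)[0]
```

On the analogy test set, accuracy is simply the fraction of quadruples (a, b, c, d) for which `analogy(emb, word_to_id, a, b, c) == d`.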