A from-scratch implementation of Word2Vec using the Skip-Gram architecture with Negative Sampling, trained on the CoNLL-2003 Named Entity Recognition dataset.
This project implements the Word2Vec algorithm described in Mikolov et al.'s papers, featuring:
- Skip-Gram Architecture: Predicts context words from center words
- Negative Sampling: Efficient training alternative to hierarchical softmax
- Custom Implementation: Built from scratch using NumPy (no deep learning frameworks)
- Interactive Visualizations: 2D/3D t-SNE and UMAP plots with Plotly
- Comprehensive Evaluation: Word similarity, analogies, and semantic clustering
- Data preprocessing and vocabulary building with frequency filtering
- Efficient negative sampling using power-law distribution (f^0.75)
- Analytical gradient computation for the Skip-Gram objective
- Learning rate decay and gradient clipping
- Checkpoint saving/loading for resumable training
- Word similarity and analogy tasks
- Interactive 2D/3D visualizations (t-SNE, UMAP)
- Heatmap analysis of semantic relationships
```bash
pip install numpy matplotlib datasets tqdm scikit-learn plotly umap-learn seaborn pandas
```

The model is trained on the CoNLL-2003 dataset (Named Entity Recognition):
- Training: 14,041 sentences
- Validation: 3,250 sentences
- Test: 3,453 sentences
- Vocabulary: ~21,000 unique tokens (after preprocessing)
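If you want to inspect the raw corpus outside the notebook, here is a minimal sketch of loading the tokenized sentences with the Hugging Face `datasets` library; the `conll2003` hub id and the `tokens` field are assumptions about the loader, not code from the notebook:

```python
from datasets import load_dataset

# Download CoNLL-2003 (hub id assumed) and pull out the token lists
dataset = load_dataset("conll2003")

# Lowercase tokens so they match the notebook's lowercased vocabulary keys (e.g. 'germany')
train_sentences = [[tok.lower() for tok in ex["tokens"]] for ex in dataset["train"]]

print(len(train_sentences))  # expected: 14,041 training sentences
```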
Open word2vec_final.ipynb and run all cells. Key hyperparameters:
```python
EMBEDDING_DIM = 200     # Embedding vector size
EPOCHS = 15             # Number of training epochs
WINDOW_SIZE = 5         # Context window size
NEG_SAMPLES = 10        # Negative samples per positive
LEARNING_RATE = 0.025   # Initial learning rate
MIN_COUNT = 0           # Minimum word frequency threshold
```

Load the trained embeddings and vocabulary mappings from the final checkpoint:

```python
import pickle

# Load embeddings
with open('checkpoints/embeddings/final_word_embeddings.pkl', 'rb') as f:
    data = pickle.load(f)

embeddings = data['embeddings']
word2idx = data['word2idx']
idx2word = data['idx2word']

# Get word vector
word_idx = word2idx['germany']
word_vec = embeddings[word_idx]
```
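As a quick sanity check, nearest neighbours can also be computed directly from the pickled arrays with plain NumPy. This is an illustrative sketch (not the notebook's `find_most_similar` implementation), assuming the `embeddings`, `word2idx`, and `idx2word` objects loaded above:

```python
import numpy as np

def nearest_neighbours(query, k=5):
    """Rank vocabulary words by cosine similarity to the query word."""
    v = embeddings[word2idx[query]]
    # Cosine similarity between the query vector and every embedding row
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-10)
    order = np.argsort(-sims)
    return [(idx2word[i], float(sims[i])) for i in order if idx2word[i] != query][:k]

print(nearest_neighbours('germany'))
```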
Find the most similar words to a query with the notebook's helper:

```python
from word2vec_final import find_most_similar

# Find words similar to "germany"
find_most_similar(model, vocab, 'germany', top_k=5)

# Output:
# Most similar words to 'germany':
# austria: 0.7234
# switzerland: 0.6891
# italy: 0.6723
# spain: 0.6512
# france: 0.6289
```

Solve word analogies:

```python
from word2vec_final import word_analogy

# Solve: France is to Paris as Germany is to ?
word_analogy(model, vocab, 'France', 'Paris', 'Germany', top_k=5)
```
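For reference, the same analogy can be solved by plain vector arithmetic over the pickled embeddings (the classic b - a + c formulation from Mikolov et al.). This is a hedged sketch, assuming the `embeddings`/`word2idx`/`idx2word` objects loaded earlier and lowercased vocabulary keys:

```python
import numpy as np

def analogy(a, b, c, k=5):
    """Return words w maximizing cosine(w, b - a + c), e.g. 'paris' - 'france' + 'germany'."""
    target = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    sims = embeddings @ target / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target) + 1e-10
    )
    order = np.argsort(-sims)
    # Exclude the three input words from the candidates
    return [idx2word[i] for i in order if idx2word[i] not in (a, b, c)][:k]

print(analogy('france', 'paris', 'germany'))  # ideally 'berlin' ranks near the top
```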
The Skip-Gram forward pass:

```
Input Layer: [vocab_size x embedding_dim] (W_in)
        ↓
Center Word: one-hot encoded word
        ↓
Embedding Lookup: v_c = W_in[center_idx]
        ↓
Context Prediction: Dot products with W_out
        ↓
Loss Function: Binary cross-entropy
L = -log(σ(u_o·v_c)) - Σ log(σ(-u_neg·v_c))
```
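A minimal NumPy sketch of this forward pass and loss for a single (center, context) pair with negative samples; it mirrors the formula above but is an illustrative rewrite, not the notebook's training code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(W_in, W_out, center_idx, context_idx, neg_indices):
    """L = -log(sigmoid(u_o . v_c)) - sum(log(sigmoid(-u_neg . v_c)))"""
    v_c = W_in[center_idx]      # center word embedding
    u_o = W_out[context_idx]    # positive context embedding
    u_neg = W_out[neg_indices]  # (k, dim) negative sample embeddings

    pos_score = u_o @ v_c       # scalar
    neg_scores = u_neg @ v_c    # (k,)

    eps = 1e-10                 # log-probability smoothing, as described below
    return -np.log(sigmoid(pos_score) + eps) - np.sum(np.log(sigmoid(-neg_scores) + eps))
```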
The objective maximizes:

$$\log \sigma(\mathbf{u}_o^\top \mathbf{v}_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{u}_{w_i}^\top \mathbf{v}_c) \right]$$

Where:
- $\mathbf{v}_c$: center word embedding
- $\mathbf{u}_o$: positive context word embedding
- $\mathbf{u}_{w_i}$: negative sample embeddings
- $\sigma$: sigmoid function
- $P_n(w) \propto f(w)^{0.75}$: negative sampling distribution
The model converges after ~10-15 epochs with learning rate decay:
- Initial Loss: ~5.8
- Final Loss: ~2.1
- Training Time: ~45-60 minutes (15 epochs on CPU)
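The exact decay schedule lives in the notebook; a typical linear decay to a small floor looks like the sketch below (the 0.0001 floor is an assumption borrowed from the original word2vec tool, not a value from this repo):

```python
LEARNING_RATE = 0.025  # initial learning rate (matches the hyperparameters above)
MIN_LR = 0.0001        # assumed floor, not taken from the notebook
EPOCHS = 15

def lr_at(epoch):
    """Linearly decay the learning rate over the training run."""
    return max(MIN_LR, LEARNING_RATE * (1.0 - epoch / EPOCHS))

print([round(lr_at(e), 4) for e in range(EPOCHS)])
```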
Word Similarity Examples:
| Query Word | Top Similar Words |
|---|---|
| germany | austria, switzerland, italy, spain, france |
| president | minister, prime, government, leader |
| soccer | football, premier, league, championship |
The notebook includes:
- 2D t-SNE Plots: Static matplotlib visualizations
- Interactive 2D Plots: Plotly scatter with hover tooltips
- 3D t-SNE Plots: Rotatable 3D word space exploration
- UMAP Projections: Alternative dimensionality reduction
- Similarity Heatmaps: Seaborn heatmaps of word relationships
Semantic clusters emerge: countries group together, sports terms cluster, political terms align.
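The plotting code is in the notebook; here is a minimal sketch of the static 2D t-SNE view using scikit-learn and matplotlib (the subset size, perplexity, and labeling stride are illustrative choices, not the notebook's settings):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project a subset of word vectors to 2D (subset size is illustrative)
subset = embeddings[:300]
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(subset)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=8)
for i in range(0, 300, 10):  # label every 10th point to keep the plot readable
    plt.annotate(idx2word[i], (coords[i, 0], coords[i, 1]), fontsize=8)
plt.title("2D t-SNE of learned word embeddings")
plt.show()
```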
Words are sampled with probability proportional to $f(w)^{0.75}$, where $f(w)$ is the word's corpus frequency:

```python
word_freq = np.array([word_counts[i] for i in range(vocab_size)])
word_freq = word_freq ** 0.75
neg_sampling_dist = word_freq / word_freq.sum()
```
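Given that distribution, negatives can be drawn per positive pair, for example with `np.random.choice`; this is a sketch, not necessarily the notebook's exact sampling loop:

```python
import numpy as np

NEG_SAMPLES = 10  # matches the hyperparameter above

def draw_negatives(positive_idx):
    """Sample NEG_SAMPLES word ids from the f(w)^0.75 distribution, avoiding the positive."""
    neg = np.random.choice(len(neg_sampling_dist), size=NEG_SAMPLES, p=neg_sampling_dist)
    # Resample any draw that collides with the true context word
    while np.any(neg == positive_idx):
        n_bad = int(np.sum(neg == positive_idx))
        neg[neg == positive_idx] = np.random.choice(
            len(neg_sampling_dist), size=n_bad, p=neg_sampling_dist
        )
    return neg
```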
Analytical gradients for efficient training (a runnable NumPy sketch follows the stability notes below):

```
# Center word gradient
center_grad = (σ(u_o·v_c) - 1) * u_o + Σ σ(u_neg·v_c) * u_neg

# Positive context gradient
pos_grad = (σ(u_o·v_c) - 1) * v_c

# Negative context gradients
neg_grad = σ(u_neg·v_c) * v_c
```

Numerical stability measures:
- Score clipping: `[-10, 10]` before sigmoid
- Gradient clipping: `[-5, 5]` max norm
- Log-probability smoothing: `log(x + 1e-10)`
- Xavier initialization for input embeddings
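A hedged NumPy sketch of one SGD update that combines the gradients and clipping above; the names and the element-wise clipping style are illustrative, not lifted from the notebook:

```python
import numpy as np

def sigmoid(x):
    # Score clipping to [-10, 10] before the sigmoid, as described above
    return 1.0 / (1.0 + np.exp(-np.clip(x, -10, 10)))

def sgd_step(W_in, W_out, center_idx, context_idx, neg_indices, lr):
    """One Skip-Gram Negative Sampling update for a single training pair (in place)."""
    v_c = W_in[center_idx]
    u_o = W_out[context_idx]
    u_neg = W_out[neg_indices]                      # (k, dim)

    pos_err = sigmoid(u_o @ v_c) - 1.0              # scalar
    neg_err = sigmoid(u_neg @ v_c)                  # (k,)

    center_grad = pos_err * u_o + neg_err @ u_neg   # (dim,)
    pos_grad = pos_err * v_c                        # (dim,)
    neg_grad = np.outer(neg_err, v_c)               # (k, dim)

    clip = lambda g: np.clip(g, -5, 5)              # gradient clipping to [-5, 5]
    W_in[center_idx] -= lr * clip(center_grad)
    W_out[context_idx] -= lr * clip(pos_grad)
    W_out[neg_indices] -= lr * clip(neg_grad)
```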
The model saves checkpoints after each epoch:
```
checkpoints/
├── checkpoint_epoch_1.pkl
├── checkpoint_epoch_2.pkl
...
└── checkpoint_epoch_15.pkl
```
Each checkpoint contains:
- Input embeddings (W_in)
- Output embeddings (W_out)
- Vocabulary mappings
- Loss history
- Hyperparameters
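Writing such a checkpoint is a straightforward pickle dump; this sketch uses illustrative dictionary keys (read_ckpt.ipynb in the repo inspects the actual layout):

```python
import pickle

def save_checkpoint(path, W_in, W_out, word2idx, idx2word, loss_history, hyperparams):
    """Serialize one epoch's state; key names are assumed, not taken from the notebook."""
    state = {
        'W_in': W_in,                  # input embeddings
        'W_out': W_out,                # output embeddings
        'word2idx': word2idx,          # vocabulary mappings
        'idx2word': idx2word,
        'loss_history': loss_history,
        'hyperparams': hyperparams,
    }
    with open(path, 'wb') as f:
        pickle.dump(state, f)
```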
The repository includes additional experiments:
- HMM.ipynb: Hidden Markov Models for sequence labeling
- word2vec_kaggle.ipynb: Kaggle-optimized version
- read_ckpt.ipynb: Utilities for inspecting checkpoints
- Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space" (arXiv:1301.3781)
- Mikolov, T., et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality" (arXiv:1310.4546)
- Goldberg, Y., & Levy, O. (2014). "word2vec Explained" (arXiv:1402.3722)
