
Word2Vec: Skip-Gram with Negative Sampling (SGNS)

A from-scratch implementation of Word2Vec using the Skip-Gram architecture with Negative Sampling, trained on the CoNLL-2003 Named Entity Recognition dataset.

📋 Overview

This project implements the Word2Vec algorithm described in Mikolov et al.'s papers, featuring:

  • Skip-Gram Architecture: Predicts context words from center words
  • Negative Sampling: Efficient training alternative to hierarchical softmax
  • Custom Implementation: Built from scratch using NumPy (no deep learning frameworks)
  • Interactive Visualizations: 2D/3D t-SNE and UMAP plots with Plotly
  • Comprehensive Evaluation: Word similarity, analogies, and semantic clustering

🚀 Features

  • ✅ Data preprocessing and vocabulary building with frequency filtering
  • ✅ Efficient negative sampling using power-law distribution (f^0.75)
  • ✅ Analytical gradient computation for Skip-Gram objective
  • ✅ Learning rate decay and gradient clipping
  • ✅ Checkpoint saving/loading for resumable training
  • ✅ Word similarity and analogy tasks
  • ✅ Interactive 2D/3D visualizations (t-SNE, UMAP)
  • ✅ Heatmap analysis of semantic relationships

🔧 Installation

Requirements

pip install numpy matplotlib datasets tqdm scikit-learn plotly umap-learn seaborn pandas

Dataset

The model is trained on the CoNLL-2003 dataset (Named Entity Recognition):

  • Training: 14,041 sentences
  • Validation: 3,250 sentences
  • Test: 3,453 sentences
  • Vocabulary: ~21,000 unique tokens (after preprocessing)
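
The splits can be loaded with the datasets package from the requirements above; a minimal sketch (the Hugging Face hub id conll2003 and the lowercasing step are assumptions, the notebook may preprocess differently):

from datasets import load_dataset

# Assumed hub id; the notebook may obtain the data another way.
conll = load_dataset("conll2003")

# Lowercase tokens before building the vocabulary (assumed preprocessing).
train_sentences = [[tok.lower() for tok in ex["tokens"]] for ex in conll["train"]]
print(len(train_sentences))  # 14041 training sentences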

🎯 Usage

Training a New Model

Open word2vec_final.ipynb and run all cells. Key hyperparameters:

EMBEDDING_DIM = 200      # Embedding vector size
EPOCHS = 15              # Number of training epochs
WINDOW_SIZE = 5          # Context window size
NEG_SAMPLES = 10         # Negative samples per positive
LEARNING_RATE = 0.025    # Initial learning rate
MIN_COUNT = 0            # Minimum word frequency threshold
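
For context, the window size determines how each sentence (as a list of vocabulary indices) is turned into (center, context) training pairs; a rough sketch of that step (generate_pairs is an illustrative helper, not necessarily the notebook's):

def generate_pairs(token_ids, window_size=WINDOW_SIZE):
    # Pair each center word with every neighbor inside the context window.
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs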

Loading Pretrained Embeddings

import pickle

# Load embeddings
with open('checkpoints/embeddings/final_word_embeddings.pkl', 'rb') as f:
    data = pickle.load(f)

embeddings = data['embeddings']
word2idx = data['word2idx']
idx2word = data['idx2word']

# Get word vector
word_idx = word2idx['germany']
word_vec = embeddings[word_idx]

Word Similarity

from word2vec_final import find_most_similar

# Find words similar to "germany"
find_most_similar(model, vocab, 'germany', top_k=5)

# Output:
# Most similar words to 'germany':
#   austria: 0.7234
#   switzerland: 0.6891
#   italy: 0.6723
#   spain: 0.6512
#   france: 0.6289
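
Under the hood this is a cosine-similarity lookup against the embedding matrix; a rough equivalent using the embeddings, word2idx, and idx2word loaded above (not the notebook's exact implementation):

import numpy as np

def most_similar(word, embeddings, word2idx, idx2word, top_k=5):
    # Cosine similarity between the query vector and every row of the embedding matrix.
    v = embeddings[word2idx[word]]
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-10)
    ranked = np.argsort(-sims)
    return [(idx2word[i], float(sims[i])) for i in ranked if idx2word[i] != word][:top_k]

most_similar('germany', embeddings, word2idx, idx2word)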

Word Analogies

from word2vec_final import word_analogy

# Solve: France is to Paris as Germany is to ?
word_analogy(model, vocab, 'france', 'paris', 'germany', top_k=5)
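
This presumably uses the standard vector-offset approach: rank words by similarity to v(paris) - v(france) + v(germany). A sketch of the idea (the helper below is illustrative, not the notebook's word_analogy):

import numpy as np

def analogy(a, b, c, embeddings, word2idx, idx2word, top_k=5):
    # "a is to b as c is to ?"  ->  rank words by cosine similarity to (b - a + c).
    target = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    sims = embeddings @ target / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target) + 1e-10)
    ranked = np.argsort(-sims)
    return [idx2word[i] for i in ranked if idx2word[i] not in (a, b, c)][:top_k]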

📊 Model Architecture

Skip-Gram with Negative Sampling

Input Layer:        [vocab_size x embedding_dim]  (W_in)
                              ↓
Center Word:        one-hot encoded word
                              ↓
Embedding Lookup:   v_c = W_in[center_idx]
                              ↓
Context Prediction: Dot products with W_out
                              ↓
Loss Function:      Binary cross-entropy
                    L = -log(σ(u_o·v_c)) - Σ log(σ(-u_neg·v_c))

Loss Function

The training objective to be maximized is:

$$\log \sigma(\mathbf{u}_o \cdot \mathbf{v}_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} [\log \sigma(-\mathbf{u}_{w_i} \cdot \mathbf{v}_c)]$$

Where:

  • $\mathbf{v}_c$: center word embedding
  • $\mathbf{u}_o$: positive context word embedding
  • $\mathbf{u}_{w_i}$: negative sample embeddings
  • $\sigma$: sigmoid function
  • $P_n(w) \propto f(w)^{0.75}$: negative sampling distribution
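
In NumPy, the per-pair loss (the negative of this objective, matching the diagram above) can be computed roughly as follows, assuming v_c and u_o are embedding rows and u_neg is a (k, dim) matrix of negative-sample rows:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, u_neg):
    # L = -log σ(u_o·v_c) - Σ_i log σ(-u_neg_i·v_c), with a small epsilon for stability.
    pos = np.log(sigmoid(np.dot(u_o, v_c)) + 1e-10)
    neg = np.log(sigmoid(-(u_neg @ v_c)) + 1e-10).sum()
    return -(pos + neg)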

📈 Training Results

Training Curves

The model converges after ~10-15 epochs with learning rate decay:

  • Initial Loss: ~5.8
  • Final Loss: ~2.1
  • Training Time: ~45-60 minutes (15 epochs on CPU)
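
The exact decay schedule is not shown here; one plausible choice is the linear decay with a small floor used by the original word2vec implementation, adapted per epoch (an assumption, not necessarily the notebook's schedule):

# Assumed linear decay per epoch, floored at 0.01% of the initial rate.
lr = max(LEARNING_RATE * (1 - epoch / EPOCHS), LEARNING_RATE * 1e-4)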

Evaluation Metrics

Word Similarity Examples:

Query Word    Top 5 Similar Words
germany       austria, switzerland, italy, spain, france
president     minister, prime, government, leader
soccer        football, premier, league, championship

🎨 Visualizations

The notebook includes:

  1. 2D t-SNE Plots: Static matplotlib visualizations
  2. Interactive 2D Plots: Plotly scatter with hover tooltips
  3. 3D t-SNE Plots: Rotatable 3D word space exploration
  4. UMAP Projections: Alternative dimensionality reduction
  5. Similarity Heatmaps: Seaborn heatmaps of word relationships

Example Visualization

t-SNE 2D Visualization

Semantic clusters emerge: countries group together, sports terms cluster, political terms align.
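
An interactive 2-D t-SNE plot of this kind can be produced with scikit-learn and Plotly roughly as follows (a sketch using the embeddings loaded earlier; the notebook's plotting code may differ):

import plotly.express as px
from sklearn.manifold import TSNE

# Project a subset of word vectors to 2-D; t-SNE on the full vocabulary is slow.
n = 500
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings[:n])
words = [idx2word[i] for i in range(n)]

fig = px.scatter(x=coords[:, 0], y=coords[:, 1], hover_name=words,
                 title="t-SNE projection of word embeddings")
fig.show()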

🔬 Key Implementation Details

Negative Sampling Distribution

Words are sampled with probability proportional to $f(w)^{0.75}$:

import numpy as np

# Unigram counts raised to the 3/4 power, then normalized into a sampling distribution
word_freq = np.array([word_counts[i] for i in range(vocab_size)])
word_freq = word_freq ** 0.75
neg_sampling_dist = word_freq / word_freq.sum()
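
Negative context indices can then be drawn from this distribution, e.g. with np.random.choice (the notebook may additionally exclude the true context word from the draw):

# Draw NEG_SAMPLES negative indices for one positive (center, context) pair.
neg_idx = np.random.choice(vocab_size, size=NEG_SAMPLES, p=neg_sampling_dist)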

Gradient Computation

Analytical gradients for efficient training:

# Shared sigmoid terms (same sigmoid helper as in the loss sketch above)
sig_pos = sigmoid(np.dot(u_o, v_c))    # σ(u_o·v_c), scalar
sig_neg = sigmoid(u_neg @ v_c)         # σ(u_neg_i·v_c), shape (k,)

# Center word gradient: (σ(u_o·v_c) - 1)·u_o + Σ_i σ(u_neg_i·v_c)·u_neg_i
center_grad = (sig_pos - 1) * u_o + sig_neg @ u_neg

# Positive context gradient: (σ(u_o·v_c) - 1)·v_c
pos_grad = (sig_pos - 1) * v_c

# Negative context gradients: σ(u_neg_i·v_c)·v_c, one row per negative sample
neg_grad = sig_neg[:, None] * v_c
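
These gradients are then applied as plain SGD updates to the rows of W_in and W_out involved in the pair, roughly as below (index names are illustrative and lr is the current, possibly decayed, learning rate):

# SGD updates for one (center, positive, negatives) training example.
W_in[center_idx] -= lr * center_grad
W_out[pos_idx]   -= lr * pos_grad
W_out[neg_idx]   -= lr * neg_grad   # neg_idx: array of negative-sample row indices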

Numerical Stability

  • Score clipping: [-10, 10] before sigmoid
  • Gradient clipping: [-5, 5] max norm
  • Log-probability smoothing: log(x + 1e-10)
  • Xavier initialization for input embeddings
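
A sketch of the first two points, assuming element-wise clipping with np.clip (the notebook may clip by norm instead):

score = np.clip(np.dot(u_o, v_c), -10, 10)   # clip scores before the sigmoid
center_grad = np.clip(center_grad, -5, 5)    # clip gradients before the update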

💾 Checkpointing

The model saves checkpoints after each epoch:

checkpoints/
├── checkpoint_epoch_1.pkl
├── checkpoint_epoch_2.pkl
...
└── checkpoint_epoch_15.pkl

Each checkpoint contains:

  • Input embeddings (W_in)
  • Output embeddings (W_out)
  • Vocabulary mappings
  • Loss history
  • Hyperparameters
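
A checkpoint of this form might be written with pickle roughly as follows (the key names are illustrative, not necessarily those used in the notebook):

import pickle

checkpoint = {
    'W_in': W_in,                    # input embeddings
    'W_out': W_out,                  # output embeddings
    'word2idx': word2idx,            # vocabulary mappings
    'idx2word': idx2word,
    'loss_history': loss_history,
    'hyperparams': {'embedding_dim': EMBEDDING_DIM, 'window_size': WINDOW_SIZE,
                    'neg_samples': NEG_SAMPLES, 'learning_rate': LEARNING_RATE},
}
with open(f'checkpoints/checkpoint_epoch_{epoch}.pkl', 'wb') as f:
    pickle.dump(checkpoint, f)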

🧪 Experiments

The repository includes additional experiments:

  • HMM.ipynb: Hidden Markov Models for sequence labeling
  • word2vec_kaggle.ipynb: Kaggle-optimized version
  • read_ckpt.ipynb: Utilities for inspecting checkpoints

📚 References

  1. Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space" (arXiv:1301.3781)
  2. Mikolov, T., et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality" (arXiv:1310.4546)
  3. Goldberg, Y., & Levy, O. (2014). "word2vec Explained" (arXiv:1402.3722)
