
Assignment 3: Exploring Document Embeddings

Accept this assignment: GitHub Classroom Link

Due: February 2, 2026 at 11:59 PM EST

Click the link above to create your private repository for this assignment. Complete your work in Google Colab, then push your notebook to the repository before the deadline.


Timeline: 1 Week

Overview

"You shall know a word by the company it keeps." - J.R. Firth, 1957

In this assignment, you will explore how machines represent the meaning of documents. Working with Wikipedia articles, you will implement and compare embedding methods spanning several decades of computational linguistics, from classical statistical techniques to modern transformer-based models.

You will:

  • Implement 10+ different embedding approaches
  • Create beautiful visualizations using UMAP, clustering, and DataMapPlot
  • Evaluate embeddings through a document matching task
  • Reflect critically on what different methods capture about meaning

This is a 1-week assignment designed to be achievable with GenAI assistance. Focus on understanding the trade-offs between methods rather than exhaustive implementation details.

Learning Objectives

By completing this assignment, you will:

  • Understand the evolution of semantic representation from classical to modern NLP
  • Implement and compare diverse embedding methods
  • Create publication-quality visualizations of high-dimensional spaces
  • Evaluate embeddings quantitatively through a document matching task
  • Think critically about what different methods capture (and miss) about meaning

Dataset

We provide a curated dataset of 250,000 Wikipedia articles. The following code downloads and loads the dataset:

import os
import urllib.request
import pickle

# Download the dataset if it doesn't exist
dataset_url = 'https://www.dropbox.com/s/v4juxkc5v2rd0xr/wikipedia.pkl?dl=1'
dataset_path = 'wikipedia.pkl'

if not os.path.exists(dataset_path):
    print("Downloading dataset (~750MB)...")
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print("Download complete.")

# Load the dataset
with open(dataset_path, 'rb') as f:
    wikipedia = pickle.load(f)

print(f"Loaded {len(wikipedia)} articles")

Each article is a dictionary with:

  • 'title': Article title (string)
  • 'text': Full article text (string)
  • 'id': Unique identifier (string)
  • 'url': Wikipedia URL (string)

Important: For development and testing, start with a small subset (e.g., 5,000-10,000 articles). Scale up for final results as time permits.


Part 1: Implement Embedding Methods (40 points)

Implement at least 10 of the following embedding approaches. For each method, create document-level embeddings for your Wikipedia subset.

Classical Methods

1. Latent Semantic Analysis (LSA)

  • Use CountVectorizer to create a term-document matrix (raw counts, as in Deerwester et al., 1990)
  • Apply TruncatedSVD for dimensionality reduction (e.g., 300 dimensions)
  • Implementation: sklearn.feature_extraction.text.CountVectorizer + sklearn.decomposition.TruncatedSVD
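
A minimal sketch of this pipeline (the vocabulary size and other settings are illustrative starting points, not required values):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [article['text'] for article in wikipedia[:5000]]  # development subset

# Term-document matrix of raw counts, as in Deerwester et al. (1990)
counts = CountVectorizer(max_features=50000, stop_words='english').fit_transform(texts)

# Rank-300 approximation: each document becomes a dense 300-d vector
lsa_embeddings = TruncatedSVD(n_components=300, random_state=42).fit_transform(counts)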

2. TF-IDF + SVD

  • Variant of LSA using TF-IDF weighting instead of raw counts
  • Implementation: sklearn.feature_extraction.text.TfidfVectorizer + sklearn.decomposition.TruncatedSVD
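
The TF-IDF variant swaps only the vectorizer (texts as in the sketch above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(max_features=50000, stop_words='english').fit_transform(texts)
tfidf_svd_embeddings = TruncatedSVD(n_components=300, random_state=42).fit_transform(tfidf)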

Static Word Embeddings (aggregate to document level)

3. Word2Vec

  • Use pre-trained word2vec-google-news-300 from gensim
  • Aggregate word vectors via mean pooling (see the pooling sketch after method 5)
  • Implementation: gensim.downloader

4. GloVe

  • Use pre-trained glove-wiki-gigaword-300 from gensim
  • Aggregate via mean pooling
  • Implementation: gensim.downloader

5. FastText

  • Use pre-trained fasttext-wiki-news-subwords-300 from gensim
  • Handles out-of-vocabulary words via subword embeddings
  • Implementation: gensim.downloader
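
Methods 3-5 share the aggregation step. A sketch of mean pooling with gensim (lowercasing and skipping out-of-vocabulary words are simplifications; a FastText implementation could instead back off to subwords):

import numpy as np
import gensim.downloader

kv = gensim.downloader.load('word2vec-google-news-300')  # large download (~1.7 GB)

def mean_pool(text, kv):
    """Average the vectors of all in-vocabulary words in a document."""
    vecs = [kv[w] for w in text.lower().split() if w in kv]
    if not vecs:  # no known words at all: fall back to a zero vector
        return np.zeros(kv.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)

w2v_embeddings = np.vstack([mean_pool(a['text'], kv) for a in wikipedia[:5000]])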

Transformer-Based Embeddings

6. Sentence-BERT (all-MiniLM-L6-v2)

  • Lightweight, fast sentence transformer
  • Implementation: sentence-transformers

7. Sentence-BERT (all-mpnet-base-v2)

  • Higher quality, slower than MiniLM
  • Implementation: sentence-transformers

8. BGE (BAAI General Embedding)

  • Near the top of the MTEB benchmark at release
  • Try BAAI/bge-small-en-v1.5 or BAAI/bge-base-en-v1.5
  • Implementation: sentence-transformers

9. E5

  • Strong retrieval-focused embeddings
  • Try intfloat/e5-small-v2 or intfloat/e5-base-v2
  • Note: Requires prefix "passage: " for documents
  • Implementation: sentence-transformers

10. Nomic Embed

  • Open-weights, long-context model
  • Try nomic-ai/nomic-embed-text-v1.5
  • Implementation: sentence-transformers
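
Methods 6-10 all follow the sentence-transformers pattern sketched below; batch sizes are illustrative. Note that each model silently truncates inputs at its own maximum sequence length, and that per its model card, Nomic Embed additionally needs trust_remote_code=True and a task prefix such as "search_document: ".

from sentence_transformers import SentenceTransformer

texts = [a['text'] for a in wikipedia[:5000]]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sbert_embeddings = model.encode(
    texts,
    batch_size=64,                # tune for your GPU memory
    show_progress_bar=True,
    normalize_embeddings=True,    # unit vectors, so dot product = cosine similarity
    convert_to_numpy=True,
)

# E5 expects a task prefix on every document
e5 = SentenceTransformer('intfloat/e5-base-v2')
e5_embeddings = e5.encode(['passage: ' + t for t in texts],
                          batch_size=64, normalize_embeddings=True)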

Additional Options (pick any to reach 10+)

  • Doc2Vec: gensim.models.Doc2Vec (train on your corpus)
  • Universal Sentence Encoder: TensorFlow Hub
  • OpenAI Embeddings: text-embedding-3-small (requires API key)
  • Instructor Embeddings: Task-specific prompting
  • GTR: Google's T5-based dense retrieval embeddings (e.g., sentence-transformers/gtr-t5-base)

Deliverables for Part 1

  • Working code for each embedding method
  • Brief documentation of hyperparameters chosen
  • Embeddings stored in a consistent format (numpy arrays)

Part 2: Visualization (25 points)

Create visualizations to understand the structure of your embedding spaces.

2.1 Dimensionality Reduction with UMAP

For each embedding method:

  1. Apply UMAP to reduce to 2D: umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
  2. Store the 2D coordinates for visualization
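
A sketch of this loop, assuming embeddings is a dict mapping method names to (n_samples, d) numpy arrays (the variable names are illustrative):

import umap

umap_coords = {}
for name, X in embeddings.items():
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine',
                        random_state=42)  # fixed seed for reproducible maps
    umap_coords[name] = reducer.fit_transform(X)  # (n_samples, 2)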

2.2 Clustering

Apply clustering to discover document groups:

K-Means

  • Experiment with different k values (e.g., 10, 20, 50)
  • Use silhouette score to evaluate cluster quality
  • Implementation: sklearn.cluster.KMeans
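
A sketch of the k sweep, assuming X is one method's embedding matrix (silhouette is quadratic in sample count, so it is scored on a subsample here):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (10, 20, 50):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels, sample_size=10000, random_state=42)
    print(f"k={k}: silhouette = {score:.3f}")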

HDBSCAN

  • Density-based clustering that automatically determines cluster count
  • Handles noise/outliers gracefully
  • Implementation: hdbscan.HDBSCAN
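
A sketch, assuming coords_2d is the 2D UMAP output from Section 2.1 (clustering the reduced coordinates rather than the raw embeddings is a common choice, and min_cluster_size is a knob to tune):

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=10)
hdb_labels = clusterer.fit_predict(coords_2d)  # -1 marks noise/outlier points
print(f"{hdb_labels.max() + 1} clusters, {(hdb_labels == -1).sum()} noise points")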

2.3 DataMapPlot Visualizations

Create publication-quality visualizations using DataMapPlot:

import datamapplot

# Create a labeled visualization; cluster_labels should be strings
# (datamapplot reserves a noise label, "Unlabelled" by default)
fig, ax = datamapplot.create_plot(
    umap_coords,           # (n_samples, 2) array of 2D coordinates
    cluster_labels,        # string label for each point
    title="Wikipedia Embeddings",
    sub_title="Method: Sentence-BERT",
    label_wrap_width=20,
    darkmode=False
)

Required visualizations:

  1. At least 3 different embedding methods visualized with DataMapPlot
  2. Compare K-Means vs HDBSCAN clustering on the same embeddings
  3. Include cluster labels that are meaningful (e.g., use article titles or LLM-generated descriptions)
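
For item 3, one cheap heuristic is to label each cluster with the most common words in its article titles (label_clusters below is a hypothetical helper; an LLM-written description per cluster usually reads better):

from collections import Counter
import numpy as np

def label_clusters(cluster_ids, titles, top_k=3):
    """Turn integer cluster ids into readable string labels from article titles."""
    cluster_ids = np.asarray(cluster_ids)
    titles = np.asarray(titles, dtype=object)
    labels = np.empty(len(cluster_ids), dtype=object)
    for cid in np.unique(cluster_ids):
        mask = cluster_ids == cid
        if cid == -1:                      # HDBSCAN noise points
            labels[mask] = "Unlabelled"
            continue
        words = Counter(w.lower() for t in titles[mask] for w in t.split())
        labels[mask] = ", ".join(w for w, _ in words.most_common(top_k))
    return labels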

2.4 Comparison Visualizations

Create at least one visualization that directly compares methods:

  • Side-by-side DataMapPlots of different embedding methods
  • Or overlay showing how the same articles cluster differently

Deliverables for Part 2

  • UMAP coordinates for all embedding methods
  • Cluster assignments (K-Means and HDBSCAN)
  • At least 5 high-quality DataMapPlot visualizations
  • Brief analysis of what you observe

Part 3: Document Matching Evaluation (20 points)

Evaluate embedding quality through a document matching task.

The Task

For each document:

  1. Split it into two halves (first half and second half of the text)
  2. Embed each half separately
  3. For the first half, find the most similar embedding among all second halves
  4. A "match" is correct if the retrieved second half belongs to the same original document

This tests whether embeddings capture document-level semantics consistently.
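
A minimal way to build the two halves, splitting on whitespace so words stay intact:

def split_document(text):
    """Split a document into first and second halves by word count."""
    words = text.split()
    mid = len(words) // 2
    return ' '.join(words[:mid]), ' '.join(words[mid:])

first_halves, second_halves = zip(*(split_document(a['text']) for a in wikipedia[:5000]))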

Implementation

def evaluate_document_matching(embeddings_first_half, embeddings_second_half):
    """
    Compute matching accuracy.
    
    For each first-half embedding, find the nearest second-half embedding.
    Return the fraction where the nearest neighbor is the correct match.
    """
    # Your implementation here
    pass

Metrics to Report

For each embedding method, report:

  1. Accuracy@1: Fraction where the correct second half is the top match
  2. Accuracy@5: Fraction where the correct second half is in the top 5 matches
  3. Mean Reciprocal Rank (MRR): Average of 1/rank for the correct match
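
Once both halves are embedded, all three metrics follow from the rank of the correct match. A sketch, assuming the rows of first and second are L2-normalized so the dot product equals cosine similarity (for very large subsets, compute the similarity matrix in chunks):

import numpy as np

def matching_metrics(first, second):
    """first, second: (n_docs, d) arrays; row i of each comes from document i."""
    sims = first @ second.T                    # (n_docs, n_docs) cosine similarities
    order = np.argsort(-sims, axis=1)          # candidate matches, best first
    ranks = np.where(order == np.arange(len(first))[:, None])[1] + 1
    return {'accuracy@1': np.mean(ranks == 1),
            'accuracy@5': np.mean(ranks <= 5),
            'mrr': np.mean(1.0 / ranks)}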

Visualization

Create a bar plot with error bars (or another appropriate visualization) comparing all embedding methods. Use bootstrap resampling to compute confidence intervals for your metrics.
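
A sketch of the bootstrap, resampling per-document outcomes (the helper name, n_boot, and the 95% level are illustrative choices):

import numpy as np

def bootstrap_ci(hits, n_boot=1000, seed=0):
    """hits: array of per-document 0/1 outcomes (e.g., correct-at-1 indicators)."""
    rng = np.random.default_rng(seed)
    hits = np.asarray(hits)
    n = len(hits)
    means = [hits[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])  # 95% confidence interval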

Deliverables for Part 3

  • Implementation of the document matching evaluation
  • Bar plot visualization comparing all embedding methods (with error bars)
  • Brief analysis: Which methods perform best? Why might that be?

Part 4: Reflection Essays (15 points)

Write 1-2 short essays (each 300-500 words) reflecting on your findings.

Essay 1: Trade-offs Between Methods (Required)

Discuss the trade-offs you observed:

  • Speed vs. Quality: Which methods are fast but lower quality? Which are slow but excellent?
  • Interpretability: Can you understand what LSA dimensions or clusters represent? How about transformer embeddings?
  • What each method captures: Do some methods capture topical similarity while others capture stylistic similarity?
  • Practical recommendations: When would you use each method in a real application?

Essay 2: What Is Meaning? (Choose this OR your own topic)

Reflect on the deeper questions:

  • What does it mean for a machine to "understand" meaning?
  • Do embeddings truly capture semantics, or just statistical patterns?
  • What aspects of human understanding are missing from these representations?
  • How do the limitations you observed connect to broader questions about AI and language?

Alternative: Write about a topic of your choice related to embeddings, visualization, or semantic representation. Clear it with the instructor if unsure.

Deliverables for Part 4

  • 1-2 essays in markdown cells in your notebook
  • Thoughtful engagement with the material (not surface-level observations)

Submission Guidelines

GitHub Classroom Submission

  1. Accept the assignment via the GitHub Classroom link above
  2. Clone your repository
  3. Complete your work in Google Colab
  4. Push your notebook to the repository before the deadline

Notebook Requirements

Your notebook should:

  • Run completely in Google Colab with GPU runtime
  • Include all code to download data and install dependencies
  • Have clear markdown sections matching the assignment parts
  • Show all visualizations inline
  • Include your reflection essays as markdown cells

Before Submitting

  • Notebook runs from start to finish without errors
  • All 10+ embedding methods implemented
  • At least 5 DataMapPlot visualizations included
  • Document matching evaluation complete with bar plot (including error bars)
  • 1-2 reflection essays written
  • Code is reasonably commented and organized

Grading Rubric (100 points)

Component               Points   Criteria
Part 1: Embeddings      40       10+ methods implemented correctly, reasonable hyperparameters
Part 2: Visualization   25       UMAP + clustering + 5+ quality DataMapPlot visualizations
Part 3: Evaluation      20       Document matching implemented, bar plot with error bars, analysis
Part 4: Essays          15       Thoughtful, substantive reflection (300-500 words each)

Bonus opportunities:

  • Exceptionally insightful analysis (+5)
  • Creative additional visualizations (+3)
  • Thorough comparison of clustering methods (+2)

Tips for Success

Start Small

  • Begin with 1,000-5,000 articles for development
  • Get your full pipeline working before scaling up
  • Cache embeddings to avoid recomputation
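
A minimal caching pattern (the helper name and cache layout are illustrative):

import os
import numpy as np

def cached_embed(name, texts, embed_fn, cache_dir='embeddings'):
    """Compute embeddings once; reload them from disk on subsequent runs."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f'{name}.npy')
    if os.path.exists(path):
        return np.load(path)
    emb = np.asarray(embed_fn(texts))
    np.save(path, emb)
    return emb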

Computational Efficiency

  • Use GPU runtime in Colab for transformer models
  • Process documents in batches
  • For large models, consider float16 precision

Recommended Timeline

Day    Tasks
1-2    Set up environment, implement 5+ embedding methods
3-4    Complete remaining embeddings, create visualizations
5      Implement document matching evaluation
6      Run final experiments, write essays
7      Polish, verify reproducibility, submit

Using GenAI Effectively

You're encouraged to use ChatGPT, Claude, or Copilot to:

  • Debug errors and understand library APIs
  • Generate boilerplate code
  • Explain concepts from papers

You must:

  • Understand all code you submit
  • Write your own analysis and essays
  • Document significant AI assistance used

Resources

Key Libraries

  • scikit-learn: CountVectorizer, TfidfVectorizer, TruncatedSVD, KMeans, silhouette_score
  • gensim: pre-trained word embeddings via gensim.downloader
  • sentence-transformers: transformer-based document embeddings
  • umap-learn: dimensionality reduction
  • hdbscan: density-based clustering
  • datamapplot: labeled map visualizations

Papers

  • Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis. Blackwell.
  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Questions?

  • Post on the course forum
  • Attend office hours
  • Email the instructor

Good luck exploring the semantic space!
