Accept this assignment: GitHub Classroom Link
Due: February 2, 2026 at 11:59 PM EST
Click the link above to create your private repository for this assignment. Complete your work in Google Colab, then push your notebook to the repository before the deadline.
Timeline: 1 Week
"You shall know a word by the company it keeps." - J.R. Firth, 1957
In this assignment, you will explore how machines represent the meaning of documents. Working with Wikipedia articles, you will implement and compare embedding methods spanning five decades of computational linguistics—from classical statistical techniques to modern transformer-based models.
You will:
- Implement at least 10 different embedding approaches
- Create beautiful visualizations using UMAP, clustering, and DataMapPlot
- Evaluate embeddings through a document matching task
- Reflect critically on what different methods capture about meaning
This is a 1-week assignment designed to be achievable with GenAI assistance. Focus on understanding the trade-offs between methods rather than exhaustive implementation details.
By completing this assignment, you will:
- Understand the evolution of semantic representation from classical to modern NLP
- Implement and compare diverse embedding methods
- Create publication-quality visualizations of high-dimensional spaces
- Evaluate embeddings quantitatively through a document matching task
- Think critically about what different methods capture (and miss) about meaning
We provide a curated dataset of 250,000 Wikipedia articles. The following code downloads and loads the dataset:
```python
import os
import urllib.request
import pickle

# Download the dataset if it doesn't exist
dataset_url = 'https://www.dropbox.com/s/v4juxkc5v2rd0xr/wikipedia.pkl?dl=1'
dataset_path = 'wikipedia.pkl'

if not os.path.exists(dataset_path):
    print("Downloading dataset (~750MB)...")
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print("Download complete.")

# Load the dataset
with open(dataset_path, 'rb') as f:
    wikipedia = pickle.load(f)

print(f"Loaded {len(wikipedia)} articles")
```

Each article is a dictionary with:

- `'title'`: Article title (string)
- `'text'`: Full article text (string)
- `'id'`: Unique identifier (string)
- `'url'`: Wikipedia URL (string)
Important: For development and testing, start with a small subset (e.g., 5,000-10,000 articles). Scale up for final results as time permits.
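A quick way to carve out a development subset — a minimal sketch, assuming `wikipedia` is the list loaded above (the subset size is up to you):

```python
import random

# Fixed seed so the development subset is reproducible across runs
random.seed(42)
dev_articles = random.sample(wikipedia, 5_000)
texts = [a['text'] for a in dev_articles]
```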
Part 1: Embeddings

Implement at least 10 of the following embedding approaches. For each method, create document-level embeddings for your Wikipedia subset.
1. Latent Semantic Analysis (LSA)
   - Use `CountVectorizer` to create a term-document matrix (raw counts, as in Deerwester et al., 1990)
   - Apply `TruncatedSVD` for dimensionality reduction (e.g., 300 dimensions)
   - Implementation: `sklearn.feature_extraction.text.CountVectorizer` + `sklearn.decomposition.TruncatedSVD`
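A minimal sketch of this pipeline, assuming `texts` is your list of article strings (`max_features` and `stop_words` are illustrative choices, not required values):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Term-document matrix of raw counts, then rank-300 truncated SVD
vectorizer = CountVectorizer(max_features=50_000, stop_words='english')
counts = vectorizer.fit_transform(texts)        # sparse (n_docs, n_terms)
svd = TruncatedSVD(n_components=300, random_state=42)
lsa_embeddings = svd.fit_transform(counts)      # dense (n_docs, 300)
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the method 2 variant below.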
2. TF-IDF + SVD
   - Variant of LSA using TF-IDF weighting instead of raw counts
   - Implementation: `sklearn.feature_extraction.text.TfidfVectorizer` + `sklearn.decomposition.TruncatedSVD`
3. Word2Vec
   - Use pre-trained `word2vec-google-news-300` from `gensim`
   - Aggregate word vectors via mean pooling
   - Implementation: `gensim.downloader`
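One way to do the mean pooling — a sketch, assuming `texts` as above; the same helper works for the GloVe and FastText vectors in methods 4-5 (the whitespace tokenization is a deliberate simplification):

```python
import numpy as np
import gensim.downloader

w2v = gensim.downloader.load('word2vec-google-news-300')

def mean_pool(text, kv, dim=300):
    # Average the vectors of all in-vocabulary tokens; zeros if none match
    tokens = text.lower().split()
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

w2v_embeddings = np.vstack([mean_pool(t, w2v) for t in texts])
```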
4. GloVe
   - Use pre-trained `glove-wiki-gigaword-300` from `gensim`
   - Aggregate via mean pooling
   - Implementation: `gensim.downloader`
5. FastText
   - Use pre-trained `fasttext-wiki-news-subwords-300` from `gensim`
   - Handles out-of-vocabulary words via subword embeddings
   - Implementation: `gensim.downloader`
6. Sentence-BERT (all-MiniLM-L6-v2)
   - Lightweight, fast sentence transformer
   - Implementation: `sentence-transformers`
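The `sentence-transformers` methods (6-10) all follow the same pattern; a sketch for MiniLM, again assuming `texts`:

```python
from sentence_transformers import SentenceTransformer

# Batched encoding; uses the GPU automatically if one is available
model = SentenceTransformer('all-MiniLM-L6-v2')
sbert_embeddings = model.encode(texts, batch_size=64,
                                show_progress_bar=True,
                                convert_to_numpy=True)
```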
7. Sentence-BERT (all-mpnet-base-v2)
   - Higher quality, but slower than MiniLM
   - Implementation: `sentence-transformers`
8. BGE (BAAI General Embedding)
   - State-of-the-art on the MTEB benchmark
   - Try `BAAI/bge-small-en-v1.5` or `BAAI/bge-base-en-v1.5`
   - Implementation: `sentence-transformers`
9. E5
   - Strong retrieval-focused embeddings
   - Try `intfloat/e5-small-v2` or `intfloat/e5-base-v2`
   - Note: requires the prefix `"passage: "` for documents
   - Implementation: `sentence-transformers`
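The prefix handling E5 expects, as a sketch (normalizing the embeddings is optional but conventional for these models):

```python
from sentence_transformers import SentenceTransformer

# E5 was trained with role prefixes: "passage: " for documents,
# "query: " for queries
model = SentenceTransformer('intfloat/e5-small-v2')
e5_embeddings = model.encode(['passage: ' + t for t in texts],
                             normalize_embeddings=True)
```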
10. Nomic Embed
    - Open-weights, long-context model
    - Try `nomic-ai/nomic-embed-text-v1.5`
    - Implementation: `sentence-transformers`
Additional options:

- Doc2Vec: `gensim.models.Doc2Vec` (train on your corpus)
- Universal Sentence Encoder: TensorFlow Hub
- OpenAI Embeddings: `text-embedding-3-small` (requires an API key)
- Instructor Embeddings: Task-specific prompting
- GTR: Google's Text Representations
Deliverables:

- Working code for each embedding method
- Brief documentation of the hyperparameters chosen
- Embeddings stored in a consistent format (NumPy arrays)
Part 2: Visualization

Create visualizations to understand the structure of your embedding spaces.
For each embedding method:
- Apply UMAP to reduce to 2D: `umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')`
- Store the 2D coordinates for visualization
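A sketch of the reduction step, assuming `embeddings` is one method's `(n_docs, d)` array:

```python
import umap

# Cosine metric matches how most embedding methods measure similarity
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine',
                    random_state=42)
umap_coords = reducer.fit_transform(embeddings)    # (n_docs, 2)
```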
Apply clustering to discover document groups:
K-Means

- Experiment with different k values (e.g., 10, 20, 50)
- Use the silhouette score to evaluate cluster quality
- Implementation: `sklearn.cluster.KMeans`
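A sketch of the k sweep; the `sample_size` argument is our assumption to keep the silhouette computation tractable on larger subsets, not a requirement:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (10, 20, 50):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(embeddings)
    # Silhouette on a random sample of points for speed
    score = silhouette_score(embeddings, labels, metric='cosine',
                             sample_size=10_000, random_state=42)
    print(f"k={k}: silhouette={score:.3f}")
```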
HDBSCAN

- Density-based clustering that automatically determines the cluster count
- Handles noise/outliers gracefully
- Implementation: `hdbscan.HDBSCAN`
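A sketch; clustering the 2D UMAP coordinates is one common choice with HDBSCAN, and `min_cluster_size` is an illustrative value:

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
hdb_labels = clusterer.fit_predict(umap_coords)    # label -1 marks noise
```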
Create publication-quality visualizations using DataMapPlot:
```python
import datamapplot

# Create labeled visualization
fig, ax = datamapplot.create_plot(
    umap_coords,       # (n_samples, 2) array
    cluster_labels,    # label for each point
    title="Wikipedia Embeddings",
    sub_title="Method: Sentence-BERT",
    label_wrap_width=20,
    darkmode=False,
)
```

Required visualizations:
- At least 3 different embedding methods visualized with DataMapPlot
- Compare K-Means vs HDBSCAN clustering on the same embeddings
- Include cluster labels that are meaningful (e.g., use article titles or LLM-generated descriptions)
Create at least one visualization that directly compares methods:
- Side-by-side DataMapPlots of different embedding methods
- Or overlay showing how the same articles cluster differently
Deliverables:

- UMAP coordinates for all embedding methods
- Cluster assignments (K-Means and HDBSCAN)
- At least 5 high-quality DataMapPlot visualizations
- A brief analysis of what you observe
Part 3: Evaluation

Evaluate embedding quality through a document matching task.
For each document:
- Split it into two halves (first half and second half of the text)
- Embed each half separately
- For the first half, find the most similar embedding among all second halves
- A "match" is correct if the retrieved second half belongs to the same original document
This tests whether embeddings capture document-level semantics consistently.
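The splitting step, as a minimal sketch (assuming `articles` is your list of article dicts; the matching itself is yours to implement below):

```python
def split_halves(text):
    # Split on whitespace so each half breaks at a word boundary
    words = text.split()
    mid = len(words) // 2
    return ' '.join(words[:mid]), ' '.join(words[mid:])

first_halves, second_halves = zip(*(split_halves(a['text']) for a in articles))
```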
```python
def evaluate_document_matching(embeddings_first_half, embeddings_second_half):
    """
    Compute matching accuracy.

    For each first-half embedding, find the nearest second-half embedding.
    Return the fraction where the nearest neighbor is the correct match.
    """
    # Your implementation here
    pass
```

For each embedding method, report:
- Accuracy@1: Fraction where the correct second half is the top match
- Accuracy@5: Fraction where the correct second half is in the top 5 matches
- Mean Reciprocal Rank (MRR): Average of 1/rank for the correct match
Create a bar plot with error bars (or another appropriate visualization) comparing all embedding methods. Use bootstrap resampling to compute confidence intervals for your metrics.
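A bootstrap sketch for the confidence intervals, assuming `hits` holds one per-document score (e.g., 0/1 for Accuracy@1, or reciprocal ranks for MRR):

```python
import numpy as np

def bootstrap_ci(hits, n_boot=1000, alpha=0.05, seed=0):
    # Resample per-document scores with replacement; the quantiles of
    # the resampled means give the confidence interval
    rng = np.random.default_rng(seed)
    hits = np.asarray(hits, dtype=float)
    boot_means = np.array([rng.choice(hits, size=len(hits)).mean()
                           for _ in range(n_boot)])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return hits.mean(), lo, hi
```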
Deliverables:

- Implementation of the document matching evaluation
- Bar plot comparing all embedding methods (with error bars)
- Brief analysis: Which methods perform best? Why might that be?
Part 4: Essays

Write 1-2 short essays (each 300-500 words) reflecting on your findings.
Discuss the trade-offs you observed:
- Speed vs. Quality: Which methods are fast but lower quality? Which are slow but excellent?
- Interpretability: Can you understand what LSA dimensions or clusters represent? How about transformer embeddings?
- What each method captures: Do some methods capture topical similarity while others capture stylistic similarity?
- Practical recommendations: When would you use each method in a real application?
Reflect on the deeper questions:
- What does it mean for a machine to "understand" meaning?
- Do embeddings truly capture semantics, or just statistical patterns?
- What aspects of human understanding are missing from these representations?
- How do the limitations you observed connect to broader questions about AI and language?
Alternative: Write about a topic of your choice related to embeddings, visualization, or semantic representation. Clear it with the instructor if unsure.
Deliverables:

- 1-2 essays in markdown cells in your notebook
- Thoughtful engagement with the material (not surface-level observations)
- Accept the assignment via the GitHub Classroom link above
- Clone your repository
- Complete your work in Google Colab
- Push your notebook to the repository before the deadline
Your notebook should:
- Run completely in Google Colab with GPU runtime
- Include all code to download data and install dependencies
- Have clear markdown sections matching the assignment parts
- Show all visualizations inline
- Include your reflection essays as markdown cells
- Notebook runs from start to finish without errors
- All 10+ embedding methods implemented
- At least 5 DataMapPlot visualizations included
- Document matching evaluation complete with bar plot (including error bars)
- 1-2 reflection essays written
- Code is reasonably commented and organized
| Component | Points | Criteria |
|---|---|---|
| Part 1: Embeddings | 40 | 10+ methods implemented correctly, reasonable hyperparameters |
| Part 2: Visualization | 25 | UMAP + clustering + 5+ quality DataMapPlot visualizations |
| Part 3: Evaluation | 20 | Document matching implemented, bar plot with error bars, analysis |
| Part 4: Essays | 15 | Thoughtful, substantive reflection (300-500 words each) |
Bonus opportunities:
- Exceptionally insightful analysis (+5)
- Creative additional visualizations (+3)
- Thorough comparison of clustering methods (+2)
- Begin with 1,000-5,000 articles for development
- Get your full pipeline working before scaling up
- Cache embeddings to avoid recomputation (see the sketch after this list)
- Use a GPU runtime in Colab for transformer models
- Process documents in batches
- For large models, consider `float16` precision
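A simple caching pattern — a sketch; the file name and helper are illustrative:

```python
import os
import numpy as np

def get_or_compute(path, compute_fn):
    # Load cached embeddings if present; otherwise compute and save them
    if os.path.exists(path):
        return np.load(path)
    embeddings = compute_fn()
    np.save(path, embeddings)
    return embeddings

# e.g. emb = get_or_compute('minilm.npy', lambda: model.encode(texts))
```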
| Day | Tasks |
|---|---|
| 1-2 | Set up environment, implement 5+ embedding methods |
| 3-4 | Complete remaining embeddings, create visualizations |
| 5 | Implement document matching evaluation |
| 6 | Run final experiments, write essays |
| 7 | Polish, verify reproducibility, submit |
You're encouraged to use ChatGPT, Claude, or Copilot to:
- Debug errors and understand library APIs
- Generate boilerplate code
- Explain concepts from papers
You must:
- Understand all code you submit
- Write your own analysis and essays
- Document significant AI assistance used
- Deerwester et al. (1990). Indexing by Latent Semantic Analysis
- Mikolov et al. (2013). Word2Vec
- Reimers & Gurevych (2019). Sentence-BERT
- McInnes et al. (2018). UMAP
- Post on the course forum
- Attend office hours
- Email the instructor
Good luck exploring the semantic space!