# Introduction: Large Language Models

Welcome to the first lesson in the AI module! In this lesson, you’ll get a behind-the-scenes look at how Large Language Models (LLMs) like ChatGPT work. We’ll start with a quick tour of natural language processing (NLP)—the field that helps computers understand and generate human language. Then you’ll learn how models like GPT are built, connecting them to many of the machine learning concepts you learned in the previous module. By the end, you’ll have a solid, intuitive sense of how AI models represent meaning and generate text.

## 1. Natural language processing (NLP)
While we will provide a brief overview of NLP, we could easily spend an entire course on the topic. If you want to learn more about NLP, check out the following resources:

In the rest of this lesson, we will learn some of the technical basics of how LLMs like ChatGPT work, and try to demystify their operations. Ultimately, they are just another kind of machine learning model, trained to predict the next token in a string of tokens.

## 2. Large language models (LLMs)
### LLMs: autocomplete at scale
Modern LLMs are machine learning models that are trained to predict the next word in a sequence, given all the words that came before. Imagine starting a sentence and asking the model to fill in the blank:

Now comes the final moment. We ask: given that 3D vector, which of the 50,000 possible tokens should come next? This is done by a very simple linear neural network (two layers) that acts as a translator from the model's embedding space back to actual tokens. We can think of it as a "de-embedding".
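
To make this last step concrete, here is a minimal sketch of the idea, using a toy 3-dimensional embedding and a five-word vocabulary in place of the model's real ~50,000 tokens (all names, weights, and numbers below are made up for illustration): a linear layer assigns one score (a "logit") per token, and a softmax turns those scores into probabilities.

```python
import numpy as np

# Toy vocabulary and "de-embedding" weights: one row of weights per token.
# In a real model, this matrix is learned during training.
vocab = ["cat", "dog", "sat", "the", "mat"]
rng = np.random.default_rng(0)
W_unembed = rng.normal(size=(len(vocab), 3))  # (5 tokens) x (3 embedding dims)

# The model's output vector for the current position (illustrative values)
output_vector = np.array([0.9, -1.2, 0.3])

# Linear layer: one score (logit) per token in the vocabulary
logits = W_unembed @ output_vector

# Softmax turns the scores into a probability distribution over the vocabulary
probs = np.exp(logits) / np.sum(np.exp(logits))
print(dict(zip(vocab, probs.round(3))))
```

The token with the highest probability is the model's best guess at what comes next.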

## 4. Demo: Visualizing embeddings
In the following demonstration, we will visualize text embeddings based on their similarity.

> Aside: this demo uses the OpenAI API, so you will need an API key. We assume it is saved in a `.env` file in your project directory as a line of the form `OPENAI_API_KEY=...` (see the README.md for details on handling API keys).

First, load the API key from the `.env` file:

```python
from dotenv import load_dotenv

# Read variables (including OPENAI_API_KEY) from the .env file into the environment
if load_dotenv():
    print("Successfully loaded api key")
```
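
If nothing prints, the key was not loaded; check that your `.env` file exists in the directory you launched the notebook from.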

Next, we define a list of movie summaries that we will use to detect semantic similarities.

```python
movie_summaries = [
    # Marvel Superhero Movies
    {
        "title": "Iron Man (2008)",
        "summary": "Billionaire genius Tony Stark builds a high-tech suit to escape captivity and becomes Iron Man, fighting global threats with his wit and advanced technology."
    },
    {
        "title": "The Avengers (2012)",
        "summary": "Earth’s mightiest heroes, including Iron Man, Captain America, Thor, and Hulk, unite to stop Loki and his alien army from conquering the planet."
    },
    {
        "title": "Black Panther (2018)",
        "summary": "T’Challa, king of Wakanda, embraces his role as Black Panther to protect his nation and the world from a powerful enemy threatening their vibranium resources."
    },
    {
        "title": "Spider-Man: No Way Home (2021)",
        "summary": "Peter Parker, unmasked as Spider-Man, teams up with alternate-universe heroes to battle villains from across the multiverse after a spell goes wrong."
    },
    {
        "title": "Captain Marvel (2019)",
        "summary": "Carol Danvers unlocks her cosmic powers as Captain Marvel, joining the fight against the Kree-Skrull war while uncovering her lost memories on Earth."
    },
    # Christmas-Themed Movies
    {
        "title": "Home Alone (1990)",
        "summary": "Young Kevin is accidentally left behind during Christmas vacation and must defend his home from bumbling burglars with clever traps and holiday spirit."
    },
    {
        "title": "Elf (2003)",
        "summary": "Buddy, a human raised by elves, journeys to New York City to find his real father, spreading Christmas cheer in a world that’s lost its festive spark."
    },
    {
        "title": "The Polar Express (2004)",
        "summary": "A young boy boards a magical train to the North Pole, embarking on a heartwarming adventure that tests his belief in the magic of Christmas."
    },
    {
        "title": "A Christmas Carol (2009)",
        "summary": "Ebenezer Scrooge, a miserly old man, is visited by three ghosts on Christmas Eve, learning the value of kindness and the true meaning of the holiday."
    },
    {
        "title": "Love Actually (2003)",
        "summary": "Interwoven stories of love, loss, and connection unfold in London during the Christmas season, celebrating the messy beauty of human relationships."
    },
    # Romantic Comedies
    {
        "title": "When Harry Met Sally... (1989)",
        "summary": "Harry and Sally’s evolving friendship over years sparks debates about love and friendship, culminating in a heartfelt realization during a New Year’s Eve confession."
    },
    {
        "title": "The Proposal (2009)",
        "summary": "A high-powered executive forces her assistant into a fake engagement to avoid deportation, leading to unexpected romance during a chaotic family weekend in Alaska."
    },
    {
        "title": "Crazy Rich Asians (2018)",
        "summary": "Rachel Chu accompanies her boyfriend to Singapore, facing his ultra-wealthy family’s disapproval in a whirlwind of opulence, tradition, and newfound love."
    },
    {
        "title": "10 Things I Hate About You (1999)",
        "summary": "A rebellious teen, Kat, is wooed by bad-boy Patrick in a modern Shakespearean tale of high school romance, deception, and heartfelt connection."
    },
    {
        "title": "Notting Hill (1999)",
        "summary": "A humble London bookseller falls for a famous American actress, navigating fame, cultural clashes, and personal insecurities to pursue an unlikely love story."
    }
]
```
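
Notice that the list contains five movies from each of three genres: Marvel superhero films, Christmas-themed movies, and romantic comedies. If the embeddings really capture meaning, we should be able to recover this grouping from the vectors alone.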

### Generate embeddings

The `text-embedding-3-small` model converts text into numerical vectors that capture the "meaning" of the input. Above, we discussed embedding vectors for single tokens, but OpenAI's model can do this for text of any length, such as our movie summaries.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Request an embedding vector for each movie summary
embeddings = []
for movie in movie_summaries:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=movie["summary"]
    )
    embeddings.append(response.data[0].embedding)
embeddings = np.array(embeddings)
print(embeddings.shape)
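
If the requests succeed, the printed shape should be `(15, 1536)`: one embedding per summary, each with 1,536 dimensions (assuming the default output size of `text-embedding-3-small`).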

### Examine embeddings in 2D using PCA

PCA reduces the high-dimensional embeddings (1,536 dimensions) down to just two dimensions, making them easy to visualize in an intuitive plot. The plot makes the similarity relations among the embeddings clearer, revealing the semantic structure shared by the summaries.

We discussed above just how important this kind of geometric perspective is to LLMs and self-attention!

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the embeddings down to a 2D space
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot each summary as a point, labeled with its movie title
plt.figure(figsize=(8, 6))
for i, movie in enumerate(movie_summaries):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])
    plt.text(embeddings_2d[i, 0] + 0.02, embeddings_2d[i, 1], movie["title"], size=8)
plt.title("2D Visualization of Summary Embeddings")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
```
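
Summaries with similar meanings should land near each other in the plot, so you will likely see three loose clusters corresponding to the Marvel movies, the Christmas movies, and the romantic comedies.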

We will see in the assignment that you can feed any text you'd like into the embedding model; it is a lot of fun to probe the semantic map embodied in these models.
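
If you want to probe the map quantitatively rather than eyeballing the PCA plot, one common approach is cosine similarity, which measures how closely two embedding vectors point in the same direction. Here is a minimal sketch reusing the `embeddings` array from above; the exact values you get will depend on the model.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the two vectors point in exactly the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Indices follow the movie_summaries list above
iron_man, home_alone, elf = embeddings[0], embeddings[5], embeddings[6]
print("Home Alone vs. Elf:     ", cosine_similarity(home_alone, elf))
print("Home Alone vs. Iron Man:", cosine_similarity(home_alone, iron_man))
# We would expect the two Christmas movies to score higher than the cross-genre pair
```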

## 5. Key points
Congrats! We just covered the basics of natural language processing and how modern LLMs work. Some of the key points we covered:

- Modern NLP helps computers understand and generate human language using data-driven deep learning rather than hand-crafted rules.
- Large Language Models (LLMs) like GPT are trained with self-supervised learning to predict the next word in a sequence -- essentially, autocomplete at scale.
- Tokenization, embeddings, and attention let models capture word meaning and context dynamically.
- Transformers incorporate attention layers and neural networks to generate context-aware text predictions.

While in subsequent lessons we will use APIs that rely on models built with these architectures, it's important to understand what's happening under the hood. Hopefully, knowing a little bit about how tokenization, embeddings, and attention work will help demystify LLMs and give you intuition for why they can generate language so effectively (and why they sometimes make such strange errors, which we will discuss in Lesson N).