
Commit a126036

LLM architecture section: tokenizers
1 parent cb5d75c commit a126036

2 files changed: +64 −22 lines changed

lessons/05_AI_intro/01_intro_nlp_llms.md

Lines changed: 64 additions & 22 deletions
@@ -22,76 +22,118 @@ The goals of NLP span a wide spectrum, including:
Ultimately, NLP aims to make human-computer interaction as intuitive as human-to-human exchanges, so it can be used in fields as diverse as healthcare diagnostics, explaining complex legal documents, and personalized education.

### NLP Methods
The field of NLP has been transformed over the past 50 years, progressing from rigid, rule-based approaches to data-driven, adaptive techniques that leverage machine learning and neural networks.

In its early days, NLP depended on rule-based systems and hand-crafted grammars to parse linguistic inputs. In that era, human experts manually encoded explicit linguistic rules into NLP systems. These methods were brittle and struggled with the variability of real-world language.

The shift to statistical methods in the late 20th century marked a turning point, incorporating probabilities to model patterns in language. This paved the way for machine learning, where algorithms learn directly from examples. Today, the dominant paradigm is deep learning. As we saw last week, deep learning is a subset of machine learning that uses neural networks with multiple layers to automatically extract features from raw data.

At the forefront of this approach are large language models (LLMs), such as those powering tools like GPT, which are pre-trained on billions of words drawn from internet-scale sources of text.

The release of ChatGPT (from OpenAI) on November 30, 2022, was a watershed moment in the history of NLP. It led to a massive surge in public awareness and usage of LLMs. This easily accessible chatbot allowed millions to interact directly with an advanced LLM. The app amassed over a million users in one day, instantly creating awareness of the power of AI. It also accelerated the adoption of AI in industries like education, customer service, and content creation.

The newest wave of LLMs inspired a surge in research, and spawned ethical debates on issues like misinformation (LLM hallucinations), job displacement, and even concerns about [conscious AI](https://www.scientificamerican.com/article/google-engineer-claims-ai-chatbot-is-sentient-why-that-matters/). For an interesting discussion of the impact of ChatGPT on the field of NLP, see the [oral history in Quanta Magazine](https://www.quantamagazine.org/when-chatgpt-broke-an-entire-field-an-oral-history-20250430/).

Since 2022, LLMs have shifted from being mostly academic curiosities to tools that attract billions in investment every year, and they are reshaping how people learn and interact with computers.

In the rest of this lesson, we will learn some of the technical basics of how LLMs like ChatGPT work, and try to demystify their operations. Ultimately, they are just another kind of machine learning model, trained to predict the next token in a string of tokens.

## 2. Large language models (LLMs)
### LLMs: autocomplete at scale
Modern LLMs are machine learning models that are trained to predict the next word in a sequence, given all the words that came before. Imagine starting a sentence, and the model is tasked with filling in the blank:

The cat sat on the ___

The model looks at the context -- the first five words -- and generates a probability distribution over possible next words. It might estimate that "mat" has a 70% chance, "floor" 20%, "sofa" 5%, and so on. It then picks the most likely candidate (or sometimes samples from that distribution to keep things more varied).

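To make this concrete, here is a minimal sketch in Python of choosing a next word from such a distribution. The probabilities are made up for illustration; they are not the output of any real model.

```python
import random

# hypothetical next-word probabilities for the context "The cat sat on the"
next_word_probs = {"mat": 0.70, "floor": 0.20, "sofa": 0.05, "rug": 0.05}

# greedy decoding: always take the single most likely word
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# sampling: draw a word at random, weighted by its probability,
# which keeps generated text more varied
sampled_choice = random.choices(
    list(next_word_probs), weights=list(next_word_probs.values()), k=1
)[0]

print(greedy_choice)   # always "mat"
print(sampled_choice)  # usually "mat", but sometimes "floor", etc.
```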
This simple "predict the next word" trick turns out to be extremely powerful. By repeating it over and over, LLMs can generate entire paragraphs, answer questions, write code, or carry on conversations.

There is an excellent discussion of this at 3blue1brown (the following will open a video at YouTube):

[![Watch the video](https://img.youtube.com/vi/LPZh9BOjkQs/hqdefault.jpg)](https://www.youtube.com/watch?v=LPZh9BOjkQs)

You have likely seen a similar mechanism on your phone: when you are writing a text, it suggests the next word using its *autocomplete* feature. Basically, what LLMs do is autocompletion on a large scale. What makes LLMs *large* is the amount of data used to train them, and the size of the models.

LLMs are trained on enormous collections of text, including books, Wikipedia, articles, and large parts of the internet. The models also contain billions (sometimes even trillions) of parameters, which allow them to capture much more subtle patterns in language. It's this large scale, as well as the underlying transformer architecture (which we will discuss below), that makes modern LLMs so much more fluent and flexible than your phone's autocomplete feature.

### How LLMs Learn: Self-supervised learning
The training process for LLMs is different from what we saw in the ML module -- there, we learned that humans provide labeled data as ground truth to help train the models. Instead, LLMs use what's called *self-supervised learning*. Because the "correct next word" is already present in every text sequence, the data effectively labels itself.

For example, in the phrase "The cat sat on the mat," the model can practice by hiding "mat" and predicting it from the context. This setup is also called *autoregression*, because the model predicts each word based on all the words before it.

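As a small illustration of how the text labels itself, here is a sketch (using naive whitespace splitting in place of a real tokenizer) that turns one sentence into several training examples:

```python
# one sentence yields many (context, next word) training pairs "for free" --
# no human annotation is needed, which is the essence of self-supervision
sentence = "The cat sat on the mat"
words = sentence.split()  # crude whitespace "tokenization", just for illustration

training_pairs = [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in training_pairs:
    print(" ".join(context), "->", target)
# The -> cat
# The cat -> sat
# ...
# The cat sat on the -> mat
```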
With this approach, you can train on billions or trillions of examples without having to manually annotate ground-truth data. Over time, the model learns facts, grammar, and reasoning patterns simply by getting better at predicting the next word.

There is one wrinkle we should cover regarding how LLMs learn before moving on to more technical matters. There are really *two* different learning modes for LLMs. First, by training on huge bodies of text in the next-word-prediction task, we end up with *foundational* (also called *pretrained* or *base*) models. These are general-purpose models that embody information from extremely broad sources.

However, foundational models on their own don't work well in special-purpose roles such as personal assistants, chatbots, etc. To get good performance on such specialized tasks, a second training step is needed, where these foundational models are *fine-tuned* on a labeled dataset that is tailored to a specific task or application.

![pretrained vs fine-tuned llm](resources/pretrained_finetuned_llm.jpg)

In other words, fine-tuning takes a foundational model and adjusts it for specific purposes, such as answering questions, following instructions, or writing in a particular style. There are various ways to do this. One, *supervised fine-tuning* (SFT), follows a more traditional ML approach where the model is given paired examples of inputs and desired outputs.

Another is [reinforcement learning from human feedback](https://www.youtube.com/watch?v=T_X4XFwKX8k) (RLHF). With RLHF, the model adapts (using reinforcement learning procedures) to produce responses that are ranked more highly by human judges.
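
To give a feel for what these two kinds of fine-tuning data look like, here is a sketch with invented examples (not from any real dataset): SFT uses input/output pairs, while RLHF uses human preference judgments between candidate responses.

```python
# supervised fine-tuning (SFT): paired examples of inputs and desired outputs
sft_examples = [
    {"input": "Summarize: Mitochondria generate most of a cell's chemical energy...",
     "output": "Mitochondria produce most of the cell's energy."},
    {"input": "Translate to French: Good morning!",
     "output": "Bonjour !"},
]

# RLHF: human judges rank candidate responses; the model is then adjusted
# (via reinforcement learning) to prefer the higher-ranked kind of answer
rlhf_example = {
    "prompt": "Explain overfitting to a beginner.",
    "response_a": "Overfitting is when a model memorizes its training data instead of learning general patterns.",
    "response_b": "It is a bias-variance trade-off issue within the empirical risk minimization framework.",
    "human_preference": "response_a",  # clearer for a beginner
}
```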
The result of fine-tuning is a set of specialized models built on top of the same foundation -- one model might become a customer service chatbot, another a medical assistant, and another a coding helper. The distinction between the general pretrained model and its fine-tuned variants is key to understanding why LLMs are so adaptable in practice.

While in this course we will not go through the process of building your own LLM, the excellent book [Build a Large Language Model from Scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch) by Sebastian Raschka walks you through this in detail using PyTorch, if you are interested. The above picture is adapted from his book.

In the next section we will dig into the details of how LLMs actually work: as we said, it isn't just that they are *large*, but also their *architecture*, that makes them so powerful.

## 3. LLM architecture
In this section we will walk step-by-step through the following simplified LLM architecture diagram, which is adapted from Chapter 2 of Raschka's excellent book:

![LLM architecture](resources/llm_architecture.jpg)

There are three main steps that are important to focus on when understanding how LLMs get so good at predicting the next word in a sequence:

- tokenization
- token embedding
- attention

### Tokenization: From raw text to token IDs
Tokenization is the process of breaking chunks of text into smaller pieces that an LLM can handle. For example, the sentence "The cat sat on the mat." might be split into tokens like ["The", " cat", " sat", " on", " the", " mat", "."]. These tokens are then mapped to unique integer IDs.

Importantly, tokens are not always whole words. To keep the vocabulary manageable, many tokenizers break rare or complex words into smaller chunks. For example, "blueberries" might become ["blue", "berries"]. This makes it possible to represent any string of text, even if it never appeared in training.

You can [play online](https://platform.openai.com/tokenizer) with a popular tokenizer, *tiktoken*, and explore how it breaks down text into parts and creates numerical IDs for each token.

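You can also try this locally in Python. Here is a minimal sketch using the `tiktoken` package (assuming `pip install tiktoken`; `cl100k_base` is one of its built-in encodings):

```python
import tiktoken

# load a built-in byte-pair-encoding tokenizer
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The cat sat on the mat.")
print(ids)                             # a list of integer token IDs
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
print(enc.decode(ids))                 # decodes back to the original sentence
```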
You can learn more about tokenization at the following resources:
- [Super Data Science video](https://www.youtube.com/watch?v=ql-XNY_qZHc)
- [Huggingface introduction]()

### Token embeddings: From token IDs to meanings
Once a sentence is carved into tokens and the token IDs are created, there is still no *meaning*. To enter the world of semantics, a neural network known as an *embedding network* converts the token IDs into *embedding vectors*. This is a crucial step in converting symbolic, linguistic data into the numeric data that PyTorch and other deep learning frameworks can use in their next-word prediction tasks.

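In PyTorch, this lookup from token IDs to vectors is typically done with an `nn.Embedding` layer. Here is a minimal sketch with toy sizes; in a real LLM the vocabulary has tens of thousands of entries, the vectors have hundreds or thousands of dimensions, and the embedding weights are learned during training:

```python
import torch
import torch.nn as nn

vocab_size = 1000  # toy vocabulary size (real tokenizers have ~50k-100k entries)
embed_dim = 8      # toy embedding dimension (real models use hundreds or more)

# a lookup table mapping each token ID to a learnable vector
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([12, 451, 7, 901])  # hypothetical IDs for four tokens
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([4, 8]) -- one embedding vector per token
```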
These embedding vectors give each token a place in *semantic space*: token IDs are just arbitrary symbols, but embeddings need to capture *meanings*. For example, "cat" and "feline" should end up closer together than either is to "dog", and all three should be closer to each other than to "car" and "bicycle", which should sit together in a region of semantic space near other vehicles. (Embeddings don't have to represent just single words, but let's pretend that they do for now.)

Two helpful resources on embeddings are [this article on tokens, vectors, and embeddings](https://medium.com/@saschametzger/what-are-tokens-vectors-and-embeddings-how-do-you-create-them-e2a3e698e037) and [this video introduction](https://www.youtube.com/watch?v=OxCpWwDCDFQ).
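
A common way to check whether two embeddings are "close" in semantic space is cosine similarity. The vectors below are hand-made toys, not embeddings from any real model, but they show the kind of comparison involved:

```python
import torch
import torch.nn.functional as F

# hand-crafted toy vectors (NOT real embeddings), chosen so that the
# animal words point in similar directions and "car" points elsewhere
toy = {
    "cat":    torch.tensor([0.90, 0.80, 0.10]),
    "feline": torch.tensor([0.85, 0.75, 0.15]),
    "dog":    torch.tensor([0.70, 0.90, 0.20]),
    "car":    torch.tensor([0.10, 0.20, 0.95]),
}

def sim(a, b):
    # cosine similarity: near 1.0 means pointing the same way (similar meaning)
    return F.cosine_similarity(toy[a], toy[b], dim=0).item()

print(sim("cat", "feline"))  # highest: closest in meaning
print(sim("cat", "dog"))     # still similar, but less so
print(sim("cat", "car"))     # lowest: far apart in semantic space
```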
How does the transformer architecture build good, context-aware embeddings? The key ingredient is *attention*, introduced in the famous "Attention Is All You Need" paper and wrapped into the transformer architecture. Attention lets each token's embedding be refined based on the other tokens around it, and it overturned the earlier conviction that you needed recurrent networks to model sequences. We will discuss attention briefly below.
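
As a preview, here is a minimal sketch of the core attention computation (scaled dot-product attention) in PyTorch. The sizes and random projection matrices are placeholders; real transformers learn these projections and add many other components around them:

```python
import torch
import torch.nn.functional as F

seq_len, dim = 5, 16              # toy sizes: 5 tokens, 16-dimensional embeddings
x = torch.randn(seq_len, dim)     # stand-in token embeddings

# in a real transformer the query/key/value projections are learned;
# random matrices are used here only to show the mechanics
W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# each token attends to every token: similarity scores -> softmax weights
scores = Q @ K.T / dim ** 0.5     # (5, 5) matrix of attention scores
weights = F.softmax(scores, dim=-1)  # each row sums to 1

contextual = weights @ V          # context-aware vectors, shape (5, 16)
print(contextual.shape)
```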
Potential links:

- [Hugging Face LLM course](https://huggingface.co/learn/llm-course/chapter1/4?fw=pt)
- StatQuest (but it is often too mathematical)
- [Word embeddings](https://www.youtube.com/watch?v=wgfSDrqYMJ4)
- [Tokenizer playground](https://platform.openai.com/tokenizer)

## 4. From language to meaning: tokenization and embedding

Blah blah blah

## 2. What Are Language Models (and LLMs)?
Show the progression from early NLP to modern LLMs and why LLMs are different.

**Topics to cover:**

- The shift to transformers and LLMs
- What makes a model 'large'? (parameters, data, compute)
- Rise of pretrained models (e.g., BERT, GPT)