
Commit a126036

LLM architecture section: tokenizers
1 parent cb5d75c commit a126036

2 files changed: +64 −22 lines changed

lessons/05_AI_intro/01_intro_nlp_llms.md

Lines changed: 64 additions & 22 deletions
@@ -22,76 +22,118 @@ The goals of NLP span a wide spectrum, including:
Ultimately, NLP aims to make human-computer interaction as intuitive as human-to-human exchanges, so it can be used in fields as diverse as healthcare diagnostics, explaining complex legal documents, and personalized education.

### NLP Methods
The field of NLP has been transformed over the past 50 years, progressing from rigid, rule-based approaches to data-driven, adaptive techniques that leverage machine learning and neural networks.

In its early days, NLP depended on rule-based systems and hand-crafted grammars to parse linguistic inputs. In that era, human experts manually encoded explicit linguistic rules into NLP systems. These methods were brittle and struggled with the variability of real-world language.

The shift to statistical methods in the late 20th century marked a turning point, incorporating probabilities to model patterns in language. This paved the way for machine learning, where algorithms learn directly from examples. Today, the dominant paradigm is deep learning. As we saw last week, deep learning is a subset of machine learning that uses neural networks with multiple layers to automatically extract features from raw data.

At the forefront of this approach are large language models (LLMs), such as those powering tools like GPT, which are pre-trained on billions of words drawn from internet-scale sources of text.

The release of ChatGPT (from OpenAI) on November 30, 2022, was a watershed moment in the history of NLP. It led to a massive surge in public awareness and usage of LLMs. This easily accessible chatbot allowed millions to interact directly with an advanced LLM. The app amassed over a million users in one day, instantly creating awareness of the power of AI. It also accelerated the adoption of AI in industries like education, customer service, and content creation.

The newest wave of LLMs inspired a surge in research, and spawned ethical debates on issues like misinformation (LLM hallucinations), job displacement, and even concerns about [conscious AI](https://www.scientificamerican.com/article/google-engineer-claims-ai-chatbot-is-sentient-why-that-matters/). For an interesting discussion of the impact of ChatGPT on the field of NLP, see the [oral history in Quanta Magazine](https://www.quantamagazine.org/when-chatgpt-broke-an-entire-field-an-oral-history-20250430/).

Since 2022, LLMs have shifted from being mostly academic curiosities to tools that attract billions in investment every year, and they are reshaping how people learn and interact with computers.

In the rest of this lesson, we will learn some of the technical basics of how LLMs like ChatGPT work, and try to demystify their operations. Ultimately, they are just another kind of machine learning model, trained to predict the next token in a string of tokens.

## 2. Large language models (LLMs)
### LLMs: autocomplete at scale
Modern LLMs are machine learning models that are trained to predict the next word in a sequence, given all the words that came before. Imagine starting a sentence, and the model is tasked with filling in the blank:

The cat sat on the ___

The model looks at the context -- the first five words -- and generates a probability distribution over possible next words. It might estimate that "mat" has a 70% chance, "floor" 20%, "sofa" 5%, and so on. It then picks the most likely candidate (or sometimes samples from that distribution to keep things more varied).

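To make this concrete, here is a minimal sketch in Python of choosing a next word from such a distribution. The probabilities are made up for illustration; they are not the output of any real model.

```python
import random

# hypothetical next-word probabilities for the context "The cat sat on the"
next_word_probs = {"mat": 0.70, "floor": 0.20, "sofa": 0.05, "rug": 0.05}

# greedy decoding: always take the single most likely word
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# sampling: draw a word at random, weighted by its probability,
# which keeps generated text more varied
sampled_choice = random.choices(
    list(next_word_probs), weights=list(next_word_probs.values()), k=1
)[0]

print(greedy_choice)   # always "mat"
print(sampled_choice)  # usually "mat", but sometimes "floor", etc.
```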
This simple "predict the next word" trick turns out to be extremely powerful. By repeating it over and over, LLMs can generate entire paragraphs, answer questions, write code, or carry on conversations.

There is an excellent discussion of this at 3blue1brown (the following will open a video at YouTube):

[![Watch the video](https://img.youtube.com/vi/LPZh9BOjkQs/hqdefault.jpg)](https://www.youtube.com/watch?v=LPZh9BOjkQs)

You have likely seen a similar mechanism on your phone: when you are writing a text, it suggests the next word using its *autocomplete* feature. Basically, what LLMs do is autocompletion on a large scale. What makes LLMs *large* is the amount of data used to train them, and the size of the models.

LLMs are trained on enormous collections of text, including books, Wikipedia, articles, and large parts of the internet. The models also contain billions (sometimes even trillions) of parameters, which allow them to capture much more subtle patterns in language. It's this large scale, as well as the underlying transformer architecture (which we will discuss below), that makes modern LLMs so much more fluent and flexible than your phone's autocomplete feature.

### How LLMs Learn: Self-supervised learning
The training process for LLMs is different from what we saw in the ML module -- there, we learned that humans provide labeled data as ground truth to help train the models. Instead, LLMs use what's called *self-supervised learning*. Because the "correct next word" is already present in every text sequence, the data effectively labels itself.

For example, in the phrase "The cat sat on the mat," the model can practice by hiding "mat" and predicting it from the context. This setup is also called *autoregression*, because the model predicts each word based on all the words before it.

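As a small illustration of how the text labels itself, here is a sketch (using naive whitespace splitting in place of a real tokenizer) that turns one sentence into several training examples:

```python
# one sentence yields many (context, next word) training pairs "for free" --
# no human annotation is needed, which is the essence of self-supervision
sentence = "The cat sat on the mat"
words = sentence.split()  # crude whitespace "tokenization", just for illustration

training_pairs = [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in training_pairs:
    print(" ".join(context), "->", target)
# The -> cat
# The cat -> sat
# ...
# The cat sat on the -> mat
```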
With this approach, you can train on billions or trillions of examples without having to manually annotate ground-truth data. Over time, the model learns facts, grammar, and reasoning patterns simply by getting better at predicting the next word.

There is one wrinkle we should cover regarding how LLMs learn before moving on to more technical matters. There are really *two* different learning modes for LLMs. First, by training on huge bodies of text in the next-word-prediction task, we end up with *foundational* (also called *pretrained* or *base*) models. These are general-purpose models that embody information from extremely broad sources.

However, foundational models on their own don't work well in special-purpose roles such as personal assistants, chatbots, etc. To get good performance on such specialized tasks, a second training step is needed, where these foundational models are *fine-tuned* on a labeled dataset that is tailored to a specific task or application.

![pretrained vs fine-tuned llm](resources/pretrained_finetuned_llm.jpg)

In other words, fine-tuning takes a foundational model and adjusts it for specific purposes, such as answering questions, following instructions, or writing in a particular style. There are various ways to do this. One, *supervised fine-tuning* (SFT), follows a more traditional ML approach where the model is given paired examples of inputs and desired outputs.

Another is [reinforcement learning from human feedback](https://www.youtube.com/watch?v=T_X4XFwKX8k) (RLHF). With RLHF, the model adapts (using reinforcement learning procedures) to produce responses that are ranked more highly by human judges.
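
To give a feel for what these two kinds of fine-tuning data look like, here is a sketch with invented examples (not from any real dataset): SFT uses input/output pairs, while RLHF uses human preference judgments between candidate responses.

```python
# supervised fine-tuning (SFT): paired examples of inputs and desired outputs
sft_examples = [
    {"input": "Summarize: Mitochondria generate most of a cell's chemical energy...",
     "output": "Mitochondria produce most of the cell's energy."},
    {"input": "Translate to French: Good morning!",
     "output": "Bonjour !"},
]

# RLHF: human judges rank candidate responses; the model is then adjusted
# (via reinforcement learning) to prefer the higher-ranked kind of answer
rlhf_example = {
    "prompt": "Explain overfitting to a beginner.",
    "response_a": "Overfitting is when a model memorizes its training data instead of learning general patterns.",
    "response_b": "It is a bias-variance trade-off issue within the empirical risk minimization framework.",
    "human_preference": "response_a",  # clearer for a beginner
}
```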
The result of fine-tuning is a set of specialized models built on top of the same foundation -- one model might become a customer service chatbot, another a medical assistant, and another a coding helper. The distinction between the general pretrained model and its fine-tuned variants is key to understanding why LLMs are so adaptable in practice.

While in this course we will not go through the process of building your own LLM, the excellent book [Build a Large Language Model from Scratch](https://www.manning.com/books/build-a-large-language-model-from-scratch) by Sebastian Raschka walks you through this in detail using PyTorch, if you are interested. The above picture is adapted from his book.

In the next section we will dig into the details of how LLMs actually work: as we said, it isn't just that they are *large*, but also their *architecture*, that makes them so powerful.

## 3. LLM architecture
In this section we will walk step-by-step through the following simplified LLM architecture diagram, which is adapted from Chapter 2 of Raschka's excellent book:

![LLM architecture](resources/llm_architecture.jpg)

There are three main steps that are important to focus on when understanding how LLMs get so good at predicting the next word in a sequence:

- tokenization
- token embedding
- attention

### Tokenization: From raw text to token IDs
Tokenization is the process of breaking chunks of text into smaller pieces that an LLM can handle. For example, the sentence "The cat sat on the mat." might be split into tokens like ["The", " cat", " sat", " on", " the", " mat", "."]. These tokens are then mapped to unique integer IDs.

Importantly, tokens are not always whole words. To keep the vocabulary manageable, many tokenizers break rare or complex words into smaller chunks. For example, "blueberries" might become ["blue", "berries"]. This makes it possible to represent any string of text, even if it never appeared in training.

You can [play online](https://platform.openai.com/tokenizer) with a popular tokenizer, *tiktoken*, and explore how it breaks down text into parts and creates numerical IDs for each token.

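You can also try this locally in Python. Here is a minimal sketch using the `tiktoken` package (assuming `pip install tiktoken`; `cl100k_base` is one of its built-in encodings):

```python
import tiktoken

# load a built-in byte-pair-encoding tokenizer
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The cat sat on the mat.")
print(ids)                             # a list of integer token IDs
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
print(enc.decode(ids))                 # decodes back to the original sentence
```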
You can learn more about tokenization at the following resources:
- [Super Data Science video](https://www.youtube.com/watch?v=ql-XNY_qZHc)
- [Huggingface introduction]()

### Token embeddings: From token IDs to meanings
Once a sentence is carved into tokens and the token IDs are created, there is still no *meaning*. To enter the world of semantics, a neural network known as an *embedding network* converts the token IDs into *embedding vectors*. This is a crucial step in converting symbolic, linguistic data into the numeric data that PyTorch and other deep learning frameworks can use in their next-word prediction tasks.

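In PyTorch, this lookup from token IDs to vectors is typically done with an `nn.Embedding` layer. Here is a minimal sketch with toy sizes; in a real LLM the vocabulary has tens of thousands of entries, the vectors have hundreds or thousands of dimensions, and the embedding weights are learned during training:

```python
import torch
import torch.nn as nn

vocab_size = 1000  # toy vocabulary size (real tokenizers have ~50k-100k entries)
embed_dim = 8      # toy embedding dimension (real models use hundreds or more)

# a lookup table mapping each token ID to a learnable vector
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([12, 451, 7, 901])  # hypothetical IDs for four tokens
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([4, 8]) -- one embedding vector per token
```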
These embedding vectors give each token a place in *semantic space*: token IDs are just arbitrary symbols, but embeddings need to capture *meanings*. For example, "cat" and "feline" should end up closer together than either is to "dog", and all three should be closer to each other than to "car" and "bicycle", which should sit together in a region of semantic space near other vehicles. (Embeddings don't have to represent just single words, but let's pretend that they do for now.)

Two helpful resources on embeddings are [this article on tokens, vectors, and embeddings](https://medium.com/@saschametzger/what-are-tokens-vectors-and-embeddings-how-do-you-create-them-e2a3e698e037) and [this video introduction](https://www.youtube.com/watch?v=OxCpWwDCDFQ).
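
A common way to check whether two embeddings are "close" in semantic space is cosine similarity. The vectors below are hand-made toys, not embeddings from any real model, but they show the kind of comparison involved:

```python
import torch
import torch.nn.functional as F

# hand-crafted toy vectors (NOT real embeddings), chosen so that the
# animal words point in similar directions and "car" points elsewhere
toy = {
    "cat":    torch.tensor([0.90, 0.80, 0.10]),
    "feline": torch.tensor([0.85, 0.75, 0.15]),
    "dog":    torch.tensor([0.70, 0.90, 0.20]),
    "car":    torch.tensor([0.10, 0.20, 0.95]),
}

def sim(a, b):
    # cosine similarity: near 1.0 means pointing the same way (similar meaning)
    return F.cosine_similarity(toy[a], toy[b], dim=0).item()

print(sim("cat", "feline"))  # highest: closest in meaning
print(sim("cat", "dog"))     # still similar, but less so
print(sim("cat", "car"))     # lowest: far apart in semantic space
```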
How does the transformer architecture build good, context-aware embeddings? The key ingredient is *attention*, introduced in the famous "Attention Is All You Need" paper and wrapped into the transformer architecture. Attention lets each token's embedding be refined based on the other tokens around it, and it overturned the earlier conviction that you needed recurrent networks to model sequences. We will discuss attention briefly below.
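
As a preview, here is a minimal sketch of the core attention computation (scaled dot-product attention) in PyTorch. The sizes and random projection matrices are placeholders; real transformers learn these projections and add many other components around them:

```python
import torch
import torch.nn.functional as F

seq_len, dim = 5, 16              # toy sizes: 5 tokens, 16-dimensional embeddings
x = torch.randn(seq_len, dim)     # stand-in token embeddings

# in a real transformer the query/key/value projections are learned;
# random matrices are used here only to show the mechanics
W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# each token attends to every token: similarity scores -> softmax weights
scores = Q @ K.T / dim ** 0.5     # (5, 5) matrix of attention scores
weights = F.softmax(scores, dim=-1)  # each row sums to 1

contextual = weights @ V          # context-aware vectors, shape (5, 16)
print(contextual.shape)
```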
Potential links:

- [Hugging Face LLM course](https://huggingface.co/learn/llm-course/chapter1/4?fw=pt)
- StatQuest (but it is often too mathematical)
- [Word embeddings](https://www.youtube.com/watch?v=wgfSDrqYMJ4)
- [Tokenizer playground](https://platform.openai.com/tokenizer)

## 4. From language to meaning: tokenization and embedding

Blah blah blah

## 2. What Are Language Models (and LLMs)?
Show the progression from early NLP to modern LLMs and why LLMs are different.

**Topics to cover:**

- The shift to transformers and LLMs
- What makes a model 'large'? (parameters, data, compute)
- Rise of pretrained models (e.g., BERT, GPT)