
Commit 29c8fb8

minor edits
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
1 parent e885791 commit 29c8fb8

File tree: 1 file changed, +1 −3 lines changed


_posts/2024-11-13-llama-into-torchtune.md

Lines changed: 1 addition & 3 deletions
@@ -105,7 +105,7 @@ The idea of knowledge distillation is that a smaller model can achieve better pe
 </td>
 </tr>
 <tr>
-<td><a href="https://arxiv.org/pdf/2404.02657">AKL</a>
+<td>AKL
 </td>
 <td>24.4
 </td>
@@ -176,15 +176,13 @@ Below is a simplified example of how knowledge distillation differs from supervi
 <pre class="highlight">
 <code>
 model = llama3_2_1b()
-teacher_model = llama3_1_8b()
 ce_loss = CrossEntropyLoss()
 kd_loss = ForwardKLLoss()

 tokens, labels = batch["tokens"], batch["labels"]
 logits = model(tokens, ...)

 loss = ce_loss(logits, labels)
-
 loss.backward()

 </code>
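The second hunk edits only the supervised fine-tuning side of the post's SFT-vs-KD comparison. As a point of reference for the knowledge-distillation side it is contrasted with, below is a minimal runnable sketch in plain PyTorch: the toy model sizes, the manual forward-KL term, and the equal weighting of the two losses are illustrative assumptions, not the post's torchtune recipe or its ForwardKLLoss API.

import torch
import torch.nn.functional as F

# Toy stand-ins for the student (llama3_2_1b) and teacher (llama3_1_8b); sizes are arbitrary.
vocab_size, hidden = 32, 16
student = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)
teacher = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)

tokens = torch.randint(0, vocab_size, (2, 8))   # batch of token ids
labels = torch.randint(0, vocab_size, (2, 8))   # next-token targets

logits = student(tokens)                        # student forward pass
with torch.no_grad():                           # teacher is frozen
    teacher_logits = teacher(tokens)

# Cross-entropy against the ground-truth labels (the supervised fine-tuning term).
ce = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))

# Forward KL between teacher and student token distributions (the distillation term).
kd = F.kl_div(
    F.log_softmax(logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)

# Equal weighting of the two terms is an illustrative choice, not the post's configuration.
loss = ce + kd
loss.backward()

The shape of the computation mirrors the post's pseudocode: a frozen teacher forward pass adds a distribution-matching term on top of the usual cross-entropy loss.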
