Hi, thank you for releasing this excellent work and the well‑organized codebase!
I’m trying to reproduce the results reported in Figure 1 of the paper, where the pre‑distillation student model (Llama‑3.2‑3B‑Base) achieves around 31% accuracy on GSM8K. However, the provided pipeline.sh only evaluates the post‑distillation student models.
When I directly evaluate the base Llama‑3.2‑3B using the same gentraces.py setup, its accuracy is much lower than 31%, possibly due to a prompt or template mismatch between the base and fine‑tuned models. Could you please share the code used to evaluate the pre‑distillation student model’s performance?
Thanks again for your great work!