Request for evaluation code of the pre-distillation student model

Hi, thank you for releasing this excellent work and the well‑organized codebase!

I’m trying to reproduce the results reported in Figure 1 of the paper, where the pre‑distillation student model (Llama‑3.2‑3B‑Base) achieves around 31% accuracy on GSM8K. However, the provided `pipeline.sh` only evaluates the post‑distillation student models.

When I directly evaluate the base Llama‑3.2‑3B using the same `gentraces.py` setup, its accuracy is much lower than 31%, possibly due to a prompt or template mismatch between the base and fine‑tuned models. Could you please share the code used to evaluate the pre‑distillation student model’s performance?
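To make the suspected mismatch concrete, here is a minimal sketch of how I would expect a base (non‑chat) model to be prompted and scored on GSM8K: a plain‑text few‑shot prompt with no chat template, plus extraction of the final number from the completion. This is only my guess, not the repo's method. `FEW_SHOT`, `build_base_prompt`, and `extract_answer` are hypothetical names that do not come from `gentraces.py`, and the exemplar is made up for illustration.

```python
import re

# Hypothetical exemplar for illustration only; a real run would use
# worked examples drawn from the GSM8K training split.
FEW_SHOT = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5. The answer is 5.",
    ),
]


def build_base_prompt(question, exemplars=FEW_SHOT):
    """Build a plain-text few-shot prompt with no chat template or special tokens."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_answer(completion):
    """Return the predicted number as a string, or None if no number is found."""
    # A base model often keeps generating new "Question:" blocks,
    # so only the text before the first one is scored.
    first = completion.split("\nQuestion:")[0]
    match = re.search(r"The answer is\s*\$?(" + _NUMBER + ")", first)
    if match:
        raw = match.group(1)
    else:
        # Fallback (an assumption): take the last number in the completion.
        numbers = re.findall(_NUMBER, first)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace(",", "")


if __name__ == "__main__":
    print(build_base_prompt("A pen costs 2 dollars. How much do 4 pens cost?"))
    print(extract_answer(" 4 * 2 = 8. The answer is 8.\n\nQuestion: Another one"))
```

If the reported 31% was obtained with a different prompt format, number of shots, or answer‑extraction rule, knowing those details would already be enough for me to reproduce it.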

Thanks again for your great work!
