Description
Hello, and thank you for your team's tremendous contribution to trustworthy inference for large models. I encountered some issues while reproducing your experiments and would appreciate your help in resolving them.
When testing the metrics with the eval method you describe, we found discrepancies between our results and yours. Specifically, when evaluating the Llama-3-8B-Instruct model trained with DPO on asqa_eval_top100_calibrated.json, the metrics did not match those reported in the paper.
The results I obtained are as follows:
My testing environment is Ubuntu 20.04, Torch 2.5.0, and vLLM 0.6.5, with 4x A100 80GB GPUs. The DPO training parameters are consistent with those in your code (only the dataset and model load paths were modified). Could you help explain the possible cause of this discrepancy?
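For completeness, here is the small stdlib-only snippet I used to confirm the installed package versions listed above (a sketch; it assumes the distributions are named `torch` and `vllm`):

```python
# Print the versions of the packages relevant to this report.
# Package names below are assumptions; adjust if your installs differ.
from importlib import metadata

for pkg in ("torch", "vllm"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```

I can attach the full `pip freeze` output as well if that would help with debugging.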