Description
Hello, and thank you for your team's tremendous contribution to trustworthy inference for large models. I encountered some issues while reproducing your experiments and would appreciate your help in resolving them.
When testing the metrics with the eval method you describe, we found discrepancies between our results and yours. Specifically, when evaluating the Llama-3-8B-Instruct model trained with DPO on asqa_eval_top100_calibrated.json, the metrics did not match those reported in the paper.
The results I obtained are as follows:
My testing environment is Ubuntu 20.04, Torch 2.5.0, and vLLM 0.6.5, with 4x A100 80GB GPUs. The DPO training parameters are consistent with those in your code (only the dataset and model load paths were modified). Could you help explain the possible cause of this discrepancy?
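For completeness, here is the small stdlib-only snippet I used to confirm the installed package versions listed above (a sketch; it assumes the distributions are named `torch` and `vllm`):

```python
# Print the versions of the packages relevant to this report.
# Package names below are assumptions; adjust if your installs differ.
from importlib import metadata

for pkg in ("torch", "vllm"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```

I can attach the full `pip freeze` output as well if that would help with debugging.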