DPO training with 'logits/chosen':nan,'logits/rejected':nan #2435
Labels: 🐛 bug · 🏋 DPO · ⏳ needs more info
System Info
The following output appears during training. I also found that the results after DPO training are actually worse than my normal fine-tuned results. Could someone more experienced take a look and tell me whether there is a problem with my training?
{'loss': 0.6931, 'grad_norm': 2.299546957015991, 'learning_rate': 1e-05, 'rewards/chosen': -0.05413971096277237, 'rewards/rejected': -0.054256755858659744, 'rewards/accuracies': 0.25, 'rewards/margins': 0.00011704633652698249, 'logps/chosen': -252.8257598876953, 'logps/rejected': -254.96127319335938, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.03}
{'loss': 0.6928, 'grad_norm': 0.8526943922042847, 'learning_rate': 2e-05, 'rewards/chosen': -0.08855850994586945, 'rewards/rejected': -0.08932790160179138, 'rewards/accuracies': 0.20000000298023224, 'rewards/margins': 0.0007693897932767868, 'logps/chosen': -262.14874267578125, 'logps/rejected': -264.9452819824219, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.05}
{'loss': 0.691, 'grad_norm': 0.769097626209259, 'learning_rate': 3e-05, 'rewards/chosen': -0.12483439594507217, 'rewards/rejected': -0.12917408347129822, 'rewards/accuracies': 0.32499998807907104, 'rewards/margins': 0.004339695908129215, 'logps/chosen': -281.1364440917969, 'logps/rejected': -284.90081787109375, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.08}
{'loss': 0.6846, 'grad_norm': 1.830174446105957, 'learning_rate': 4e-05, 'rewards/chosen': -0.324862539768219, 'rewards/rejected': -0.3433191180229187, 'rewards/accuracies': 0.32499998807907104, 'rewards/margins': 0.01845661923289299, 'logps/chosen': -304.3634338378906, 'logps/rejected': -308.95806884765625, 'logits/chosen': nan, 'logits/rejected': nan, 'epoch': 0.11}
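One check I'm planning to run (a rough sketch, assuming the problem might come from model loading rather than the trainer; the model name is just a placeholder for my actual checkpoint) is whether the base model already produces NaN logits before any DPO training:

```python
# Sketch: check whether the base model already outputs NaN logits before DPO.
# "my-base-model" is a placeholder for the checkpoint I am actually using.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-base-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print("contains NaN:", torch.isnan(logits).any().item())
```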
Information
Tasks
An officially supported task in the examples folder
Reproduction
Additional Information
I'm new to this and would appreciate some guidance from more experienced users, thanks!
What I'm trying to understand is: when fine-tuning or running DPO training, do I always need to construct input_ids and similar fields myself before training and pass them in as the dataset, or what is the general process? I'm not really sure how it works.
Here is my DPO training code
As for the code that builds the dataset: does it need to be processed (tokenized) here first before being passed in?
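For reference, this is roughly what I understood the setup should look like from the TRL docs (a minimal sketch, not my actual code; the model name and example texts are placeholders, and the exact tokenizer argument name may differ between TRL versions): the dataset only needs raw text columns prompt / chosen / rejected, and DPOTrainer tokenizes them itself.

```python
# Minimal sketch of a DPO setup as I understand it from the TRL docs.
# Placeholders: model name and the toy example texts.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-base-model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Raw text columns only -- no input_ids built by hand; DPOTrainer tokenizes internally.
train_dataset = Dataset.from_dict({
    "prompt":   ["What is the capital of France?"],
    "chosen":   ["The capital of France is Paris."],
    "rejected": ["I am not sure."],
})

training_args = DPOConfig(output_dir="dpo-output", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```

Is this the right general shape, or do I need extra preprocessing on the dataset side?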
Expected behavior
I want to know what the normal training loss should look like. Thanks!
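If I understand the DPO loss correctly, at the very start of training (when the policy still matches the reference model) the reward margin is zero, so the loss should start near $\log 2 \approx 0.693$, which matches my first log line; I just don't know whether the NaN logits mean something is still wrong:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right), \qquad \mathcal{L}_{\text{DPO}}\Big|_{\pi_\theta = \pi_{\text{ref}}} = -\log \sigma(0) = \log 2 \approx 0.6931$$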