[DPO] Adding weighted preference optimization (WPO) #2141
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@gaetanlop the tests for visual LLM models are failing, I believe due to an extra argument?
@kashif my bad, should be fixed now
Hey @kashif, I don’t think the failing test is related to the PR. Am I right?
No, see #2147 (comment) |
Looks like #2131 has been merged. I will check if there are any modifications required. |
What does this PR do?
Adding WPO to the DPOTrainer. The paper is on the list of accepted papers at EMNLP 2024 and introduces an elegant method to simulate on-policy data while training on off-policy preference pairs. It works by prioritizing the pairs that are most probable under the current policy during optimization. Implementation-wise, it is independent of the loss function being used, as it only provides a weighting for each sample pair during loss computation (see the sketch below). My implementation is based on the authors' implementation in https://github.com/wzhouad/WPO/blob/main/scripts/run_wpo.py
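To illustrate the idea, here is a minimal sketch of the weighting step, not the exact code added in this PR: the per-pair DPO losses are scaled by the policy's (length-normalized) probability of generating both the chosen and the rejected response, so pairs the current policy is likely to produce dominate the update. Function and argument names are illustrative.

```python
import torch


def wpo_weighted_loss(
    dpo_losses: torch.Tensor,            # per-pair DPO losses, shape (batch,)
    policy_chosen_logps: torch.Tensor,   # length-averaged log p(chosen | prompt) under the policy
    policy_rejected_logps: torch.Tensor, # length-averaged log p(rejected | prompt) under the policy
) -> torch.Tensor:
    """Reweight per-pair preference losses by the policy's probability of the pair.

    Pairs that the current policy is more likely to generate receive a weight
    closer to 1, approximating on-policy sampling from off-policy data.
    """
    with torch.no_grad():
        # weight = p_theta(chosen) * p_theta(rejected), computed in log space
        # and clamped so it never exceeds 1; no gradient flows through the weight
        weights = torch.clamp(
            torch.exp(policy_chosen_logps + policy_rejected_logps), max=1.0
        )
    return (dpo_losses * weights).mean()
```

Because the weighting only rescales each pair's loss, it composes with any of the existing DPO loss variants rather than replacing them.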
Before submitting
Did you read the contributor guideline, Pull Request section?
Who can review?
@kashif