-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix: use prime reward manager for GSPO script #4809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix the configuration in the GSPO example scripts by switching from the dapo reward manager to prime, which is standard for math tasks. While the changes correctly remove the dapo-specific configurations, they incorrectly omit setting prime as the new reward manager. This causes the scripts to fall back to the default naive manager, which contradicts the stated goal of the PR and could lead to incorrect experimental results. I have provided critical feedback with code suggestions to fix this issue in all affected scripts.
I am having trouble creating individual review comments. Click here to see my feedback.
examples/gspo_trainer/run_qwen30b_gspo.sh (154-161)
According to the pull request description, the intent is to switch to reward_manager=prime for math tasks. However, this change removes the REWARD_CONFIG variable entirely. The script still references $REWARD_CONFIG on line 185, which will now be an empty string. This causes the trainer to use the default reward manager (naive), not prime, which is incorrect for this math-related task. To correctly implement the intended change, you should modify REWARD_CONFIG instead of removing it.
# ===================================== Reward =====================================
REWARD_CONFIG="reward_model.reward_manager=prime"
examples/gspo_trainer/test_gspo_3b_math.sh (176-181)
The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior. You should explicitly set the reward manager to prime.
reward_model.reward_manager=prime \
examples/gspo_trainer/test_gspo_3b_math_slurm.sh (180-185)
The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior. You should explicitly set the reward manager to prime. Additionally, the reward_manager variable defined on line 60 is now unused and should be removed for code cleanliness.
reward_model.reward_manager=prime \
examples/gspo_trainer/test_gspo_qwen30b_a3b_ep.sh (152-157)
The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior for this math-related script. You should explicitly set the reward manager to prime.
reward_model.reward_manager=prime \
|
Could you explain why Why not use |
Actually, in my original script, I removed reward_manager to ensure the GSPO recipe remained minimal. I've added prime back here specifically because gemini_code_review suggested it as the standard configuration. |
|
Gemini only provides suggested modifications, which may be incorrect. You need to at least ensure that the updated script is runnable. The Prime Reward Manager seems to be no longer fully runnable due to dependency issues. |
Description
This PR fixes the configuration in the GSPO example script. Previously, it used
reward_manager=dapowith overlong penalties, which seems to be a leftover from DAPO experiments.I have updated it to use
reward_manager=prime(standard for Math tasks) and removed the DAPO-specificoverlong_bufferconfigurations to ensure the script runs standard GSPO logic correctly.Changes
reward_managerfromdapoto the default setting.overlong_buffer_cfgparameters.