Skip to content

Conversation

@aub123
Copy link

@aub123 aub123 commented Jan 6, 2026

Description

This PR fixes the configuration in the GSPO example script. Previously, it used reward_manager=dapo with overlong penalties, which seems to be a leftover from DAPO experiments.

I have updated it to use reward_manager=prime (standard for Math tasks) and removed the DAPO-specific overlong_buffer configurations to ensure the script runs standard GSPO logic correctly.

Changes

  • Changed reward_manager from dapo to the default setting.
  • Removed overlong_buffer_cfg parameters.

@CLAassistant
Copy link

CLAassistant commented Jan 6, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix the configuration in the GSPO example scripts by switching from the dapo reward manager to prime, which is standard for math tasks. While the changes correctly remove the dapo-specific configurations, they incorrectly omit setting prime as the new reward manager. This causes the scripts to fall back to the default naive manager, which contradicts the stated goal of the PR and could lead to incorrect experimental results. I have provided critical feedback with code suggestions to fix this issue in all affected scripts.

I am having trouble creating individual review comments. Click here to see my feedback.

examples/gspo_trainer/run_qwen30b_gspo.sh (154-161)

critical

According to the pull request description, the intent is to switch to reward_manager=prime for math tasks. However, this change removes the REWARD_CONFIG variable entirely. The script still references $REWARD_CONFIG on line 185, which will now be an empty string. This causes the trainer to use the default reward manager (naive), not prime, which is incorrect for this math-related task. To correctly implement the intended change, you should modify REWARD_CONFIG instead of removing it.

# ===================================== Reward =====================================
REWARD_CONFIG="reward_model.reward_manager=prime"

examples/gspo_trainer/test_gspo_3b_math.sh (176-181)

critical

The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior. You should explicitly set the reward manager to prime.

    reward_model.reward_manager=prime \

examples/gspo_trainer/test_gspo_3b_math_slurm.sh (180-185)

critical

The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior. You should explicitly set the reward manager to prime. Additionally, the reward_manager variable defined on line 60 is now unused and should be removed for code cleanliness.

    reward_model.reward_manager=prime \

examples/gspo_trainer/test_gspo_qwen30b_a3b_ep.sh (152-157)

critical

The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior for this math-related script. You should explicitly set the reward manager to prime.

    reward_model.reward_manager=prime \

@yyDing1
Copy link
Collaborator

yyDing1 commented Jan 6, 2026

Could you explain why reward_manager=prime is the standard for gspo?

Why not use reward_manager=naive or reward_manager=dapo with enable_overlong_buffer=False (which some of the scripts do).

@aub123
Copy link
Author

aub123 commented Jan 6, 2026

Could you explain why reward_manager=prime is the standard for gspo?

Why not use reward_manager=naive or reward_manager=dapo with enable_overlong_buffer=False (which some of the scripts do).

Actually, in my original script, I removed reward_manager to ensure the GSPO recipe remained minimal. I've added prime back here specifically because gemini_code_review suggested it as the standard configuration.

@yyDing1
Copy link
Collaborator

yyDing1 commented Jan 8, 2026

Gemini only provides suggested modifications, which may be incorrect. You need to at least ensure that the updated script is runnable. The Prime Reward Manager seems to be no longer fully runnable due to dependency issues.

@wuxibin89 wuxibin89 closed this Jan 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants