Fix: use prime reward manager for GSPO script #4809

aub123 · 2026-01-06T06:36:48Z

Description

This PR fixes the configuration in the GSPO example script. Previously, it used reward_manager=dapo with overlong penalties, which seems to be a leftover from DAPO experiments.

I have updated it to use reward_manager=prime (standard for Math tasks) and removed the DAPO-specific overlong_buffer configurations to ensure the script runs standard GSPO logic correctly.

Changes

Changed reward_manager from dapo to the default setting.
Removed overlong_buffer_cfg parameters.

CLAassistant · 2026-01-06T06:36:56Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request aims to fix the configuration in the GSPO example scripts by switching from the dapo reward manager to prime, which is standard for math tasks. While the changes correctly remove the dapo-specific configurations, they incorrectly omit setting prime as the new reward manager. This causes the scripts to fall back to the default naive manager, which contradicts the stated goal of the PR and could lead to incorrect experimental results. I have provided critical feedback with code suggestions to fix this issue in all affected scripts.

I am having trouble creating individual review comments, so my feedback is collected below.

examples/gspo_trainer/run_qwen30b_gspo.sh (154-161)

According to the pull request description, the intent is to switch to reward_manager=prime for math tasks. However, this change removes the REWARD_CONFIG variable entirely. The script still references $REWARD_CONFIG on line 185, which will now be an empty string. This causes the trainer to use the default reward manager (naive), not prime, which is incorrect for this math-related task. To correctly implement the intended change, you should modify REWARD_CONFIG instead of removing it.

# ===================================== Reward =====================================
REWARD_CONFIG="reward_model.reward_manager=prime"
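The failure mode described here follows from ordinary shell expansion: an unset variable expands to nothing, so the override drops out of the command line without any error. A minimal sketch of that behavior, where `build_args` is a hypothetical stand-in for the trainer launch line and `first.option` / `last.option` are placeholder arguments, not real trainer options:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the trainer invocation: it only counts the
# arguments that would be passed, so the effect of $REWARD_CONFIG is visible.
build_args() {
    # $REWARD_CONFIG is deliberately unquoted so that an empty or unset
    # value contributes no argument at all.
    set -- first.option=a $REWARD_CONFIG last.option=b
    echo "$#"
}

unset REWARD_CONFIG
count_without=$(build_args)   # the override is silently absent

REWARD_CONFIG="reward_model.reward_manager=prime"
count_with=$(build_args)      # the override is passed through

echo "args without REWARD_CONFIG: $count_without"
echo "args with REWARD_CONFIG:    $count_with"
```

Adding `set -u` near the top of a launch script turns a reference to an unset variable into a hard error, which would surface this kind of leftover reference immediately.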

examples/gspo_trainer/test_gspo_3b_math.sh (176-181)

The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior. You should explicitly set the reward manager to prime.

    reward_model.reward_manager=prime \

examples/gspo_trainer/test_gspo_3b_math_slurm.sh (180-185)

The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior. You should explicitly set the reward manager to prime. Additionally, the reward_manager variable defined on line 60 is now unused and should be removed for code cleanliness.

    reward_model.reward_manager=prime \

examples/gspo_trainer/test_gspo_qwen30b_a3b_ep.sh (152-157)

The PR description states the goal is to use reward_manager=prime for math tasks. By removing these lines, the script will fall back to the default reward manager (naive), which is not the intended behavior for this math-related script. You should explicitly set the reward manager to prime.

    reward_model.reward_manager=prime \
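Because the same omission is reported for several scripts, a quick check can list which scripts in a directory never set the override. A minimal sketch, assuming the override always appears literally as `reward_model.reward_manager=` (a script that builds the key some other way would be reported as missing); `find_missing_reward_manager` is a hypothetical helper name, not part of the repository:

```shell
#!/usr/bin/env bash
# Print every *.sh file in the given directory that never mentions
# reward_model.reward_manager= explicitly.
find_missing_reward_manager() {
    dir=$1
    for script in "$dir"/*.sh; do
        [ -e "$script" ] || continue   # the directory contained no *.sh files
        # grep -q exits non-zero when the fixed string is absent
        if ! grep -qF 'reward_model.reward_manager=' "$script"; then
            echo "$script"
        fi
    done
}

# Example: find_missing_reward_manager examples/gspo_trainer
```

This only confirms that the key is present. It does not confirm that the chosen reward manager can actually run.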

yyDing1 · 2026-01-06T07:22:59Z

Could you explain why reward_manager=prime is the standard for GSPO?

Why not use reward_manager=naive, or reward_manager=dapo with enable_overlong_buffer=False (as some of the scripts do)?

aub123 · 2026-01-06T08:00:39Z

> Could you explain why reward_manager=prime is the standard for GSPO?
>
> Why not use reward_manager=naive, or reward_manager=dapo with enable_overlong_buffer=False (as some of the scripts do)?

Actually, in my original script I removed reward_manager to keep the GSPO recipe minimal. I added prime back here only because gemini-code-assist suggested it as the standard configuration.

yyDing1 · 2026-01-08T04:07:43Z

Gemini only provides suggested modifications, which may be incorrect. You need to at least ensure that the updated script is runnable. The prime reward manager no longer seems to be fully runnable because of dependency issues.

gemini-code-assist bot reviewed Jan 6, 2026

Fix: use prime reward manager for GSPO script

d984674

aub123 force-pushed the main branch from 542c9e6 to d984674 · January 6, 2026 06:44

aub123 closed this Jan 6, 2026

aub123 reopened this Jan 6, 2026

wuxibin89 closed this Jan 9, 2026
