support GSPO-token #3820

Open · wants to merge 5 commits into main

Conversation

@hjh0119 commented Jul 31, 2025

Support for GSPO-token, as described in the GSPO paper, Section 4.3.

Related issue: #3811

GSPO
$w_{i}^{\mathrm{GSPO}} = \left[ \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right]^{\frac{1}{|y_i|}} = \exp\left(\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_{\theta}(y_{i, t} \mid x, y_{i, <t})}{\pi_{\theta_{\mathrm{old}}}(y_{i, t} \mid x, y_{i, <t})}\right)$

GSPO-token
$w_{i, t}^{\mathrm{GSPO\text{-}token}} = \mathrm{sg}\left[w_i^{\mathrm{GSPO}}\right] \cdot \frac{\pi_{\theta}(y_{i, t} \mid x, y_{i, < t})}{\mathrm{sg}\left[\pi_{\theta}(y_{i, t} \mid x, y_{i, < t})\right]}$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient (detach) operation.

💡 NOTE: GSPO-token enables support for fine-grained (token-level) advantages.
However, with the current advantage computation, all tokens within a sequence share the same advantage value; in that case GSPO and GSPO-token are theoretically equivalent, as shown in equations (11) and (18) of the paper.
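
For concreteness, here is a minimal PyTorch sketch of the GSPO-token weight computed in log space. The tensor names (per_token_logps, old_per_token_logps, completion_mask, all shaped batch × seq_len) are assumptions for illustration, not the exact trainer variables.

import torch

def gspo_token_log_weights(per_token_logps, old_per_token_logps, completion_mask):
    # Per-token log ratio: log πθ(y_t | ·) - log πθ_old(y_t | ·)
    log_ratio = per_token_logps - old_per_token_logps
    # Length-normalized sequence-level log weight, i.e. log of w_i^GSPO above
    seq_log_w = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
    # GSPO-token in log space: sg[log w_i^GSPO] + log πθ(y_t) - sg[log πθ(y_t)]
    return seq_log_w.detach().unsqueeze(-1) + per_token_logps - per_token_logps.detach()

Numerically, every token's weight equals w_i^GSPO (the second factor is identically 1); the two formulations differ only in how gradients propagate, which is where token-level advantages can enter.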

@LeonEricsson (Collaborator):

Thanks for this. Since GSPO-token is a generalized version of vanilla GSPO, I suggest we fully transition to GSPO-token instead of supporting both versions. Consequently, we would rename/remove importance_sampling_level, as both methods operate at the token level.

elif self.importance_sampling_level == 'sequence_token':
    # GSPO-token: sg[si(θ)] * πθ(yi,t)/sg[πθ(yi,t)]
    seq_level_log_weight = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
    seq_level_log_weight = seq_level_log_weight.unsqueeze(-1).detach()  # Stop gradient

Collaborator (review comment), suggested change:

-   seq_level_log_weight = seq_level_log_weight.unsqueeze(-1).detach()  # Stop gradient
+   seq_level_log_weight = seq_level_log_weight.detach().unsqueeze(-1)  # Stop gradient

Contributor (review comment):

(log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)

This op is common across GSPO and GSPO-token; it would be good to have a single variable pointing to this value under an if condition like

if self.importance_sampling_level != 'token':

@hjh0119 (Author):

Makes sense. So shall we move the invalid-value check for importance_sampling_level into the model parameter initialization?
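
For illustration, a standalone sketch (not the actual patch) of how the shared term could be factored out; the argument and variable names are assumed from the snippet above, and the trailing check is the invalid-value guard that could instead move to parameter initialization:

def compute_log_importance_weights(level, per_token_logps, old_per_token_logps, completion_mask):
    log_ratio = per_token_logps - old_per_token_logps
    if level == 'token':
        return log_ratio  # GRPO: one weight per token
    # Shared by GSPO ('sequence') and GSPO-token ('sequence_token')
    seq_level_log_weight = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
    if level == 'sequence':
        return seq_level_log_weight.unsqueeze(-1)  # GSPO: broadcast over tokens
    if level == 'sequence_token':
        # GSPO-token: sg[s_i(θ)] * πθ(y_t) / sg[πθ(y_t)], in log space
        return seq_level_log_weight.detach().unsqueeze(-1) + per_token_logps - per_token_logps.detach()
    raise ValueError(f"Unknown importance_sampling_level: {level}")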

@hjh0119 (Author) commented Aug 2, 2025

> Thanks for this. Since GSPO-token is a generalized version of vanilla GSPO, I suggest we fully transition to GSPO-token instead of supporting both versions. Consequently, we would rename/remove importance_sampling_level, as both methods operate at the token level.

Agreed to keep GSPO-token. Should we retain this parameter for compatibility with previous usage, or introduce an additional parameter instead? Which is better?
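
For reference, a hypothetical usage sketch of the two options being discussed. GRPOConfig and importance_sampling_level already exist in TRL v0.20 (with 'token' and 'sequence'), 'sequence_token' is the value this PR would add, and the extra argument name in option B is made up purely for illustration.

from trl import GRPOConfig

# Option A: reuse the existing argument and add the new level (what this PR does)
config = GRPOConfig(
    output_dir="grpo-output",                    # illustrative
    importance_sampling_level="sequence_token",  # 'token' | 'sequence' | 'sequence_token'
)

# Option B (hypothetical): keep the old values for backward compatibility and
# expose GSPO-token behind a separate, yet-to-be-named flag, e.g.:
# config = GRPOConfig(importance_sampling_level="sequence", gspo_token=True)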

@hjh0119 (Author) commented Aug 5, 2025

@qgallouedec @lewtun @edbeeching @kashif If there are any concerns or suggestions, please feel free to let me know. Thank you very much in advance.

@LeonEricsson (Collaborator) commented Aug 5, 2025

> Agreed to keep GSPO-token. Should we retain this parameter for compatibility with previous usage, or introduce an additional parameter instead? Which is better?

IMO it should be removed; however, since it has already been published as part of TRL v0.20, we may need to keep it for backward compatibility. I can't speak to that myself, so I'll leave it to someone else to decide.
