Conversation

@yurekami
Contributor

Summary

  • Add logging of extra/intermediate rewards to training metrics
  • This improves training monitoring by providing visibility into component reward values

Motivation

As discussed in #2279, logging extra rewards like format_reward and acc_reward is useful for training monitoring. This enables users to track intermediate reward components in addition to the final combined score.

Changes

Added logging of extra reward metrics after the actor update:

# Log extra reward metrics (e.g., format_reward, acc_reward) for training monitoring
if reward_extra_infos_dict:
    for key, values in reward_extra_infos_dict.items():
        if key != "score" and len(values) > 0:
            metrics[f"critic/rewards/{key}"] = np.mean(values)

Usage

Reward functions can return extra info via the reward_extra_info dict:

return {
    'score': final_reward,
    'format_reward': format_score,
    'acc_reward': accuracy_score,
    # ... other intermediate rewards
}

These will be logged as:

  • critic/rewards/format_reward
  • critic/rewards/acc_reward
  • etc.
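
For illustration only, a minimal custom reward function that produces these keys could look like the sketch below. The signature and the tag-based checks are hypothetical placeholders, not part of this change:

def compute_score(solution_str, ground_truth):
    # Hypothetical format check: the answer must be wrapped in <answer> tags.
    text = solution_str.strip()
    format_score = 1.0 if text.startswith("<answer>") and text.endswith("</answer>") else 0.0

    # Hypothetical accuracy check: exact string match against the ground truth.
    answer = text.removeprefix("<answer>").removesuffix("</answer>").strip()
    acc_score = 1.0 if answer == ground_truth else 0.0

    # Final combined score plus intermediate components for logging.
    return {
        "score": 0.1 * format_score + 0.9 * acc_score,
        "format_reward": format_score,
        "acc_reward": acc_score,
    }

# compute_score("<answer>42</answer>", "42") -> {"score": 1.0, "format_reward": 1.0, "acc_reward": 1.0}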

Closes #4545

Test plan

  • Verify extra rewards appear in wandb/tensorboard when using a reward function with extra info
  • Confirm metrics are logged correctly with multiple extra reward keys

🤖 Generated with Claude Code

Add logging of extra/intermediate rewards (e.g., format_reward,
acc_reward) to the training metrics. This improves training
monitoring by providing visibility into component reward values
in addition to the final combined score.

The extra rewards are logged under the "critic/rewards/{key}" namespace,
matching the pattern used for other critic metrics.

Reward functions can return extra info via the "reward_extra_info" dict:
{'score': reward, 'format_reward': value_1, 'acc_reward': value_2}

Closes volcengine#4545

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@gemini-code-assist bot left a comment


Code Review

This pull request adds logging for extra reward metrics from custom reward functions. The change is straightforward, but I've identified a potential robustness issue where non-numeric extra reward information could crash the training loop. I've suggested a fix to handle this gracefully by adding a try-except block.

if reward_extra_infos_dict:
    for key, values in reward_extra_infos_dict.items():
        if key != "score" and len(values) > 0:
            metrics[f"critic/rewards/{key}"] = np.mean(values)
Severity: high

The current implementation assumes that all values in reward_extra_infos_dict are numeric and can be averaged by np.mean. However, custom reward functions can return non-numeric extra information (e.g., strings, dictionaries), which would cause np.mean to raise a TypeError or ValueError, crashing the training loop. To make this more robust, it's better to wrap the np.mean call in a try-except block to gracefully handle non-numeric metrics.

Suggested change
metrics[f"critic/rewards/{key}"] = np.mean(values)
try:
metrics[f"critic/rewards/{key}"] = np.mean(values)
except (TypeError, ValueError):
# Not all extra reward info may be numeric, so we skip what can't be averaged.
pass

Collaborator


Could you rename this to training/reward/xxx?
