RFC 006 proposal: standard multi-component rewards with scalar compatibility #623

@adithya-s-k

Summary

I want to propose RFC 006 to standardize multi-component rewards while keeping the existing scalar reward contract intact.

Problem

Today we optimize on a single scalar reward, which works for training but makes reward design and debugging harder in complex agentic tasks (e.g., coding with safety/process constraints). Teams can manually fold such constraints into one score, but there is no common schema for inspecting and comparing the individual components across environments.

Proposal

Introduce a standard component representation (e.g., success/progress/penalty/shaping/binary) emitted in observation metadata, while preserving observation.reward as the scalar optimization target.
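To make the shape of the proposal concrete, here is a minimal sketch of what a component schema could look like. Everything in it is an assumption for illustration: the `RewardComponent` and `ComponentKind` names, the `reward_components` metadata key, the stand-in `Observation` class, and the weighted-sum aggregation are all hypothetical and not taken from the draft RFC. The only parts grounded in the proposal are the component categories and the fact that `observation.reward` stays a scalar.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class ComponentKind(str, Enum):
    """Component categories named in the proposal."""
    SUCCESS = "success"
    PROGRESS = "progress"
    PENALTY = "penalty"
    SHAPING = "shaping"
    BINARY = "binary"


@dataclass
class RewardComponent:
    """Hypothetical record for one named piece of the reward."""
    name: str
    kind: ComponentKind
    value: float
    weight: float = 1.0


@dataclass
class Observation:
    """Stand-in for the project's observation type (assumed fields only)."""
    reward: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def attach_components(obs: Observation, components: List[RewardComponent]) -> Observation:
    """Set the scalar reward and record its breakdown in metadata.

    Assumption: the scalar is a weighted sum of component values. The RFC
    may define a different aggregation, or leave it to the environment.
    """
    obs.reward = sum(c.weight * c.value for c in components)
    obs.metadata["reward_components"] = [
        {"name": c.name, "kind": c.kind.value, "value": c.value, "weight": c.weight}
        for c in components
    ]
    return obs


# Example: a coding task that passed its tests but ran an unsafe command.
obs = attach_components(
    Observation(),
    [
        RewardComponent("tests_passed", ComponentKind.SUCCESS, 1.0),
        RewardComponent("unsafe_command", ComponentKind.PENALTY, -0.5, weight=0.5),
    ],
)
print(obs.reward)  # scalar optimization target, unchanged contract
for component in obs.metadata["reward_components"]:
    print(component)  # per-component breakdown for debugging
```

In this sketch a trainer that only reads `obs.reward` is unaffected, while diagnostics tooling can read the metadata list to compare components across environments.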

Why now

This is a core reward-contract extension that touches environment authoring, observability, and training diagnostics. It should be aligned via RFC before implementation.

Draft RFC

A full draft has been prepared in rfcs/006-multi-component-rewards.md and submitted via PR for review.

Related RFCs

  • RFC 001 (abstractions / boundaries)
  • RFC 002 (environment-computed rewards)
  • RFC 004 (rubrics)
  • RFC 005 (harness boundary and reward opacity)
