[RFC] Add RFC 004: Rubrics #237

Merged
Darktex merged 4 commits into main from rfc-004-reward-pipelines on Jan 26, 2026

Conversation

@Darktex

@Darktex Darktex commented Dec 5, 2025

Collaborator

Summary

Introduces RFC 004: Rubric System—a composable, nn.Module-inspired abstraction for computing rewards in OpenEnv environments.

Key design decisions:

  • Environment authors implement __init__ and forward(action, observation) -> float
  • Child rubrics auto-register when assigned as attributes
  • Sync forward() + async evaluate() for batch parallelism (no async knowledge required from authors)
  • Hooks for observability without polluting the base class
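To make the decisions above concrete, here is a minimal sketch of how attribute-based auto-registration and forward hooks could fit together. It is an illustration, not the RFC's implementation: the method names `register_forward_hook` and `named_rubrics`, the hook signature, and the toy rubrics are all assumptions borrowed from the nn.Module analogy.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


class Rubric:
    """Hypothetical base class; children register when assigned as attributes."""

    def __init__(self) -> None:
        # Bypass our own __setattr__ while creating the bookkeeping containers.
        object.__setattr__(self, "_children", {})
        object.__setattr__(self, "_hooks", [])

    def __setattr__(self, name: str, value: Any) -> None:
        if isinstance(value, Rubric):
            self._children[name] = value  # auto-registration
        object.__setattr__(self, name, value)

    def forward(self, action: Any, observation: Any) -> float:
        raise NotImplementedError

    def __call__(self, action: Any, observation: Any) -> float:
        score = float(self.forward(action, observation))
        for hook in self._hooks:  # observability lives outside forward()
            hook(self, action, observation, score)
        return score

    def register_forward_hook(self, hook: Callable[..., None]) -> None:
        self._hooks.append(hook)

    def named_rubrics(self, prefix: str = "") -> Iterator[Tuple[str, "Rubric"]]:
        for name, child in self._children.items():
            path = f"{prefix}.{name}" if prefix else name
            yield path, child
            yield from child.named_rubrics(path)


class LengthRubric(Rubric):
    def forward(self, action, observation):
        return 1.0 if len(action) <= 10 else 0.0


class KeywordRubric(Rubric):
    def forward(self, action, observation):
        return 1.0 if "ok" in action else 0.0


class Combined(Rubric):
    def __init__(self) -> None:
        super().__init__()  # must run before children are assigned
        self.length = LengthRubric()
        self.keyword = KeywordRubric()

    def forward(self, action, observation):
        return 0.5 * self.length(action, observation) + 0.5 * self.keyword(action, observation)


log: List[Tuple[str, float]] = []
rubric = Combined()
for path, child in rubric.named_rubrics():
    child.register_forward_hook(lambda r, a, o, s, p=path: log.append((p, s)))
score = rubric("ok then", None)
```

In this sketch the author only writes `__init__` and `forward`; per-child scores are collected by hooks attached from the outside.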

What's included:

  • Rubric base class with PyTorch-like API
  • Container rubrics: Sequential, Gate, WeightedSum, RubricList, LLMJudge
  • evaluate_batch() helper for parallel evaluation in training loops

Design informed by:

  • RLTF (hierarchical gating)
  • Rubicon (multi-dimensional rubrics)
  • AdvancedIF (all-or-nothing aggregation)
  • OpenRubrics (gatekeeper mechanism)

Test plan

  • Review RFC for clarity and completeness
  • Gather feedback on API design
  • Validate against existing environment implementations

@meta-cla meta-cla Bot added the CLA Signed label Dec 5, 2025
@Darktex Darktex marked this pull request as draft December 5, 2025 21:39
Builds on RFC 003 to standardize reward computation:
- Rubrics are INTERNAL (not exposed to agent)
- Rubrics USE MCP to call external services (LLM judges, DBs)
- Observation.metadata["reward_components"] for per-rubric logging
- POST /config endpoint for dynamic reward shaping
- SDK helpers: RubricComposer, RewardNormalizer

Key insight: MCP tools are for agent actions (RFC 003).
Rubrics use MCP internally for external RPC, but are not tools themselves.
@Darktex Darktex force-pushed the rfc-004-reward-pipelines branch from 602b743 to 6c2acb0 Compare December 17, 2025 01:24
@Darktex Darktex force-pushed the rfc-004-reward-pipelines branch from 6c2acb0 to d6b15e9 Compare December 17, 2025 01:24
@Darktex Darktex changed the title [RFC] Add RFC 004: Reward Pipelines [RFC] Add RFC 004: Rubrics Dec 17, 2025
@Darktex Darktex marked this pull request as ready for review December 17, 2025 01:25
@Darktex

Darktex commented Jan 13, 2026

Collaborator Author

Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.


Now I have enough context. Let me produce the alignment review.


Alignment Review: RFC 004 - Rubric System for Reward Computation

Executive Summary

Tier 1 (Bugs/Lint): ✅ PASS - No critical issues found
Tier 2 (Alignment): ⚠️ CONCERNS - Several architectural alignment questions requiring human review


Tier 1: Critical Issues

Automated Checks

Lint: Cannot verify (uv not installed in environment)
Debug Code: No debug code in RFC file
Security: N/A (documentation only)
Syntax: Markdown is well-formed

Code Quality

No implementation changes - This is an RFC only, no code to review


Tier 2: Alignment Concerns

🔴 ALIGNMENT FLAG #1: Dual API Boundary Violation Risk

Invariant at risk: Dual API boundary (INVARIANTS.md:41-58)

The concern: The RFC does not explicitly address how rubrics interact with the dual API model (MCP for agents vs Gym-like for infrastructure). Specifically:

  1. Line 72 of RFC states: "The API is modeled after PyTorch's nn.Module" and shows forward(action, observation) -> float
  2. The action and observation parameters in rubric signatures could create confusion about whether rubrics:
    • Run server-side during step() (correct per RFC 002)
    • Might accidentally expose simulation control to agents via MCP
    • Could be called by agents directly (violation)

Evidence from codebase:

  • INVARIANTS.md:50 states: "The Gym-like API is NOT accessible to the agent being trained"
  • RFC 002:161 confirms "Rewards are computed inside the environment"
  • Existing RewardProvider in envs/textarena_env/rewards.py:14 shows server-side pattern

Required clarification:

  • Add explicit statement that rubrics execute server-side only during step()
  • Confirm rubrics NEVER exposed via MCP to agents
  • Clarify that action and observation in rubric signatures are the server-side objects from the Gym-like API

Suggested reviewer: @Darktex (RFC 001 author, dual API design owner)


🟡 ALIGNMENT FLAG #2: Unclear Relationship to Existing Reward Infrastructure

Principle at risk: Design for LLMs, Minimize lifecycle deltas (PRINCIPLES.md)

The concern: The RFC introduces a new Rubric abstraction but doesn't clearly explain:

  1. Migration path from existing code:

    • RewardProvider protocol exists in envs/textarena_env/rewards.py:14
    • Different signature: compute(action, observation) -> Dict[str, float]
    • Should existing code migrate to Rubric?
  2. Relationship to mentioned "Transform pipeline":

    • INVARIANTS.md:65-67: "Reward computation must stay inside environment boundary. External reward augmentation uses Transform pipeline."
    • RFC doesn't mention Transform pipeline at all
    • Are rubrics the implementation of rewards, or transforms, or both?
  3. Where does Observation.reward come from?

    • RFC 002:177 shows reward: Union[bool, int, float, None] in base Observation
    • If rubrics return float, how does that become the observation's reward field?
    • Is there a default aggregation, or must environments explicitly wire it?

Required clarification:

  • State whether Rubric replaces RewardProvider or complements it
  • Explain relationship to Transform pipeline mentioned in invariants
  • Show example of environment wiring rubric output to Observation.reward
  • Consider adding "Migration from Existing Patterns" section
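For reference, the wiring requested above could look roughly like the following sketch, in which the environment owns the rubric and writes its output into the observation during step(). Only the `reward` field type is taken from the RFC 002 line quoted above; the `Observation` fields `text`, `done`, and `metadata`, the `EchoEnvironment` class, and the `brevity` rubric are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Union


@dataclass
class Observation:
    # `reward` mirrors the type quoted from RFC 002; other fields are assumed.
    text: str = ""
    done: bool = False
    reward: Union[bool, int, float, None] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def brevity(action: str, observation: Observation) -> float:
    """Toy rubric: reward short actions."""
    return 1.0 if len(action) < 20 else 0.0


class EchoEnvironment:
    """Hypothetical environment that holds its rubric as an attribute."""

    def __init__(self, rubric: Callable[[str, Observation], float]) -> None:
        self.rubric = rubric

    def step(self, action: str) -> Observation:
        obs = Observation(text=action.upper(), done=True)
        # The reward is computed server-side, inside step(), before the
        # observation leaves the environment boundary.
        obs.reward = self.rubric(action, obs)
        return obs


env = EchoEnvironment(rubric=brevity)
obs = env.step("hello")
```

With this shape the client only ever sees a fully formed observation and never calls the rubric itself.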

🟡 ALIGNMENT FLAG #3: Async Design May Contradict "Rewards Inside Environment"

Invariant at risk: Rewards in environment (INVARIANTS.md:64-67)

The concern: The RFC proposes:

async def train():
    # All 64 samples evaluated concurrently via thread pool
    rewards = await evaluate_batch(rubric, actions, batch.observations)

This pattern suggests:

  1. Rubrics might evaluate outside the environment's step() call
  2. Training code directly calls rubrics with actions/observations
  3. Violates "reward computation must stay inside environment boundary"

Contradiction with RFC 002:

  • RFC 002:163: "Rewards are computed inside the environment and returned as part of the observation"
  • RFC 002:188: "Clients receive fully-formed observations with rewards already computed"

Current codebase pattern (envs/textarena_env/rewards.py):

# Rewards computed server-side during step()
def compute(*, action: TextArenaAction, observation: TextArenaObservation) -> Dict[str, float]

Possible interpretations:

  1. Misunderstanding: The example is pseudocode showing internal environment implementation
  2. New pattern: RFC 004 changes RFC 002's decision to allow external reward computation
  3. Batch optimization: Rubrics run inside environment but support batch evaluation internally

Required clarification:

  • Confirm rubrics only called from within Environment.step() implementation
  • Clarify that training code example is showing environment's internal logic, not client code
  • Or: Explicitly state this RFC amends RFC 002 to allow external reward computation
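Interpretation 3 can be sketched as follows: the author writes a synchronous forward(), and an async evaluate() wrapper offloads it to a thread so that several evaluations overlap. This is a guess at the mechanism rather than the RFC's code. `SlowRubric` is hypothetical, `evaluate_batch` is a hypothetical stand-in for the RFC helper of the same name, and `asyncio.to_thread` requires Python 3.9 or later. Nothing here settles where the call happens; the assumption is that it runs in environment-side code.

```python
import asyncio
import time
from typing import Any, List, Sequence


class SlowRubric:
    """Hypothetical rubric whose forward() blocks, e.g. on a remote judge."""

    def forward(self, action: str, observation: Any) -> float:
        time.sleep(0.05)  # stand-in for blocking I/O
        return float(len(action))

    async def evaluate(self, action: str, observation: Any) -> float:
        # Run the blocking forward() in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.forward, action, observation)


async def evaluate_batch(
    rubric: SlowRubric, actions: Sequence[str], observations: Sequence[Any]
) -> List[float]:
    # gather() preserves input order, so rewards line up with actions.
    coros = (rubric.evaluate(a, o) for a, o in zip(actions, observations))
    return list(await asyncio.gather(*coros))


rewards = asyncio.run(evaluate_batch(SlowRubric(), ["a", "bb", "ccc"], [None, None, None]))
```

Rubric authors never touch async code in this arrangement; only the wrapper does.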

🟡 ALIGNMENT FLAG #4: Missing Integration with Environment API

Concern: The RFC doesn't show how rubrics integrate with the existing Environment base class.

Questions:

  1. Does Environment base class get a new rubric attribute?
  2. Does framework auto-call rubrics, or must each environment manually invoke?
  3. How do rubrics access environment internal state (mentioned in RFC 002:168)?

Expected in RFC:

  • Proposed changes to Environment class signature
  • Example showing environment using rubric in its step() implementation
  • Pattern for rubrics accessing environment state if needed

Evidence: RFC 002:82 shows current Environment interface with no rubric field


🟢 ALIGNMENT FLAG #5: LLMJudge Container Pattern (Informational)

Not a violation, but worth discussing:

The RFC proposes LLMJudge as a container rubric that "calls an LLM endpoint via configured MCP service" (line 105-109).

Architectural question:

  • If LLMJudge calls out to external LLM via MCP, how does this interact with:
    • Container isolation (environments run in Docker)
    • The "MCP for agents" boundary (are rubrics using agent tools?)
    • Network access restrictions (INVARIANTS.md:32)

Likely fine if:

  • LLMJudge is server-side only (per Flag #1)
  • Calls external service, not agent's MCP tools
  • Environment container has network access configured

Suggested: Add note about LLMJudge network requirements in deployment section


Minor Issues (Not Blocking)

Documentation Gaps

  1. No "What Changes" for Environment base class (line 151-159)

    • Table shows new classes but not modifications to existing Environment API
    • Should show proposed Environment constructor/step signature changes
  2. State serialization unclear (line 176)

    • state_dict() / load_state_dict() mentioned but not explained
    • What state needs serialization? Weights? Config? Both?
    • Are rubrics expected to be stateful (like _WordleRewardProvider._guess_history)?
  3. Missing "Open Questions" about environment state access

    • RFC 002:168 says rewards can use "internal state transitions"
    • How do rubrics access environment's internal state?
    • Or are they limited to (action, observation) pairs?

Examples Could Be Clearer

  1. CodeRubric example (line 73-88)

    • Shows composition but mixes gating logic with weighted sum
    • Could show simpler examples first, then build to complex
  2. Training loop example (line 139-147)

    • Appears to show client-side reward computation (contradicts RFC 002)
    • Should clarify this is environment's internal implementation

Positive Observations

Strong academic grounding - Appendix with literature patterns is excellent
Composable design - nn.Module analogy works well
Clear alternatives section - Shows thoughtful design exploration
Hook-based observability - Avoids polluting base class
Sync-first API - Good ergonomics for simple cases


Recommendations

Before Merge

MUST address:

  1. Flag #1: Add explicit statement about server-side execution and MCP boundary
  2. Flag #3: Clarify async example doesn't violate "rewards inside environment"
  3. Flag #4: Show integration with Environment base class

SHOULD address:

  1. Flag #2: Explain relationship to existing RewardProvider and Transform pipeline

Consider for Follow-up

  • Separate RFC for Transform pipeline (mentioned in invariants but not documented)
  • Migration guide for existing environments using RewardProvider pattern
  • Examples of rubrics accessing environment internal state

Summary

This RFC proposes a well-designed abstraction for reward computation. However, it has unclear boundaries with existing architectural decisions:

  1. Dual API model - Needs explicit confirmation rubrics are server-side only
  2. Rewards-in-environment invariant - Async example suggests external computation
  3. Environment integration - Missing concrete API changes to Environment class

These are likely documentation gaps rather than design flaws, but they risk implementation that violates established invariants. A human reviewer (suggest @Darktex) should confirm the intended architecture before implementation begins.

Recommendation: ⚠️ REQUEST CHANGES - Address alignment flags before merge


Automated review by Claude Code | Learn more about OpenEnv's agentic workflow

@Darktex Darktex requested a review from burtenshaw January 16, 2026 19:40
@burtenshaw

burtenshaw commented Jan 20, 2026

Collaborator

Thanks for taking the time to write this out @Darktex, this is an important RFC that's very close to challenges we're facing at HF.

If I understand correctly, this RFC defines a single abstraction for computing and handling envs' scores. The result of which is:

  • envs that are easier to build well
  • envs that are more compatible with training libraries/systems

Main high level points

  • IMO, container rubrics and the auto-registration of children are the most crucial element, because they let the trainer get a group of rewards and handle them atomically alongside other use-case knowledge.
  • For multi-task envs (Browsergym tasks, Atari games, TextArena), should there be per-task rubrics or a shared rubric with task-aware logic? An example showing AtariRubric dispatching to PongRubric vs BreakoutRubric would clarify. Also, should we automatically select the task rubric?
  • From the trainer side, the batch evaluation features are impactful. Two questions: (1) Should environments expose batch state via the session entity? (2) How does the trainer signal batch boundaries to the env — new API, or reuse existing session lifecycle?
  • One thing that isn't clear to me: evaluate_batch() appears to run client-side, which contradicts "rewards inside environment." Is evaluate_batch() intended for internal env implementation, offline eval, both, or neither?

… batch example

- Add "Rubrics Live Inside Environments" section clarifying env.rubric access
- Fix batch evaluation example to show EnvPool pattern (not standalone rubric)
- Add RubricDict container for multi-task environments
- Add get_rubric(path) method for nested access
- Update Environment base class to require rubric attribute
- Add Implementation Plan section with stacked PR breakdown
- Add "One env = one trajectory" principle to PRINCIPLES.md

Addresses feedback from @burtenshaw and alignment review.
@Darktex

Darktex commented Jan 22, 2026

Collaborator Author

Thanks for the thoughtful review @burtenshaw! You raised several important points that helped clarify the design.

Here's how I'd address them (and since Claude is doing all the work now, the PR was updated with what it's done :D )

1. Multi-task environments (per-task vs shared rubrics)

I think that the PyTorch inspiration keeps working well here. I simply added RubricDict — a new container analogous to nn.ModuleDict for keyed dispatch:

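class AtariRubric(Rubric):
    def __init__(self):
        super().__init__()  # assumed to be required before assigning children, as with nn.Module
        self.games = RubricDict({
            "pong": PongRubric(),
            "breakout": BreakoutRubric(),
        })

    def forward(self, action, obs) -> float:
        # Dispatch to the rubric for the game named in the observation
        return self.games[obs.game_id](action, obs)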

Access: env.rubric.games["pong"]

Also added get_rubric(path) for dot-separated path access (like PyTorch's get_submodule()), see next point.

2. Container Rubrics + auto-registration

Agreed this is crucial. The RFC already covers this, but we've now made it clearer that trainers can access the full hierarchy via env.rubric.get_rubric(path) for atomic handling. Basically same as .get_submodule() in PyTorch.
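For readers who want to see the path lookup spelled out, here is a minimal sketch of how get_rubric(path) could walk registered children. The internals are assumptions (a `_children` dict filled on attribute assignment, and a `RubricDict` that registers its entries under their keys); only the names `RubricDict`, `get_rubric`, and the Atari example come from this thread.

```python
from typing import Any, Dict


class Rubric:
    """Hypothetical base class that tracks child rubrics by attribute name."""

    def __init__(self) -> None:
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name: str, value: Any) -> None:
        if isinstance(value, Rubric):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def get_rubric(self, path: str) -> "Rubric":
        """Resolve a dot-separated path such as "games.pong"."""
        node = self
        for part in path.split("."):
            if part not in node._children:
                raise KeyError(f"no child rubric {part!r} in path {path!r}")
            node = node._children[part]
        return node


class RubricDict(Rubric):
    """Keyed container; each entry is registered as a child under its key."""

    def __init__(self, rubrics: Dict[str, Rubric]) -> None:
        super().__init__()
        self._children.update(rubrics)

    def __getitem__(self, key: str) -> Rubric:
        return self._children[key]


class PongRubric(Rubric):
    def forward(self, action, obs) -> float:
        return 1.0


class BreakoutRubric(Rubric):
    def forward(self, action, obs) -> float:
        return 0.5


class AtariRubric(Rubric):
    def __init__(self) -> None:
        super().__init__()
        self.games = RubricDict({"pong": PongRubric(), "breakout": BreakoutRubric()})


rubric = AtariRubric()
pong = rubric.get_rubric("games.pong")
```

Here `rubric.get_rubric("games.pong")` and `rubric.games["pong"]` return the same object, which is what lets a trainer address one component of the hierarchy by name.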

3. Batch evaluation interface

The examples were super confusing here, so I revised them. Thanks for catching this. Let's make sure we align on the design!

Basically, we stick to one env = one trajectory. Environments don't support multiplexing. Batching is achieved by stacking environments:

envs = EnvPool("code_env", n=64)
observations = await envs.step_batch(actions)  # can be made cleverer w/ asyncio, but in the end you gotta await
rewards = [obs.reward for obs in observations]

Individual envs don't need batch awareness — EnvPool handles orchestration. This is now documented in both the RFC and PRINCIPLES.md.
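A toy version of that orchestration might look like the sketch below. It differs from the snippet above in one labeled way: this hypothetical `EnvPool` takes a factory callable rather than an environment name, so that the example is self-contained. `ToyEnv` and `Obs` are stand-ins, and the real EnvPool API may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Obs:
    reward: float


class ToyEnv:
    """Stand-in environment: one instance handles exactly one trajectory."""

    async def step(self, action: str) -> Obs:
        await asyncio.sleep(0)  # placeholder for a network round trip
        return Obs(reward=float(len(action)))  # reward computed inside the env


class EnvPool:
    """Hypothetical pool that fans one action out to each environment."""

    def __init__(self, make_env: Callable[[], ToyEnv], n: int) -> None:
        self.envs = [make_env() for _ in range(n)]

    async def step_batch(self, actions: Sequence[str]) -> List[Obs]:
        if len(actions) != len(self.envs):
            raise ValueError("need exactly one action per environment")
        # Each env steps independently; gather() keeps results in env order.
        steps = (env.step(a) for env, a in zip(self.envs, actions))
        return list(await asyncio.gather(*steps))


async def main() -> List[float]:
    envs = EnvPool(ToyEnv, n=3)
    observations = await envs.step_batch(["a", "bb", "ccc"])
    return [obs.reward for obs in observations]


rewards = asyncio.run(main())
```

The pool is the only batch-aware piece; each environment still sees a single action and returns a single observation with its reward already attached.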

4. evaluate_batch() location

This is another byproduct of the bad example, which is now fixed. The RFC now shows that:

  • Rubrics live inside environments (env.rubric)
  • Rewards are computed server-side during step()
  • Training code never instantiates rubrics directly

@burtenshaw burtenshaw left a comment

Collaborator

For me, this RFC is ready for the light. I'm looking forward to what the community has to say.

@Darktex Darktex merged commit 66533ba into main Jan 26, 2026
4 checks passed
@Darktex Darktex deleted the rfc-004-reward-pipelines branch January 26, 2026 20:38

2 participants