[RFC] Add RFC 004: Rubrics by Darktex · Pull Request #237 · huggingface/OpenEnv

Darktex · 2025-12-05T21:39:24Z

Summary

Introduces RFC 004: Rubric System—a composable, nn.Module-inspired abstraction for computing rewards in OpenEnv environments.

Key design decisions:

Environment authors implement __init__ and forward(action, observation) -> float
Child rubrics auto-register when assigned as attributes
Sync forward() + async evaluate() for batch parallelism (no async knowledge required from authors)
Hooks for observability without polluting the base class
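A minimal sketch of how the design decisions above could fit together, assuming a PyTorch-style `__setattr__` override for child registration. The `_children` and `_forward_hooks` attributes, `register_forward_hook`, and the two example rubrics are hypothetical names chosen for illustration; the RFC's actual implementation may differ.

```python
from typing import Any, Callable, List


class Rubric:
    """Hypothetical minimal base class; not the RFC's actual implementation."""

    def __init__(self) -> None:
        # Bypass our own __setattr__ while creating the bookkeeping containers.
        object.__setattr__(self, "_children", {})
        object.__setattr__(self, "_forward_hooks", [])

    def __setattr__(self, name: str, value: Any) -> None:
        # Child rubrics register themselves when assigned as attributes.
        if isinstance(value, Rubric):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def register_forward_hook(
        self, hook: Callable[["Rubric", Any, Any, float], None]
    ) -> None:
        self._forward_hooks.append(hook)

    def forward(self, action: Any, observation: Any) -> float:
        raise NotImplementedError

    def __call__(self, action: Any, observation: Any) -> float:
        score = float(self.forward(action, observation))
        # Hooks observe the result without the base class knowing about logging.
        for hook in self._forward_hooks:
            hook(self, action, observation, score)
        return score


class LengthRubric(Rubric):
    def forward(self, action, observation) -> float:
        return 1.0 if len(action) <= 10 else 0.0


class CombinedRubric(Rubric):
    def __init__(self) -> None:
        super().__init__()
        self.length = LengthRubric()  # auto-registered as child "length"

    def forward(self, action, observation) -> float:
        return self.length(action, observation)


rubric = CombinedRubric()
seen: List[float] = []
rubric.length.register_forward_hook(lambda r, a, o, score: seen.append(score))
reward = rubric("print(1)", observation=None)
```

In this sketch the hook attached to the child fires even though only the parent was called, which is the kind of per-component observability the hooks bullet describes.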

What's included:

Rubric base class with PyTorch-like API
Container rubrics: Sequential, Gate, WeightedSum, RubricList, LLMJudge
evaluate_batch() helper for parallel evaluation in training loops
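As a rough illustration of the "sync forward(), async batch" idea, the sketch below runs a synchronous scoring function over a batch with `asyncio.to_thread`. The `evaluate_batch` signature here and the `exact_match` scorer are assumptions made for the example, not the RFC's definitions; where such a helper should run (inside the environment or in training code) is discussed later in this thread.

```python
import asyncio
from typing import Any, Callable, List, Sequence

# Stand-in for a rubric: any callable (action, observation) -> float.
RubricFn = Callable[[Any, Any], float]


async def evaluate_batch(
    rubric: RubricFn, actions: Sequence[Any], observations: Sequence[Any]
) -> List[float]:
    """Run a synchronous rubric over a batch using worker threads."""
    if len(actions) != len(observations):
        raise ValueError("actions and observations must have the same length")
    # asyncio.to_thread runs each sync call in the default thread pool.
    tasks = [asyncio.to_thread(rubric, a, o) for a, o in zip(actions, observations)]
    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*tasks))


def exact_match(action: str, observation: dict) -> float:
    # The rubric author writes plain synchronous code.
    return 1.0 if action == observation["expected"] else 0.0


rewards = asyncio.run(
    evaluate_batch(
        exact_match,
        ["a", "b", "c"],
        [{"expected": "a"}, {"expected": "x"}, {"expected": "c"}],
    )
)
```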

Design informed by:

RLTF (hierarchical gating)
Rubicon (multi-dimensional rubrics)
AdvancedIF (all-or-nothing aggregation)
OpenRubrics (gatekeeper mechanism)
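To make the terms in parentheses concrete, here is one plausible reading of gating, all-or-nothing aggregation, and weighted sums as plain function combinators. This is only an illustration of the patterns as named above; it does not reproduce the cited works' methods, and `gate`, `all_or_nothing`, and `weighted_sum` are hypothetical helper names.

```python
from typing import Callable, Sequence, Tuple

Score = Callable[[str], float]


def gate(gatekeeper: Score, scorer: Score, threshold: float = 1.0) -> Score:
    """Return the scorer's value only if the gatekeeper reaches the threshold."""
    def gated(response: str) -> float:
        if gatekeeper(response) < threshold:
            return 0.0
        return scorer(response)
    return gated


def all_or_nothing(criteria: Sequence[Score]) -> Score:
    """Score 1.0 only when every criterion is fully satisfied."""
    def combined(response: str) -> float:
        return 1.0 if all(c(response) >= 1.0 for c in criteria) else 0.0
    return combined


def weighted_sum(parts: Sequence[Tuple[float, Score]]) -> Score:
    """Linear combination of component scores."""
    def combined(response: str) -> float:
        return sum(weight * score(response) for weight, score in parts)
    return combined


def non_empty(r: str) -> float:
    return 1.0 if r.strip() else 0.0


def mentions_test(r: str) -> float:
    return 1.0 if "test" in r else 0.0


def brevity(r: str) -> float:
    return max(0.0, 1.0 - len(r) / 100)


gated_brevity = gate(non_empty, brevity)
strict = all_or_nothing([non_empty, mentions_test])
blended = weighted_sum([(0.75, non_empty), (0.25, mentions_test)])
```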

Test plan

Review RFC for clarity and completeness
Gather feedback on API design
Validate against existing environment implementations

Builds on RFC 003 to standardize reward computation:
- Rubrics are INTERNAL (not exposed to agent)
- Rubrics USE MCP to call external services (LLM judges, DBs)
- Observation.metadata["reward_components"] for per-rubric logging
- POST /config endpoint for dynamic reward shaping
- SDK helpers: RubricComposer, RewardNormalizer

Key insight: MCP tools are for agent actions (RFC 003). Rubrics use MCP internally for external RPC, but are not tools themselves.
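The `reward_components` idea from this commit message could look roughly like the following: compute each component once, keep the per-component values for logging, and return the weighted total as the scalar reward. The helper name `score_with_components` and the two toy components are assumptions for illustration only.

```python
from typing import Callable, Dict, Tuple

Component = Callable[[str], float]


def score_with_components(
    components: Dict[str, Tuple[float, Component]], action: str
) -> Tuple[float, Dict[str, float]]:
    """Return (weighted total, per-component scores) for one action."""
    per_component = {name: fn(action) for name, (_, fn) in components.items()}
    total = sum(weight * per_component[name] for name, (weight, _) in components.items())
    return total, per_component


components = {
    # Toy checks standing in for real rubrics.
    "ends_with_semicolon": (0.5, lambda a: 1.0 if a.endswith(";") else 0.0),
    "short": (0.5, lambda a: 1.0 if len(a) < 20 else 0.0),
}

total, parts = score_with_components(components, "return 0")
# The breakdown travels with the observation; only the total is the reward.
metadata = {"reward_components": parts}
```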

Darktex · 2026-01-13T02:16:38Z

Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.

Now I have enough context. Let me produce the alignment review.

Alignment Review: RFC 004 - Rubric System for Reward Computation

Executive Summary

Tier 1 (Bugs/Lint): ✅ PASS - No critical issues found
Tier 2 (Alignment): ⚠️ CONCERNS - Several architectural alignment questions requiring human review

Tier 1: Critical Issues

Automated Checks

✅ Lint: Cannot verify (uv not installed in environment)
✅ Debug Code: No debug code in RFC file
✅ Security: N/A (documentation only)
✅ Syntax: Markdown is well-formed

Code Quality

✅ No implementation changes - This is an RFC only, no code to review

Tier 2: Alignment Concerns

🔴 ALIGNMENT FLAG #1: Dual API Boundary Violation Risk

Invariant at risk: Dual API boundary (INVARIANTS.md:41-58)

The concern: The RFC does not explicitly address how rubrics interact with the dual API model (MCP for agents vs Gym-like for infrastructure). Specifically:

Line 72 of RFC states: "The API is modeled after PyTorch's nn.Module" and shows forward(action, observation) -> float
The action and observation parameters in rubric signatures could create confusion about whether rubrics:
- Run server-side during step() (correct per RFC 002)
- Might accidentally expose simulation control to agents via MCP
- Could be called by agents directly (violation)

Evidence from codebase:

INVARIANTS.md:50 states: "The Gym-like API is NOT accessible to the agent being trained"
RFC 002:161 confirms "Rewards are computed inside the environment"
Existing RewardProvider in envs/textarena_env/rewards.py:14 shows server-side pattern

Required clarification:

Add explicit statement that rubrics execute server-side only during step()
Confirm rubrics NEVER exposed via MCP to agents
Clarify that action and observation in rubric signatures are the server-side objects from the Gym-like API

Suggested reviewer: @Darktex (RFC 001 author, dual API design owner)

🟡 ALIGNMENT FLAG #2: Unclear Relationship to Existing Reward Infrastructure

Principle at risk: Design for LLMs, Minimize lifecycle deltas (PRINCIPLES.md)

The concern: The RFC introduces a new Rubric abstraction but doesn't clearly explain:

Migration path from existing code:
- RewardProvider protocol exists in envs/textarena_env/rewards.py:14
- Different signature: compute(action, observation) -> Dict[str, float]
- Should existing code migrate to Rubric?
Relationship to mentioned "Transform pipeline":
- INVARIANTS.md:65-67: "Reward computation must stay inside environment boundary. External reward augmentation uses Transform pipeline."
- RFC doesn't mention Transform pipeline at all
- Are rubrics the implementation of rewards, or transforms, or both?
Where does Observation.reward come from?:
- RFC 002:177 shows reward: Union[bool, int, float, None] in base Observation
- If rubrics return float, how does that become the observation's reward field?
- Is there a default aggregation, or must environments explicitly wire it?

Required clarification:

State whether Rubric replaces RewardProvider or complements it
Explain relationship to Transform pipeline mentioned in invariants
Show example of environment wiring rubric output to Observation.reward
Consider adding "Migration from Existing Patterns" section
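One way the requested wiring example could look, as a sketch only: the environment owns the rubric and assigns its output to the observation's reward inside step(). The `Observation` dataclass below mirrors the reward field type quoted from RFC 002 above, while `EchoEnvironment`, `match_target`, and the other fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Union


@dataclass
class Observation:
    # reward type follows the RFC 002 quote above; other fields are illustrative.
    text: str = ""
    done: bool = False
    reward: Union[bool, int, float, None] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


class EchoEnvironment:
    """Hypothetical environment that owns its rubric and calls it in step()."""

    def __init__(self, rubric: Callable[[str, Observation], float]) -> None:
        self.rubric = rubric
        self._target = "hello"

    def step(self, action: str) -> Observation:
        obs = Observation(text=action, done=True)
        obs.metadata["target"] = self._target
        # Reward is computed server-side, before the observation leaves the env.
        obs.reward = self.rubric(action, obs)
        return obs


def match_target(action: str, obs: Observation) -> float:
    return 1.0 if action == obs.metadata["target"] else 0.0


env = EchoEnvironment(rubric=match_target)
hit = env.step("hello")
miss = env.step("goodbye")
```

The caller of step() only ever sees a fully formed observation, which matches the "rewards inside the environment" statements quoted from RFC 002.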

🟡 ALIGNMENT FLAG #3: Async Design May Contradict "Rewards Inside Environment"

Invariant at risk: Rewards in environment (INVARIANTS.md:64-67)

The concern: The RFC proposes:

async def train():
    # All 64 samples evaluated concurrently via thread pool
    rewards = await evaluate_batch(rubric, actions, batch.observations)

This pattern suggests:

Rubrics might evaluate outside the environment's step() call
Training code directly calls rubrics with actions/observations
Violates "reward computation must stay inside environment boundary"

Contradiction with RFC 002:

RFC 002:163: "Rewards are computed inside the environment and returned as part of the observation"
RFC 002:188: "Clients receive fully-formed observations with rewards already computed"

Current codebase pattern (envs/textarena_env/rewards.py):

# Rewards computed server-side during step()
def compute(*, action: TextArenaAction, observation: TextArenaObservation) -> Dict[str, float]: ...

Possible interpretations:

Misunderstanding: The example is pseudocode showing internal environment implementation
New pattern: RFC 004 changes RFC 002's decision to allow external reward computation
Batch optimization: Rubrics run inside environment but support batch evaluation internally

Required clarification:

Confirm rubrics only called from within Environment.step() implementation
Clarify that training code example is showing environment's internal logic, not client code
Or: Explicitly state this RFC amends RFC 002 to allow external reward computation

🟡 ALIGNMENT FLAG #4: Missing Integration with Environment API

Concern: The RFC doesn't show how rubrics integrate with the existing Environment base class.

Questions:

Does Environment base class get a new rubric attribute?
Does framework auto-call rubrics, or must each environment manually invoke?
How do rubrics access environment internal state (mentioned in RFC 002:168)?

Expected in RFC:

Proposed changes to Environment class signature
Example showing environment using rubric in its step() implementation
Pattern for rubrics accessing environment state if needed

Evidence: RFC 002:82 shows current Environment interface with no rubric field

🟢 ALIGNMENT FLAG #5: LLMJudge Container Pattern (Informational)

Not a violation, but worth discussing:

The RFC proposes LLMJudge as a container rubric that "calls an LLM endpoint via configured MCP service" (line 105-109).

Architectural question:

If LLMJudge calls out to external LLM via MCP, how does this interact with:
- Container isolation (environments run in Docker)
- The "MCP for agents" boundary (are rubrics using agent tools?)
- Network access restrictions (INVARIANTS.md:32)

Likely fine if:

LLMJudge is server-side only (per Flag #1)
Calls external service, not agent's MCP tools
Environment container has network access configured

Suggested: Add note about LLMJudge network requirements in deployment section

Minor Issues (Not Blocking)

Documentation Gaps

No "What Changes" for Environment base class (line 151-159)
- Table shows new classes but not modifications to existing Environment API
- Should show proposed Environment constructor/step signature changes
State serialization unclear (line 176)
- state_dict() / load_state_dict() mentioned but not explained
- What state needs serialization? Weights? Config? Both?
- Are rubrics expected to be stateful (like _WordleRewardProvider._guess_history)?
Missing "Open Questions" about environment state access
- RFC 002:168 says rewards can use "internal state transitions"
- How do rubrics access environment's internal state?
- Or are they limited to (action, observation) pairs?

Examples Could Be Clearer

CodeRubric example (line 73-88)
- Shows composition but mixes gating logic with weighted sum
- Could show simpler examples first, then build to complex
Training loop example (line 139-147)
- Appears to show client-side reward computation (contradicts RFC 002)
- Should clarify this is environment's internal implementation

Positive Observations

✅ Strong academic grounding - Appendix with literature patterns is excellent
✅ Composable design - nn.Module analogy works well
✅ Clear alternatives section - Shows thoughtful design exploration
✅ Hook-based observability - Avoids polluting base class
✅ Sync-first API - Good ergonomics for simple cases

Recommendations

Before Merge

MUST address:

Flag #1: Add explicit statement about server-side execution and MCP boundary
Flag #3: Clarify async example doesn't violate "rewards inside environment"
Flag #4: Show integration with Environment base class

SHOULD address:
Flag #2: Explain relationship to existing RewardProvider and Transform pipeline

Consider for Follow-up

Separate RFC for Transform pipeline (mentioned in invariants but not documented)
Migration guide for existing environments using RewardProvider pattern
Examples of rubrics accessing environment internal state

Summary

This RFC proposes a well-designed abstraction for reward computation. However, it has unclear boundaries with existing architectural decisions:

Dual API model - Needs explicit confirmation rubrics are server-side only
Rewards-in-environment invariant - Async example suggests external computation
Environment integration - Missing concrete API changes to Environment class

These are likely documentation gaps rather than design flaws, but they risk implementation that violates established invariants. A human reviewer (suggest @Darktex) should confirm the intended architecture before implementation begins.

Recommendation: ⚠️ REQUEST CHANGES - Address alignment flags before merge

Automated review by Claude Code | Learn more about OpenEnv's agentic workflow

burtenshaw · 2026-01-20T10:22:54Z

Thanks for taking the time to write this out @Darktex, this is an important RFC that's very close to challenges we're facing at HF.

If I understand correctly, this RFC defines a single abstraction for computing and handling envs' scores. The result of which is:

envs that are easier to build well
envs that are more compatible with training libraries/systems

Main high level points

IMO, Container Rubrics and the auto-registration of children are the most crucial elements, because they let the trainer get a group of rewards and handle them atomically alongside other use-case knowledge.
For multi-task envs (BrowserGym tasks, Atari games, TextArena), should there be per-task rubrics or a shared rubric with task-aware logic? An example showing AtariRubric dispatching to PongRubric vs BreakoutRubric would clarify. Also, should we automatically select the task rubric?
From the trainer side, the batch evaluation features are impactful. Two questions: (1) Should environments expose batch state via the session entity? (2) How does the trainer signal batch boundaries to the env: a new API, or reuse of the existing session lifecycle?
One thing that isn't clear to me is whether evaluate_batch() runs client-side, which would contradict "rewards inside environment." Is evaluate_batch() intended for internal env implementation, offline eval, both, or neither?

@burtenshaw

… batch example
- Add "Rubrics Live Inside Environments" section clarifying env.rubric access
- Fix batch evaluation example to show EnvPool pattern (not standalone rubric)
- Add RubricDict container for multi-task environments
- Add get_rubric(path) method for nested access
- Update Environment base class to require rubric attribute
- Add Implementation Plan section with stacked PR breakdown
- Add "One env = one trajectory" principle to PRINCIPLES.md

Addresses feedback from @burtenshaw and alignment review.

Darktex · 2026-01-22T00:20:24Z

Thanks for the thoughtful review @burtenshaw! You raised several important points that helped clarify the design.

Here's how I'd address them (and since Claude is doing all the work now, the PR has already been updated with what it's done :D)

1. Multi-task environments (per-task vs shared rubrics)

I think that the PyTorch inspiration keeps working well here. I simply added RubricDict — a new container analogous to nn.ModuleDict for keyed dispatch:

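class AtariRubric(Rubric):
    def __init__(self):
        super().__init__()  # assumed to be required before assigning children, as with nn.Module
        # RubricDict maps a game id to that game's rubric
        self.games = RubricDict({
            "pong": PongRubric(),
            "breakout": BreakoutRubric(),
        })

    def forward(self, action, obs) -> float:
        # Dispatch to the rubric for the game this observation belongs to
        return self.games[obs.game_id](action, obs)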

Access: env.rubric.games["pong"]

Also added get_rubric(path) for dot-separated path access (like PyTorch's get_submodule()), see next point.
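For readers who want to see the lookup spelled out, here is a self-contained sketch of dot-separated path access over auto-registered children. The `Rubric` and `RubricDict` classes below are simplified stand-ins written for this example (including the `_children` attribute and the `Constant` rubric), not the code in the PR.

```python
from typing import Any, Dict


class Rubric:
    """Hypothetical minimal base class, just enough to show path lookup."""

    def __init__(self) -> None:
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name: str, value: Any) -> None:
        if isinstance(value, Rubric):
            self._children[name] = value  # auto-register child rubrics
        object.__setattr__(self, name, value)

    def get_rubric(self, path: str) -> "Rubric":
        # Walk one registered child per dot-separated segment.
        node = self
        for part in path.split("."):
            if part not in node._children:
                raise KeyError(f"no child rubric {part!r} in path {path!r}")
            node = node._children[part]
        return node

    def forward(self, action: Any, obs: Any) -> float:
        raise NotImplementedError

    def __call__(self, action: Any, obs: Any) -> float:
        return float(self.forward(action, obs))


class RubricDict(Rubric):
    def __init__(self, rubrics: Dict[str, Rubric]) -> None:
        super().__init__()
        self._children.update(rubrics)  # keys double as path segments

    def __getitem__(self, key: str) -> Rubric:
        return self._children[key]


class Constant(Rubric):
    def __init__(self, value: float) -> None:
        super().__init__()
        self.value = value

    def forward(self, action, obs) -> float:
        return self.value


class AtariRubric(Rubric):
    def __init__(self) -> None:
        super().__init__()
        self.games = RubricDict({"pong": Constant(1.0), "breakout": Constant(0.5)})

    def forward(self, action, obs) -> float:
        return self.games[obs["game_id"]](action, obs)


root = AtariRubric()
pong = root.get_rubric("games.pong")
breakout_reward = root(None, {"game_id": "breakout"})
```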

2. Container Rubrics + auto-registration

Agreed this is crucial. The RFC already covers this, but we've now made it clearer that trainers can access the full hierarchy via env.rubric.get_rubric(path) for atomic handling. Basically same as .get_submodule() in PyTorch.

3. Batch evaluation interface

The examples were super confusing here, so I revised them. Thanks for catching this. Let's make sure we align on the design!

Basically, we stick to one env = one trajectory. Environments don't support multiplexing. Batching is achieved by stacking environments:

# Runs inside an async function (the snippet uses await)
envs = EnvPool("code_env", n=64)
observations = await envs.step_batch(actions)  # could be made cleverer with asyncio, but in the end you have to await
rewards = [obs.reward for obs in observations]

Individual envs don't need batch awareness — EnvPool handles orchestration. This is now documented in both the RFC and PRINCIPLES.md.
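A toy version of the stacking pattern, to show that the pool is the only batch-aware piece. Everything here is an assumption for illustration: `CounterEnv` stands in for a real environment client, and this `EnvPool` constructor takes only `n` rather than an environment name.

```python
import asyncio
from typing import List, Sequence


class CounterEnv:
    """Toy single-trajectory env; stands in for a real environment client."""

    def __init__(self, env_id: int) -> None:
        self.env_id = env_id
        self.steps = 0

    async def step(self, action: int) -> dict:
        await asyncio.sleep(0)  # placeholder for network I/O
        self.steps += 1
        # The reward arrives already computed, as part of the observation.
        return {"env_id": self.env_id, "reward": float(action % 2 == 0)}


class EnvPool:
    """Hypothetical pool: one env per trajectory, stepped concurrently."""

    def __init__(self, n: int) -> None:
        self.envs = [CounterEnv(i) for i in range(n)]

    async def step_batch(self, actions: Sequence[int]) -> List[dict]:
        if len(actions) != len(self.envs):
            raise ValueError("need exactly one action per environment")
        steps = (env.step(a) for env, a in zip(self.envs, actions))
        return list(await asyncio.gather(*steps))


async def main() -> List[float]:
    pool = EnvPool(n=4)
    observations = await pool.step_batch([0, 1, 2, 3])
    return [obs["reward"] for obs in observations]


rewards = asyncio.run(main())
```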

4. evaluate_batch() location

This is another byproduct of having a bad example in the code, and the confusing example is now fixed. The RFC now shows that:

Rubrics live inside environments (env.rubric)
Rewards are computed server-side during step()
Training code never instantiates rubrics directly

burtenshaw

For me, this RFC is ready for the light. I'm looking forward to what the community has to say.

meta-cla Bot added the CLA Signed label Dec 5, 2025

Darktex marked this pull request as draft December 5, 2025 21:39

Darktex force-pushed the rfc-004-reward-pipelines branch from 602b743 to 6c2acb0 Compare December 17, 2025 01:24

Rewrote

d6b15e9

Darktex force-pushed the rfc-004-reward-pipelines branch from 6c2acb0 to d6b15e9 Compare December 17, 2025 01:24

Darktex changed the title ~~[RFC] Add RFC 004: Reward Pipelines~~ [RFC] Add RFC 004: Rubrics Dec 17, 2025

Darktex marked this pull request as ready for review December 17, 2025 01:25

Darktex requested a review from burtenshaw January 16, 2026 19:40

Darktex added 2 commits January 21, 2026 13:31

Merge remote-tracking branch 'origin/main' into rfc-004-reward-pipelines

34047b1

burtenshaw approved these changes Jan 26, 2026

Darktex merged commit 66533ba into main Jan 26, 2026
4 checks passed

Darktex deleted the rfc-004-reward-pipelines branch January 26, 2026 20:38

Darktex merged 4 commits into main from rfc-004-reward-pipelines
