[RFC] Add RFC 004: Rubrics #237
Builds on RFC 003 to standardize reward computation:
- Rubrics are INTERNAL (not exposed to agent)
- Rubrics USE MCP to call external services (LLM judges, DBs)
- Observation.metadata["reward_components"] for per-rubric logging
- POST /config endpoint for dynamic reward shaping
- SDK helpers: RubricComposer, RewardNormalizer

Key insight: MCP tools are for agent actions (RFC 003). Rubrics use MCP internally for external RPC, but are not tools themselves.
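To make the per-rubric logging idea concrete, here is a minimal sketch of how component scores could be recorded under Observation.metadata["reward_components"] while the weighted sum becomes the reward. The `Observation` dataclass, the `apply_rubrics` helper, and the two toy scoring functions are hypothetical stand-ins for illustration; they are not the RFC's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Observation:
    """Hypothetical stand-in: only `reward` and `metadata` are modeled."""
    reward: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)


# A scoring function maps (action, observation) to a float score.
ScoreFn = Callable[[str, Observation], float]


def apply_rubrics(
    action: str,
    obs: Observation,
    rubrics: Dict[str, ScoreFn],
    weights: Dict[str, float],
) -> Observation:
    """Score the action with every rubric, log each score, set the weighted sum as reward."""
    components = {name: fn(action, obs) for name, fn in rubrics.items()}
    # Per-rubric scores are logged for the trainer; the agent never calls rubrics itself.
    obs.metadata["reward_components"] = components
    obs.reward = sum(weights[name] * score for name, score in components.items())
    return obs


# Toy rubrics (assumptions for illustration only).
rubrics = {
    "non_empty": lambda action, obs: 1.0 if action.strip() else 0.0,
    "short": lambda action, obs: 1.0 if len(action) <= 20 else 0.0,
}
weights = {"non_empty": 0.5, "short": 0.5}

obs = apply_rubrics("print('hi')", Observation(), rubrics, weights)
```

Keeping the components in metadata leaves `obs.reward` a single scalar for the training loop while still exposing the breakdown for debugging.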
Now I have enough context. Let me produce the alignment review.

Alignment Review: RFC 004 - Rubric System for Reward Computation

Executive Summary
Tier 1 (Bugs/Lint): ✅ PASS - No critical issues found

Tier 1: Critical Issues
Automated Checks
✅ Lint: Cannot verify (uv not installed in environment)
Code Quality
✅ No implementation changes - This is an RFC only, no code to review

Tier 2: Alignment Concerns

🔴 ALIGNMENT FLAG #1: Dual API Boundary Violation Risk
Invariant at risk: Dual API boundary (INVARIANTS.md:41-58)
The concern: The RFC does not explicitly address how rubrics interact with the dual API model (MCP for agents vs Gym-like for infrastructure). Specifically:
Evidence from codebase:
Required clarification:
Suggested reviewer: @Darktex (RFC 001 author, dual API design owner)

🟡 ALIGNMENT FLAG #2: Unclear Relationship to Existing Reward Infrastructure
Principle at risk: Design for LLMs, Minimize lifecycle deltas (PRINCIPLES.md)
The concern: The RFC introduces a new
Required clarification:

🟡 ALIGNMENT FLAG #3: Async Design May Contradict "Rewards Inside Environment"
Invariant at risk: Rewards in environment (INVARIANTS.md:64-67)
The concern: The RFC proposes:

    async def train():
        # All 64 samples evaluated concurrently via thread pool
        rewards = await evaluate_batch(rubric, actions, batch.observations)

This pattern suggests:
Contradiction with RFC 002:
Current codebase pattern (

    # Rewards computed server-side during step()
    def compute(*, action: TextArenaAction, observation: TextArenaObservation) -> Dict[str, float]

Possible interpretations:
Required clarification:

🟡 ALIGNMENT FLAG #4: Missing Integration with Environment API
Concern: The RFC doesn't show how rubrics integrate with the existing
Questions:
Expected in RFC:
Evidence: RFC 002:82 shows current

🟢 ALIGNMENT FLAG #5: LLMJudge Container Pattern (Informational)
Not a violation, but worth discussing: The RFC proposes
Architectural question:
Likely fine if:
Suggested: Add note about LLMJudge network requirements in deployment section

Minor Issues (Not Blocking)
Documentation Gaps
Examples Could Be Clearer

Positive Observations
✅ Strong academic grounding - Appendix with literature patterns is excellent

Recommendations
Before Merge
MUST address:
SHOULD address:
Consider for Follow-up

Summary
This RFC proposes a well-designed abstraction for reward computation. However, it has unclear boundaries with existing architectural decisions:
These are likely documentation gaps rather than design flaws, but they risk implementation that violates established invariants. A human reviewer (suggest @Darktex) should confirm the intended architecture before implementation begins.
Recommendation:

Automated review by Claude Code
Thanks for taking the time to write this out @Darktex, this is an important RFC that's very close to challenges we're facing at HF. If I understand correctly, this RFC defines a single abstraction for computing and handling envs' scores. The result of which is:
Main high-level points
… batch example
- Add "Rubrics Live Inside Environments" section clarifying env.rubric access
- Fix batch evaluation example to show EnvPool pattern (not standalone rubric)
- Add RubricDict container for multi-task environments
- Add get_rubric(path) method for nested access
- Update Environment base class to require rubric attribute
- Add Implementation Plan section with stacked PR breakdown
- Add "One env = one trajectory" principle to PRINCIPLES.md

Addresses feedback from @burtenshaw and alignment review.
Thanks for the thoughtful review @burtenshaw! You raised several important points that helped clarify the design. Here's how I'd address them (and since Claude is doing all the work now, the PR was updated with what it's done :D)

1. Multi-task environments (per-task vs shared rubrics)
I think the PyTorch inspiration keeps working well here. I simply added RubricDict, a new container analogous to nn.ModuleDict for keyed dispatch:

    class AtariRubric(Rubric):
        def __init__(self):
            super().__init__()  # assumed necessary, as with nn.Module, so child rubrics get registered
            self.games = RubricDict({
                "pong": PongRubric(),
                "breakout": BreakoutRubric(),
            })

        def forward(self, action, obs) -> float:
            # Dispatch to the rubric registered for this observation's game
            return self.games[obs.game_id](action, obs)

Access: Also added

2. Container Rubrics + auto-registration
Agreed this is crucial. The RFC already covers this, but we've now made it clearer that trainers can access the full hierarchy via

3. Batch evaluation interface
The examples were super confusing here, so I revised them. Thanks for catching this. Let's make sure we align on the design! Basically, we stick to one env = one trajectory. Environments don't support multiplexing. Batching is achieved by stacking environments:

    envs = EnvPool("code_env", n=64)
    observations = await envs.step_batch(actions)  # can be made cleverer w/ asyncio, but in the end you gotta await
    rewards = [obs.reward for obs in observations]

Individual envs don't need batch awareness: EnvPool handles orchestration. This is now documented in both the RFC and PRINCIPLES.md.

4. evaluate_batch() location
This is another byproduct of having a bad example in the code. Fixed the confusing example. The RFC now shows that:
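As a rough illustration of the "one env = one trajectory" stacking pattern, the sketch below steps several independent environments concurrently with asyncio.gather. `ToyEnv`, its length-based reward, and this simplified `EnvPool(n=...)` constructor are hypothetical; the real EnvPool takes an environment name and may be implemented quite differently.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical observation carrying only the reward."""
    reward: float


class ToyEnv:
    """Hypothetical single-trajectory environment; the reward is computed inside step()."""

    async def step(self, action: str) -> Observation:
        await asyncio.sleep(0)  # stands in for real I/O (container call, LLM judge, ...)
        return Observation(reward=float(len(action)))


class EnvPool:
    """Hypothetical pool: n independent environments stepped concurrently."""

    def __init__(self, n: int) -> None:
        self.envs = [ToyEnv() for _ in range(n)]

    async def step_batch(self, actions: List[str]) -> List[Observation]:
        if len(actions) != len(self.envs):
            raise ValueError("need exactly one action per environment")
        # asyncio.gather returns results in the same order as its inputs,
        # so observations[i] belongs to actions[i].
        results = await asyncio.gather(
            *(env.step(action) for env, action in zip(self.envs, actions))
        )
        return list(results)


async def main() -> List[float]:
    envs = EnvPool(n=3)
    observations = await envs.step_batch(["a", "bb", "ccc"])
    return [obs.reward for obs in observations]


rewards = asyncio.run(main())
```

No environment in the pool knows about the batch; all concurrency lives in `step_batch`, which is the point of the design.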
burtenshaw
left a comment
For me, this RFC is ready for the light. I'm looking forward to what the community has to say.
Summary
Introduces RFC 004: Rubric System—a composable, nn.Module-inspired abstraction for computing rewards in OpenEnv environments.
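For readers unfamiliar with the nn.Module pattern, here is a minimal sketch of what a composable rubric base class with auto-registration, a keyed RubricDict container, and get_rubric(path) lookup could look like. Everything in it (the `_children` registry, the `__setattr__` hook, `LengthRubric`, `MultiTaskRubric`, and the dict-shaped observation) is an assumption for illustration, not the RFC's actual implementation.

```python
from typing import Any, Dict


class Rubric:
    """Hypothetical base class: child rubrics assigned as attributes are auto-registered."""

    def __init__(self) -> None:
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name: str, value: Any) -> None:
        # Mirror nn.Module: remember sub-rubrics so the hierarchy can be walked later.
        if isinstance(value, Rubric):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def __call__(self, action: Any, obs: Any) -> float:
        return self.forward(action, obs)

    def forward(self, action: Any, obs: Any) -> float:
        raise NotImplementedError

    def get_rubric(self, path: str) -> "Rubric":
        """Resolve a dotted path such as 'tasks.short' through registered children."""
        node: Rubric = self
        for part in path.split("."):
            node = node._children[part]
        return node


class RubricDict(Rubric):
    """Hypothetical keyed container, loosely analogous to nn.ModuleDict."""

    def __init__(self, rubrics: Dict[str, Rubric]) -> None:
        super().__init__()
        self._children.update(rubrics)

    def __getitem__(self, key: str) -> Rubric:
        return self._children[key]


class LengthRubric(Rubric):
    """Toy rubric: 1.0 if the action fits within `limit` characters, else 0.0."""

    def __init__(self, limit: int) -> None:
        super().__init__()
        self.limit = limit

    def forward(self, action: str, obs: Any) -> float:
        return 1.0 if len(action) <= self.limit else 0.0


class MultiTaskRubric(Rubric):
    def __init__(self) -> None:
        super().__init__()
        self.tasks = RubricDict({"short": LengthRubric(5), "long": LengthRubric(50)})

    def forward(self, action: str, obs: Dict[str, str]) -> float:
        # Keyed dispatch on the task named in the observation.
        return self.tasks[obs["task"]](action, obs)


rubric = MultiTaskRubric()
score = rubric("hello world", {"task": "long"})
```

Because children are registered on assignment, a trainer holding only the top-level rubric can reach any nested rubric by path, for example `rubric.get_rubric("tasks.short")`.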
Key design decisions:
What's included:
Design informed by:
Test plan