Add custom task system for BrowserGym environment#201
Conversation
|
@atchudhansg - thanks for sharing this PR. Supporting generic integration with training tasks is definitely part of the roadmap(cc: @Darktex ) but we are still in early stages of other features. That being said, does it make sense to have an RFC on how we will support tasks more generically that run inside/alongside OpenEnv? This wont be specific to browser tasks only. Let me know what you think. |
|
Thanks for the feedback @pankit-eng ! I completely agree that a generic approach for custom tasks across all OpenEnv environments (not just BrowserGym) would be valuable. I've drafted an RFC (RFC 005) to propose a standardized interface for this. The core idea is to allow users to inject task logic (rewards, observations, termination) at runtime without needing to rebuild containers or modify the core environment code. I've added the RFC file ( |
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.
Based on my review of the PR, design documents, and existing codebase, I can now provide my comprehensive alignment review.
PR #201 Review: Add Custom Task System for BrowserGym Environment
Previous Review Summary
Key points from @pankit-eng and @atchudhansg discussion:
- Need for generic task integration across all OpenEnv environments, not just BrowserGym
- RFC 005 drafted to propose standardized interface for custom tasks
- Core idea: Allow runtime injection of task logic (rewards, observations, termination) without rebuilding containers
Tier 1: Bugs, Lint, Security, Debug Code
❌ CRITICAL: Incomplete Code in custom_base.py
Location: src/envs/browsergym_env/server/custom/custom_base.py:310
The _get_or_create_event_loop method is truncated mid-implementation:
def step(self, action: CustomGymAction) -> CustomGymObservation:
loop = self._get_or_create_event_loop()
return loop.run_until_complete(self._async_step(action))
# Code cuts off here - missing async_close and other methodsImpact: This will cause runtime errors. The file appears incomplete.
Fix needed: Complete the implementation or verify the diff is showing the full file.
⚠️ Lint Issues
The lint check failed due to missing uv:
Error: 'uv' is not installed or not in PATH
Action required: Run lint manually with ruff format --check once uv is available, or use Docker-based lint.
⚠️ Debug Code in Example
Location: examples/browsergym_custom_example.py
Multiple print() statements throughout the example file (lines 21-128). While acceptable for examples, consider whether this should use logging for consistency.
Recommendation: Examples are typically allowed debug output. No action required unless project style dictates otherwise.
⚠️ Emoji Removal
Location: Multiple files in README updates
The PR removes emojis from documentation:
- Changed from
RecommendedtoRecommended - Removed various visual indicators
Observation: This aligns with project style (CLAUDE.md says "only use emojis if user explicitly requests"). Good cleanup.
🔒 Security: No Critical Issues
- No credential exposure detected
- Container isolation maintained
- No SQL injection or XSS vulnerabilities introduced
- File paths properly validated
Tier 2: Alignment with Principles & Invariants
🚨 ALIGNMENT FLAG #1: RFC Required for Architectural Change
Principle at stake: "Changes affecting core APIs or architecture require RFCs" (CONTRIBUTING.md:28-34)
The concern:
This PR introduces a parallel environment system (CustomBrowserGymEnvironment) that bypasses the standard BrowserGym integration pattern. This is a significant architectural decision that:
- Adds a second environment implementation path - Custom tasks use Playwright directly while official benchmarks use BrowserGym's gym registration
- Introduces new abstractions -
CustomGymAction,CustomGymObservation,CustomGymStateparallel the existing models - Changes the interface contract - The
benchmark="custom"parameter fundamentally changes howBrowserGymEnvironmentoperates
Per CONTRIBUTING.md:
Required for:
- New core APIs in
src/openenv/core/- Breaking changes to existing APIs
- Major architectural decisions
- New abstractions or design patterns
The proposed RFC 005 is currently "Draft" status and has critical issues (see FLAG #2 below).
Suggested reviewer: @pankit-eng (already engaged in discussion), @Darktex
Recommendation:
- Do not merge until RFC 005 is reviewed and accepted
- RFC should address whether this pattern should be generalized to other environments first
- Consider whether this belongs in browsergym_env or deserves a separate "custom_env" module
🚨 ALIGNMENT FLAG #2: RFC 005 Has Wrong RFC Number
Location: rfcs/005-generic-task-integration.md:6
**RFC ID:** 004 # Should be 005!Current RFCs on main:
- 000, 001, 002, 003 exist
- Next available: 004 (not 005)
Additionally:
- RFC 005 claims to be "004" in the ID field
- File is named
005-generic-task-integration.mdbut should be004-generic-task-integration.md
Fix needed: Rename file to 004-generic-task-integration.md and update RFC ID to 004.
⚠️ ALIGNMENT FLAG #3: Dual API Boundary Potentially Violated
Invariant at stake: "Dual API boundary" (INVARIANTS.md:40-58)
The concern:
The custom task system allows arbitrary JavaScript execution:
# In custom_base.py:_execute_action()
else:
# Try to execute as JavaScript if not recognized
await self._page.evaluate(action_str)From INVARIANTS.md:
The Gym-like API is NOT accessible to the agent being trained.
Violations to flag:
- Exposing
reset(),step(), orstate()via MCP tools- Giving agents direct access to the Gym-like WebSocket API
- Any mechanism that lets an agent trigger simulation control
Analysis:
While this doesn't directly expose reset(), allowing arbitrary JavaScript could theoretically let an agent:
- Manipulate the browser state in ways that affect reward calculation
- Access browser APIs that could leak information about the simulation
- Potentially interfere with the observation extraction logic
However, this appears to be standard BrowserGym behavior - official benchmarks also execute arbitrary JavaScript for complex actions.
Recommendation:
- Document in RFC 005 that custom tasks inherit BrowserGym's action execution model
- Clarify that JavaScript sandboxing is BrowserGym's responsibility
- Consider whether custom tasks need additional guardrails
Suggested reviewer: @Darktex (architectural invariants)
⚠️ ALIGNMENT FLAG #4: Rewards Outside Environment Boundary?
Principle at stake: "Rewards inside environment" (PRINCIPLES.md:31, INVARIANTS.md:65-67)
The concern:
From INVARIANTS.md:
Reward computation must stay inside environment boundary
The custom task system allows users to define reward functions in task classes:
def _calculate_reward(self, page_data, action, error=None) -> float:
"""Calculate reward based on page state."""
# User-defined reward logicAnalysis:
This appears compliant because:
- Tasks are instantiated inside the container (server-side only)
- Task classes are part of the environment implementation
- Rewards are computed server-side and returned in observations
However, the RFC 005 proposal mentions:
"Users often need to test agents on proprietary or local tasks"
If "local tasks" means tasks defined client-side and somehow injected, this would violate the invariant.
The current implementation appears safe - tasks are defined in server/custom/custom_tasks.py on the server side.
Recommendation:
- RFC 005 should explicitly state that custom tasks are server-side only
- Clarify that reward logic cannot be injected from client
- Document the volume mount mechanism mentioned in RFC 005 more clearly
⚠️ Principle Alignment: Client-Server Separation
Invariant: "Client-server separation" (INVARIANTS.md:59-62)
Observation:
The PR adds imports at the top of browsergym_environment.py:
# Add the server directory to sys.path
_SERVER_DIR = os.path.dirname(os.path.abspath(__file__))
if _SERVER_DIR not in sys.path:
sys.path.insert(0, _SERVER_DIR)
# Import custom models
from custom.custom_models import (
CustomGymAction,
CustomGymObservation,
CustomGymState,
)Analysis:
This is acceptable because:
- Server code importing other server code (same directory)
- Client code in
client.pydoes not import custom models - Custom types are converted to
BrowserGymObservationbefore leaving the server
Compliant ✓
📋 Missing: Tests
Observation:
The PR adds:
- New environment classes
- New action parsing logic
- New observation conversion logic
- Error handling paths
But no tests are included.
From .claude/hooks/test.sh:
PYTHONPATH=src:envs uv run pytest tests/ -v --tb=shortRecommendation:
- Add
tests/envs/test_browsergym_custom.py - Test custom task registration
- Test action parsing (especially edge cases in
_execute_action) - Test observation conversion
- Test error handling when custom task not found
📋 Missing: Documentation in RFC 005
Current RFC 005 gaps:
- No implementation details - How does task registration work? Where do HTML files go?
- No security model - Who can add custom tasks? How are they sandboxed?
- No migration path - How does this relate to existing BrowserGym benchmarks?
- Vague volume mounting - The RFC mentions volume mounting but the PR doesn't implement it
- No rejection criteria - When should we reject a custom task approach?
From CONTRIBUTING.md, RFC template requires:
- Architecture Overview
- Core Abstractions
- Key Design Decisions with rationale and trade-offs
- Examples (✓ present)
Recommendation:
RFC 005 needs significant expansion before acceptance.
✅ Positive Alignment: Task-Specific Logic
Principle: "Domain knowledge encapsulated in env, not external" (PRINCIPLES.md:31)
The custom task system correctly keeps:
- Reward logic inside environment
- Termination logic inside environment
- Observation extraction inside environment
This aligns well with the "Rewards inside environment" principle. ✓
✅ Positive: Type Safety Maintained
Invariant: "Generic type safety" (INVARIANTS.md:13-16)
The PR maintains type safety:
- Custom types extend the same Pydantic patterns
- Observation conversion preserves types
- Generic types match across boundaries
Compliant ✓
Summary
Critical Issues (Must Fix Before Merge)
- Incomplete code in
custom_base.py- file appears truncated - RFC 005 wrong number - should be RFC 004
- RFC 005 not accepted - still in "Draft", needs review per CONTRIBUTING.md
Alignment Flags (Need Human Review)
- RFC required - Architectural change needs accepted RFC before merge
- RFC 005 incomplete - Needs expansion on security, architecture, trade-offs
- Dual API boundary - Arbitrary JavaScript execution needs documentation
- Rewards boundary - Clarify that tasks are server-side only
Missing Components
- Tests - No test coverage for new functionality
- Lint verification - Unable to run due to
uvmissing
Positive Aspects
- Maintains type safety ✓
- Respects client-server separation ✓
- Keeps rewards inside environment ✓
- Good documentation in README ✓
- Removes emojis per style guide ✓
Recommendation
DO NOT MERGE until:
- ✅ Complete the truncated code in
custom_base.py - ✅ Rename RFC to
004-generic-task-integration.mdwith correct ID - ✅ Expand RFC 004 with architecture, security, trade-offs
- ✅ Get RFC 004 reviewed and accepted per CONTRIBUTING.md
- ✅ Add test coverage for custom task system
- ✅ Run and pass lint checks
- ✅ Address whether this should be generalized beyond BrowserGym
Timeline: Human review needed on architectural decisions before implementation can proceed.
Automated review by Claude Code | Learn more about OpenEnv's agentic workflow
|
@greptile |
Greptile Summary
Important Files Changed
Confidence score: 3/5
Sequence DiagramsequenceDiagram
participant User
participant BrowserGymEnv as "BrowserGymEnvironment"
participant CustomEnv as "Custom Environment"
participant Browser as "Playwright Browser"
participant HTMLPage as "HTML Task Page"
User->>BrowserGymEnv: "reset()"
BrowserGymEnv->>CustomEnv: "reset()"
CustomEnv->>Browser: "launch browser"
CustomEnv->>Browser: "create new page"
CustomEnv->>HTMLPage: "navigate to task URL"
HTMLPage-->>CustomEnv: "page loaded"
CustomEnv->>HTMLPage: "extract observation"
HTMLPage-->>CustomEnv: "page content & state"
CustomEnv-->>BrowserGymEnv: "CustomGymObservation"
BrowserGymEnv-->>User: "BrowserGymObservation"
User->>BrowserGymEnv: "step(action)"
BrowserGymEnv->>CustomEnv: "step(CustomGymAction)"
CustomEnv->>Browser: "execute action (click/fill/press)"
Browser->>HTMLPage: "perform DOM interaction"
HTMLPage-->>Browser: "action result"
CustomEnv->>HTMLPage: "extract observation"
HTMLPage-->>CustomEnv: "updated page state"
CustomEnv->>CustomEnv: "calculate reward"
CustomEnv->>CustomEnv: "check if done"
CustomEnv-->>BrowserGymEnv: "CustomGymObservation"
BrowserGymEnv-->>User: "BrowserGymObservation"
User->>BrowserGymEnv: "close()"
BrowserGymEnv->>CustomEnv: "close()"
CustomEnv->>Browser: "close browser"
|
| const currentPath = window.location.pathname; | ||
| const directory = currentPath.substring(0, currentPath.lastIndexOf('/')); | ||
| const targetPath = directory + '/copy-paste-target.html'; | ||
| window.location.href = 'file://' + targetPath; |
There was a problem hiding this comment.
logic: file:// protocol navigation may not work reliably in browser environments due to security restrictions. Consider using relative paths or server-side routing instead. Will this task always run in an environment that allows file:// protocol access?
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/custom/tasks/copy-paste-source.html
Line: 33:33
Comment:
**logic:** file:// protocol navigation may not work reliably in browser environments due to security restrictions. Consider using relative paths or server-side routing instead. Will this task always run in an environment that allows file:// protocol access?
How can I resolve this? If you propose a fix, please make it concise.| **Status**: Draft | ||
| **Created**: 06/12/2025 | ||
| **Authors**: @atchudhansg | ||
| **RFC ID:** 004 |
There was a problem hiding this comment.
syntax: RFC ID is incorrect - document title indicates this is RFC 005 but header shows RFC 004
| **RFC ID:** 004 | |
| **RFC ID:** 005 |
Prompt To Fix With AI
This is a comment left during a code review.
Path: rfcs/005-generic-task-integration.md
Line: 6:6
Comment:
**syntax:** RFC ID is incorrect - document title indicates this is RFC 005 but header shows RFC 004
```suggestion
**RFC ID:** 005
```
How can I resolve this? If you propose a fix, please make it concise.| else: | ||
| # Try to execute as JavaScript if not recognized | ||
| await self._page.evaluate(action_str) |
There was a problem hiding this comment.
logic: JavaScript execution fallback creates potential security vulnerability - arbitrary code could be executed
| else: | |
| # Try to execute as JavaScript if not recognized | |
| await self._page.evaluate(action_str) | |
| else: | |
| # Unsupported action - raise error instead of executing arbitrary JS | |
| raise ValueError(f"Unsupported action format: {action_str}") |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/custom/custom_base.py
Line: 301:303
Comment:
**logic:** JavaScript execution fallback creates potential security vulnerability - arbitrary code could be executed
```suggestion
else:
# Unsupported action - raise error instead of executing arbitrary JS
raise ValueError(f"Unsupported action format: {action_str}")
```
How can I resolve this? If you propose a fix, please make it concise.| # Add the server directory to sys.path to allow custom module imports | ||
| _SERVER_DIR = os.path.dirname(os.path.abspath(__file__)) | ||
| if _SERVER_DIR not in sys.path: | ||
| sys.path.insert(0, _SERVER_DIR) |
There was a problem hiding this comment.
style: Modifying sys.path at module import can cause issues with module resolution and testing. Consider using importlib.util.spec_from_file_location for more controlled imports.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/browsergym_environment.py
Line: 26:29
Comment:
**style:** Modifying sys.path at module import can cause issues with module resolution and testing. Consider using importlib.util.spec_from_file_location for more controlled imports.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| try: | ||
| from custom.custom_models import ( | ||
| CustomGymAction as _CustomGymAction, | ||
| CustomGymObservation as _CustomGymObservation, | ||
| CustomGymState as _CustomGymState, | ||
| ) | ||
| CUSTOM_AVAILABLE = True | ||
| CustomGymAction = _CustomGymAction | ||
| CustomGymObservation = _CustomGymObservation | ||
| CustomGymState = _CustomGymState | ||
| _CUSTOM_IMPORT_ERROR = None | ||
| except ImportError as e: | ||
| CUSTOM_AVAILABLE = False | ||
| CustomGymAction = None # type: ignore | ||
| CustomGymObservation = None # type: ignore | ||
| CustomGymState = None # type: ignore | ||
| _CUSTOM_IMPORT_ERROR = str(e) |
There was a problem hiding this comment.
style: The try/except import pattern with global variable assignment creates type checking complications. The # type: ignore comments suggest this approach is fighting the type system. Is there a specific reason for this pattern instead of using conditional imports or factory functions?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/browsergym_environment.py
Line: 40:56
Comment:
**style:** The try/except import pattern with global variable assignment creates type checking complications. The `# type: ignore` comments suggest this approach is fighting the type system. Is there a specific reason for this pattern instead of using conditional imports or factory functions?
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| obs = self.custom_env.reset(seed=seed) | ||
| self._state = self.custom_env.state |
There was a problem hiding this comment.
logic: Directly overwriting self._state with self.custom_env.state bypasses the OpenEnv state management. This could break state tracking and episode management.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/browsergym_environment.py
Line: 212:213
Comment:
**logic:** Directly overwriting `self._state` with `self.custom_env.state` bypasses the OpenEnv state management. This could break state tracking and episode management.
How can I resolve this? If you propose a fix, please make it concise.| url: str = "" | ||
| """Current URL of the page""" | ||
|
|
||
| screenshot: Optional[List[List[List[int]]]] = None |
There was a problem hiding this comment.
style: Screenshot type should be more specific than nested lists. Consider using numpy.ndarray or a proper image type.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/custom/custom_models.py
Line: 56:56
Comment:
**style:** Screenshot type should be more specific than nested lists. Consider using numpy.ndarray or a proper image type.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Add current directory to path for relative imports | ||
| _CURRENT_DIR = os.path.dirname(os.path.abspath(__file__)) | ||
| if _CURRENT_DIR not in sys.path: | ||
| sys.path.insert(0, _CURRENT_DIR) |
There was a problem hiding this comment.
style: Adding current directory to sys.path can be a security risk as it affects module resolution globally. Consider using absolute imports or relative imports within the package structure instead. Is there a specific reason for this path manipulation rather than using proper package imports?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/custom/custom_tasks.py
Line: 11:14
Comment:
**style:** Adding current directory to `sys.path` can be a security risk as it affects module resolution globally. Consider using absolute imports or relative imports within the package structure instead. Is there a specific reason for this path manipulation rather than using proper package imports?
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def _get_task_url(self) -> str: | ||
| """Get the URL for the copy-paste task.""" | ||
| # This should point to a local HTML file or a URL hosting the task | ||
| import os |
There was a problem hiding this comment.
style: Importing os inside method when it's already imported at module level (line 8)
| import os | |
| task_html = os.path.join( |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/browsergym_env/server/custom/custom_tasks.py
Line: 81:81
Comment:
**style:** Importing `os` inside method when it's already imported at module level (line 8)
```suggestion
task_html = os.path.join(
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
Review Feedback AddressedThanks for the review! I've addressed all the concerns raised: Added Comprehensive Test Suite
Code Cleanup
Test ResultsAll validation and integration tests pass successfully: PYTHONPATH=src python3 tests/envs/test_custom_validation.py
# ✅ PASS: Imports
# ✅ PASS: Task Registration
# ✅ PASS: Task Interface
# ✅ PASS: Model Instantiation
PYTHONPATH=src python3 tests/envs/test_custom_integration.py
# ✅ PASS: Environment Creation
# ✅ PASS: HTML Task Loading
# ✅ PASS: Multi-tab Task
# ✅ PASS: Environment Reset/Step/StateThe custom task system works correctly through the OpenEnv interface. The comprehensive pytest suite is ready for your CI/CD pipeline. Let me know if there are any other changes needed! |
- Implement custom task framework with base classes, models, and registry - Add comprehensive test suite (validation, integration, pytest) - Add RFC 005 documenting generic task support design - Support custom browser tasks alongside official BrowserGym benchmarks - Migrate to new package structure (openenv.core.env_server) - Update data models to use Pydantic v2 BaseModel Changes rebased onto restructured main branch.
- Fix import path: browsergym_env.models -> envs.browsergym_env.models - Fix observation conversion: use actual error/last_action_error values - Reduce test verbosity: remove excessive print statements - All integration tests now pass with browser automation
30335d5 to
ab1a52c
Compare
Add Custom Task System for BrowserGym
Summary
Adds a custom task framework that allows users to create domain-specific BrowserGym tasks without depending on official benchmark packages.
Changes
browsergym_custom_example.pydemonstrating usageKey Features
Usage
Files Changed