10 changes: 10 additions & 0 deletions .cursor/BUGBOT.md
@@ -14,3 +14,13 @@ Any PR that adds or modifies core user-facing functionality as described in `doc
Notable information which should be available for reference, but does not neatly map to a specific documentation section, should be mentioned in `docs/faqs.md`.

If such changes are detected without a corresponding documentation update, request that the author add an entry.

## Example Environments Updates

Any PR that adds or removes an environment from the `environments/` folder must update `environments/README.md` to reflect the change. The README should:

- List the new environment under the appropriate category/pattern section
- Remove references to deleted environments
- Update the "What to look at for each pattern" section if applicable

If an environment is added or removed without a corresponding `environments/README.md` update, request that the author add the necessary changes.
43 changes: 25 additions & 18 deletions environments/README.md
@@ -11,18 +11,15 @@ This folder contains installable example environments that showcase common usage

### SingleTurnEnv (prompt → single response)
- **gsm8k**: Classic QA with exact-match reward; toggles `ThinkParser` vs `Parser` and format reward.
- **math**: Hendrycks MATH dataset with `MathRubric` reward (using HuggingFace's `math-verify` scorer).
- **reverse_text**: XML formatting with non-binary LCS reward + format reward.
- **gpqa**: Multiple-choice; demonstrates optional judge-based secondary scoring via `RubricGroup`.
- **simpleqa**: Judge-graded A/B/C classification using `JudgeRubric` rewards.
- **summarize_text**: Multiple rewards (length/format + similarity) combined in one `Rubric`.
- **continuation_quality**: Completion-style generation (`message_type="completion"`) judged for prose quality with `JudgeRubric`.
- **mmmu**: Multimodal inputs (image + text) packed in chat content; single-turn boxed-answer check.
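The single-turn pattern above boils down to: take one model completion, extract a final answer, and compare it to the reference. A plain-Python stand-in for such a reward function (illustrative only, not the library's actual `Parser`/`Rubric` API):

```python
def exact_match_reward(completion: str, answer: str) -> float:
    """Score 1.0 if the parsed final answer matches the target, else 0.0."""
    # Treat the last non-empty line as the final answer, standing in for a
    # real Parser implementation.
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    predicted = lines[-1] if lines else ""
    return 1.0 if predicted == answer.strip() else 0.0
```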

### SingleTurnEnv subclass (custom dataset/scoring wrappers)
- **reasoning_gym_env**: Wraps `reasoning_gym` procedural datasets, converts to HF datasets, uses `XMLParser` and task-specific scoring.

### MultiTurnEnv (custom interaction protocols)
- **alphabet_sort**: Multi-turn task requiring the model to maintain and update an alphabetically sorted list of names across turns; uses `XMLParser` with per-turn sequence similarity rewards.
- **doublecheck**: Simple follow-up turn ("Are you sure?") with math rewards; minimal `is_completed`/`env_response` implementation.
- **sentence_repeater**: Multi-turn Q/A over a paragraph; rewards compare assistant messages to expected answers.
- **wordle**: Game-style interaction via `TextArenaEnv`; multiple rewards (correctness, partial credit, few-turn bonus) and XML formatting.
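The multi-turn environments above all implement the same two hooks mentioned for `doublecheck`: `is_completed` decides when the rollout ends, and `env_response` supplies the environment's message between model turns. A minimal plain-Python stand-in (not the library's actual base class):

```python
class DoubleCheckSketch:
    """Sketch of the doublecheck-style protocol: one follow-up turn."""

    def is_completed(self, messages: list[dict]) -> bool:
        # Done once the model has answered twice (initial answer + recheck).
        return sum(m["role"] == "assistant" for m in messages) >= 2

    def env_response(self, messages: list[dict]) -> dict:
        # The single follow-up turn asking the model to double-check itself.
        return {"role": "user", "content": "Are you sure?"}
```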
@@ -32,45 +32,55 @@ This folder contains installable example environments that showcase common usage
- **tool_test**: Validates parallel tool calls and checks exact tool usage via `ToolRubric` + custom reward.
- **wiki_search**: Multi-tool retrieval (search/view/read) with `ToolEnv`; final judgment combined via `RubricGroup` with a `JudgeRubric`.

- **XML tool calling (roll-your-own on MultiTurnEnv)**
- **xml_tool_env**: Parses `<tool>{...}</tool>` commands with `XMLParser`, executes Python functions, and returns `<result>...</result>` via `env_response`.
- **xlam_function_calling**: Single-turn XML tool-call verification (no execution) that checks called tools match the ground truth list.
- **smolagents_math_tools**: Integrates Smolagents `Tool` objects and a custom parser for tool/answer XML; demonstrates external tool frameworks.
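The roll-your-own pattern above can be sketched in plain Python: parse a `<tool>{...}</tool>` command, execute the named function, and wrap the output in `<result>...</result>` as the environment's response. The JSON call shape and dispatch table here are assumptions for illustration:

```python
import json
import re

# Hypothetical dispatch table mapping tool names to Python functions.
TOOLS = {"add": lambda a, b: a + b}

def env_response_for_tool_call(message: str) -> str:
    """Execute the <tool>{...}</tool> command in a message, if any."""
    match = re.search(r"<tool>(.*?)</tool>", message, re.DOTALL)
    if match is None:
        return "<result>error: no tool call found</result>"
    call = json.loads(match.group(1))  # assumed shape: {"name": ..., "args": {...}}
    output = TOOLS[call["name"]](**call["args"])
    return f"<result>{output}</result>"
```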

### Sandboxes
- **PythonEnv (ipython-style REPL)**
- **math_python**: Solve math problems using Python in a sandbox environment.

### GymEnv (external gym environments)
- **gem_wordle**: Multi-turn Wordle game powered by the GEM framework; models must guess a 5-letter word using `\boxed{}` format.

### Experimental environments
- **MCPEnv (MCP server integration)**
- **mcp_search_env**: Example environment demonstrating `vf.MCPEnv` for Model Context Protocol server integration.

- **RLMEnv (Recursive Language Model)**
- **rlm_secrets**: Puzzle environment testing RLM functionality including root-level tools, sub-LLM tool use, and file operations.

- **HarborEnv / CliAgentEnv (CLI agent sandboxes)**
- **dummy_harbor_env**: Minimal Harbor environment for testing the CLI agent interception framework.
- **opencode_harbor**: Runs the OpenCode CLI agent on Harbor tasks with API interception via Prime Tunnel.
- **terminus_harbor**: Runs the Terminus agent on Harbor tasks with API interception via Prime Tunnel.

### Composition
- **EnvGroup**
- **math_group**: Groups two `SingleTurnEnv` tasks (GSM8K + Math) into one environment with shared interface.

- **RubricGroup**
- **math_python**: `ToolRubric` (tool adherence) + `MathRubric` (answer correctness).
- **gpqa**: Adds a `JudgeRubric` alongside base rubric for auxiliary scoring.
- **wiki_search**: Merges judge scoring with the tool-use rubric.
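The composition pattern above can be sketched as several scoring functions (stand-ins for `ToolRubric` / `MathRubric` / `JudgeRubric`) applied to the same completion with their rewards combined. A plain weighted sum is assumed here; the library's `RubricGroup` may combine scores differently:

```python
def combine_rubrics(reward_fns, weights, completion, answer):
    """Apply each reward function and return the weighted sum of scores."""
    return sum(w * fn(completion, answer) for fn, w in zip(reward_fns, weights))
```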

### Judge-based evaluation (LLM-as-judge)
- **simpleqa**: Judge rubric maps graded letters to reward.
- **continuation_quality**: Judge rubric extracts `<grade>` and maps A–F to a continuous score.
- **toxicity_explanation**: Judge rubric returns 0–10 normalized score for both classification correctness and explanation quality.
- **self_reward**: Pattern for `SingleTurnEnv` with only a `JudgeRubric` over a dataset that supplies `question`/`answer`; intended for online RL where model acts as its own judge.
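The grade-to-reward mapping used by judge-based environments such as `continuation_quality` (extract a `<grade>` tag, map A-F to a continuous score) reduces to a small lookup. The numeric scale below is an assumption for illustration:

```python
import re

# Hypothetical A-F scale; the real environment may use different values.
GRADE_SCALE = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}

def grade_reward(judge_output: str) -> float:
    """Map the judge's <grade> letter to a continuous reward."""
    match = re.search(r"<grade>\s*([A-DF])\s*</grade>", judge_output)
    return GRADE_SCALE[match.group(1)] if match else 0.0
```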

### Parsers and formatting
- **ThinkParser**: Used in `gsm8k`, `wiki_search` to separate reasoning from final answers.
- **Custom parsers**: `smolagents_math_tools` defines a bespoke parser to interoperate with external tool schemas.
- **XMLParser**: Used in `reverse_text`, `wordle`, `alphabet_sort`, `reasoning_gym_env` to enforce structured outputs and enable format rewards.
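The `ThinkParser` idea above, separating reasoning from the final answer before scoring, can be sketched with a regex; the `<think>` tag name is an assumption for illustration, not necessarily what the library emits:

```python
import re

def strip_think(completion: str) -> str:
    """Drop the <think>...</think> reasoning span and keep the final answer."""
    return re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
```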

### Multimodal inputs
- **mmmu**: Demonstrates passing images via chat `content` items with `{type: "image_url", image_url: {url: ...}}` and standard answer parsing.
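The mmmu pattern above packs an image next to text inside a single chat message's `content` list using OpenAI-style multimodal content items; the URL below is a placeholder:

```python
# One user turn carrying both a text part and an image part.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in the figure?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/figure.png"}},
    ],
}
```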

## What to look at for each pattern
- **Minimal SingleTurnEnv**: `reverse_text`, `gsm8k`
- **JudgeRubric end-to-end**: `continuation_quality`, `toxicity_explanation`, `self_reward`
- **ToolEnv with real tools**: `wiki_search`, `math_python`
- **XML tools without native function-calling**: `xml_tool_env`, `xlam_function_calling`
- **Custom MultiTurnEnv**: `alphabet_sort`, `doublecheck`, `sentence_repeater`, `wordle`
- **GymEnv integration**: `gem_wordle`
- **CLI agent sandboxes**: `dummy_harbor_env`, `opencode_harbor`, `terminus_harbor`
- **MCP integration**: `mcp_search_env`
- **RLM (recursive LLM)**: `rlm_secrets`
- **Environment and rubric composition**: `math_group`, `math_python`, `wiki_search`
- **Procedural datasets**: `reasoning_gym_env`
- **Multimodal**: `mmmu`
