Merged
25 commits
2 changes: 1 addition & 1 deletion docs/environments.md
Expand Up @@ -795,6 +795,6 @@ These require additional dependencies installed via extras (e.g., `uv add 'verif
Newer and more experimental environment classes include:

- **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization. Also accepts retry-tuning parameters (e.g., `max_retries`) and connection-pooling parameters (e.g., `sandbox_client_max_workers`) via `SandboxMixin`
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Execution supports both local and sandbox backends via `execution_backend` (`"local"` default, `"sandbox"` to run the REPL inside a Prime Sandbox). Context is still filesystem-based: a provided `context_dir` is copied into the working directory, or legacy JSON-serializable `context` data is written to `context.json`/`context.txt`. The RLM scaffolding prompt (filesystem availability note, REPL workflow, tool docs) is injected into the first user message wrapped in `<RLM_SCAFFOLDING>...</RLM_SCAFFOLDING>`, preserving any external system prompt; the model-visible prompt is stored in `state["prompt"]`, while the original input prompt is preserved in `state["raw_prompt"]`. The REPL language is configurable via `repl_language` (default: `bash`); use `repl_language="python"` to retain the Python REPL. Bash mode uses `call_bash_repl` and behaves like a terminal; Python mode uses `call_python_repl`. Sub-LLM and root-tool interception for sandboxes is routed through a Prime Tunnel unless `interception_url` is provided. Tooling can be split via `tools` (shared), `root_tools` (REPL-only), and `sub_tools` (sub-LLM tools). Fixed root tools like `llm_batch` are always present and cannot be overridden. Tool ordering is fixed tools → shared tools → role-specific tools, with per-list deduplication by name. Root tools are callable only inside the REPL; sub-LLM tools use standard tool-calling. When using the sandbox backend, the sandbox and worker are started eagerly during `setup_state`, and package installs are skipped when the package is already importable in the image. Environments can pre-set `state["rlm_fs_root_remote"]` (and optionally `state["rlm_control_dir_remote"]`) before calling `super().setup_state` to point the worker at an existing filesystem path in the sandbox. For further customization, override `get_sandbox_request`, `on_sandbox_ready`, or `customize_worker_script` on `RLMEnv`.
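
The tool-merging rule described above (fixed tools, then shared tools, then role-specific tools, deduplicated by name) can be sketched in isolation. The helper name and tool names below are hypothetical, not part of the `RLMEnv` API:

```python
# Sketch of the documented tool ordering: fixed tools -> shared
# tools -> role-specific tools, deduplicating by name so the first
# occurrence wins. Function and tool names are illustrative only.
def merge_tools(fixed: list[str], shared: list[str], role_specific: list[str]) -> list[str]:
    seen: set[str] = set()
    ordered: list[str] = []
    for group in (fixed, shared, role_specific):
        for name in group:
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered

# Fixed root tools like `llm_batch` stay first and cannot be shadowed.
print(merge_tools(["llm_batch"], ["search", "llm_batch"], ["read_file", "search"]))
# → ['llm_batch', 'search', 'read_file']
```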
4 changes: 2 additions & 2 deletions environments/AGENTS.md
Expand Up @@ -799,6 +799,6 @@ These require additional dependencies installed via extras (e.g., `uv add 'verif
Newer and more experimental environment classes include:

- **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization. Also accepts retry-tuning parameters (e.g., `max_retries`) and connection-pooling parameters (e.g., `sandbox_client_max_workers`) via `SandboxMixin`
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Execution supports both local and sandbox backends via `execution_backend` (`"local"` default, `"sandbox"` to run the REPL inside a Prime Sandbox). Context is still filesystem-based: a provided `context_dir` is copied into the working directory, or legacy JSON-serializable `context` data is written to `context.json`/`context.txt`. The RLM scaffolding prompt (filesystem availability note, REPL workflow, tool docs) is injected into the first user message wrapped in `<RLM_SCAFFOLDING>...</RLM_SCAFFOLDING>`, preserving any external system prompt; the model-visible prompt is stored in `state["prompt"]`, while the original input prompt is preserved in `state["raw_prompt"]`. The REPL language is configurable via `repl_language` (default: `bash`); use `repl_language="python"` to retain the Python REPL. Bash mode uses `call_bash_repl` and behaves like a terminal; Python mode uses `call_python_repl`. Sub-LLM and root-tool interception for sandboxes is routed through a Prime Tunnel unless `interception_url` is provided. Tooling can be split via `tools` (shared), `root_tools` (REPL-only), and `sub_tools` (sub-LLM tools). Fixed root tools like `llm_batch` are always present and cannot be overridden. Tool ordering is fixed tools → shared tools → role-specific tools, with per-list deduplication by name. Root tools are callable only inside the REPL; sub-LLM tools use standard tool-calling.
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Execution supports both local and sandbox backends via `execution_backend` (`"local"` default, `"sandbox"` to run the REPL inside a Prime Sandbox). Context is still filesystem-based: a provided `context_dir` is copied into the working directory, or legacy JSON-serializable `context` data is written to `context.json`/`context.txt`. The RLM scaffolding prompt (filesystem availability note, REPL workflow, tool docs) is injected into the first user message wrapped in `<RLM_SCAFFOLDING>...</RLM_SCAFFOLDING>`, preserving any external system prompt; the model-visible prompt is stored in `state["prompt"]`, while the original input prompt is preserved in `state["raw_prompt"]`. The REPL language is configurable via `repl_language` (default: `bash`); use `repl_language="python"` to retain the Python REPL. Bash mode uses `call_bash_repl` and behaves like a terminal; Python mode uses `call_python_repl`. Sub-LLM and root-tool interception for sandboxes is routed through a Prime Tunnel unless `interception_url` is provided. Tooling can be split via `tools` (shared), `root_tools` (REPL-only), and `sub_tools` (sub-LLM tools). Fixed root tools like `llm_batch` are always present and cannot be overridden. Tool ordering is fixed tools → shared tools → role-specific tools, with per-list deduplication by name. Root tools are callable only inside the REPL; sub-LLM tools use standard tool-calling. When using the sandbox backend, the sandbox and worker are started eagerly during `setup_state`, and package installs are skipped when the package is already importable in the image. Environments can pre-set `state["rlm_fs_root_remote"]` (and optionally `state["rlm_control_dir_remote"]`) before calling `super().setup_state` to point the worker at an existing filesystem path in the sandbox. For further customization, override `get_sandbox_request`, `on_sandbox_ready`, or `customize_worker_script` on `RLMEnv`.
123 changes: 92 additions & 31 deletions environments/opencode_harbor/opencode_harbor.py
@@ -1,3 +1,4 @@
import json
import logging
from pathlib import Path

Expand All @@ -6,48 +7,74 @@
logger = logging.getLogger("verifiers.envs.OpenCodeHarborEnv")


def _build_run_command(agent_workdir: str) -> str:
def _build_opencode_config(
disabled_tools: list[str] | None = None,
system_prompt_path: str | None = None,
) -> str:
config: dict = {
"${SCHEMA_DOLLAR}schema": "https://opencode.ai/config.json",
"provider": {
"intercepted": {
"npm": "@ai-sdk/openai-compatible",
"name": "Intercepted",
"options": {
"baseURL": "$OPENAI_BASE_URL",
"apiKey": "intercepted",
"timeout": 600000,
},
"models": {
"model": {
"name": "Intercepted Model",
"modalities": {"input": ["text", "image"], "output": ["text"]},
}
},
}
},
"model": "intercepted/model",
}

# Add agent config if we have custom prompt or disabled tools
if system_prompt_path or disabled_tools:
build_config: dict = {}

if system_prompt_path:
build_config["prompt"] = "{file:" + system_prompt_path + "}"

if disabled_tools:
build_config["tools"] = {tool: False for tool in disabled_tools}

config["agent"] = {"build": build_config}

return json.dumps(config, indent=2)
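
The `agent` sub-config this helper adds can be re-created standalone. This is a hypothetical sketch, not the helper itself; the path and tool names are the defaults used elsewhere in this PR:

```python
import json

# Re-creation of the "agent" block _build_opencode_config emits when
# both a prompt file and disabled tools are supplied. OpenCode's
# "{file:...}" syntax loads the prompt from a file at runtime.
system_prompt_path = "/opencode/prompt.txt"
disabled_tools = ["webfetch", "question"]

build_config = {
    "prompt": "{file:" + system_prompt_path + "}",
    # Mapping each tool name to False disables it for the agent.
    "tools": {tool: False for tool in disabled_tools},
}
print(json.dumps({"agent": {"build": build_config}}, indent=2))
```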


def _build_run_command(
agent_workdir: str,
disabled_tools: list[str] | None = None,
has_system_prompt: bool = False,
) -> str:
# Path where we'll upload the system prompt in the sandbox
system_prompt_sandbox_path = "/opencode/prompt.txt" if has_system_prompt else None
config_json = _build_opencode_config(disabled_tools, system_prompt_sandbox_path)

return f"""
set -e

echo "Starting OpenCode agent..."
echo "Base URL: $OPENAI_BASE_URL"

apt-get update && apt-get install -y curl

# TODO: Add opencode to prebuilt images so we don't need to install at runtime
curl -fsSL https://opencode.ai/install | bash
export PATH="$HOME/.opencode/bin:$PATH"

# Create opencode config directory
mkdir -p ~/.config/opencode

# Create opencode.json config with intercepted provider
# Preserve JSON schema key literal in unquoted heredoc while still expanding
# OPENAI_BASE_URL.
SCHEMA_DOLLAR='$'

# Create opencode.json config
cat > ~/.config/opencode/opencode.json << EOFCONFIG
{{
"\\$schema": "https://opencode.ai/config.json",
"provider": {{
"intercepted": {{
"npm": "@ai-sdk/openai-compatible",
"name": "Intercepted",
"options": {{
"baseURL": "$OPENAI_BASE_URL",
"apiKey": "intercepted",
"timeout": 600000
}},
"models": {{
"model": {{
"name": "Intercepted Model",
"modalities": {{
"input": ["text", "image"],
"output": ["text"]
}}
}}
}}
}}
}},
"model": "intercepted/model"
}}
{config_json}
EOFCONFIG

mkdir -p /logs/agent
Expand All @@ -65,23 +92,55 @@ def __init__(
tasks: list[str] | None = None,
agent_workdir: str = "/app",
docker_image: str = "python:3.11-slim",
system_prompt_path: str | Path | None = None,
disabled_tools: list[str] | None = None,
**kwargs,
):
self.system_prompt_path = (
Path(system_prompt_path) if system_prompt_path else None
)
self.disabled_tools = disabled_tools

super().__init__(
run_command=_build_run_command(agent_workdir),
run_command=_build_run_command(
agent_workdir,
disabled_tools=disabled_tools,
has_system_prompt=system_prompt_path is not None,
),
dataset_path=dataset_path,
tasks=tasks,
agent_workdir=agent_workdir,
docker_image=docker_image,
**kwargs,
)

async def post_sandbox_setup(self, state) -> None:
"""Upload Harbor task assets and optional system prompt after sandbox creation."""
await super().post_sandbox_setup(state)

if self.system_prompt_path:
if not self.system_prompt_path.exists():
raise FileNotFoundError(
f"System prompt file not found: {self.system_prompt_path}"
)

sandbox_id = state["sandbox_id"]
await self.sandbox_client.execute_command(
sandbox_id, "mkdir -p /opencode", working_dir=None
)
await self.sandbox_client.upload_file(
sandbox_id, "/opencode/prompt.txt", str(self.system_prompt_path)
)
logger.info(f"Uploaded system prompt from {self.system_prompt_path}")


def load_environment(
dataset_path: str | Path = Path(__file__).parent / "tasks",
tasks: list[str] | None = None,
agent_workdir: str = "/app",
docker_image: str = "python:3.11-slim",
system_prompt_path: str | Path | None = Path(__file__).parent / "prompt.txt",
disabled_tools: list[str] | None = ["webfetch", "question"],
timeout_seconds: float = 900.0,
cpu_cores: int = 2,
memory_gb: int = 4,
Expand All @@ -94,6 +153,8 @@ def load_environment(
tasks=tasks,
agent_workdir=agent_workdir,
docker_image=docker_image,
system_prompt_path=system_prompt_path,
disabled_tools=disabled_tools,
timeout_seconds=timeout_seconds,
cpu_cores=cpu_cores,
memory_gb=memory_gb,
Expand Down
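
The `SCHEMA_DOLLAR` trick in the run command above relies on unquoted-heredoc expansion, and can be demonstrated in isolation. The base URL value below is illustrative:

```shell
# Demo of the unquoted-heredoc trick: SCHEMA_DOLLAR holds a literal
# '$', so "${SCHEMA_DOLLAR}schema" survives as "$schema" in the
# output while $OPENAI_BASE_URL is still expanded normally.
SCHEMA_DOLLAR='$'
OPENAI_BASE_URL='http://localhost:8000/v1'
cat << EOF
{"${SCHEMA_DOLLAR}schema": "https://opencode.ai/config.json", "baseURL": "$OPENAI_BASE_URL"}
EOF
```

Quoting the delimiter (`<< 'EOF'`) would suppress all expansion, which is why the config heredoc must stay unquoted and escape the schema key via the variable instead.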