PrimeIntellect-ai · rasdani · Jan 28, 2026 · Jan 25, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -96,7 +96,6 @@ View local evaluation results in the terminal UI:
 ```bash
 prime eval tui
 ```
-In the TUI, press `c` to open Copy Mode for prompt/completion text; highlight and press `c` again to copy.
 
 To publish the environment to the [Environments Hub](https://app.primeintellect.ai/dashboard/environments?ex_sort=most_stars), do:
 ```bash
@@ -120,4 +119,4 @@ prime eval run primeintellect/math-python
 
 **[API Reference](reference.md)** — Understanding the API and data structures
 
-**[FAQs](faqs.md)** - Other frequently asked questions.
+**[FAQs](faqs.md)** - Other frequently asked questions.
diff --git a/docs/environments.md b/docs/environments.md
@@ -629,12 +629,14 @@ class MyGameEnv(vf.MultiTurnEnv):
     @vf.cleanup
     async def save_game_log(self, state: vf.State):
         await log_game_result(state["game_id"], state["score"])
-    
+
     @vf.teardown
     async def close_connections(self):
         await self.db_connection.close()
 ```
 
+> **Important:** Cleanup methods should be **idempotent**—safe to call multiple times—and handle errors gracefully. This ensures correct behavior when rollouts are cancelled or interrupted, and that cleanup completes even when resources are in unexpected states.
+
 ### Signaling Early Termination
 
 To end a rollout from within `env_response` (e.g., when the game ends), set `state["final_env_response"]`:
@@ -791,6 +793,6 @@ These require additional dependencies installed via extras (e.g., `uv add 'verif
 Newer and more experimental environment classes include:
 
 - **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
-- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization
+- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization. Also accepts retry tuning parameters (such as `max_retries`) and connection pooling parameters (such as `sandbox_client_max_workers`) via `SandboxMixin`
 - **`HarborEnv`** — loads Harbor-format agent benchmark tasks
 - **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Execution supports both local and sandbox backends via `execution_backend` (`"local"` default, `"sandbox"` to run the REPL inside a Prime Sandbox). Context is still filesystem-based: a provided `context_dir` is copied into the working directory, or legacy JSON-serializable `context` data is written to `context.json`/`context.txt`. The RLM scaffolding prompt (filesystem summary, REPL workflow, tool docs) is injected into the first user message wrapped in `<RLM_SCAFFOLDING>...</RLM_SCAFFOLDING>`, preserving any external system prompt. The REPL language is configurable via `repl_language` (default: `bash`); use `repl_language="python"` to retain the Python REPL. Bash mode uses `call_bash_repl` and behaves like a terminal; Python mode uses `call_python_repl`. Local Python mode uses a best-effort filesystem jail that restricts access to the working directory; sandboxes run without the jail. Sub-LLM and root-tool interception for sandboxes is routed through a Prime Tunnel unless `interception_url` is provided. Tooling can be split via `tools` (shared), `root_tools` (REPL-only), and `sub_tools` (sub-LLM tools). Fixed root tools like `llm_batch` are always present and cannot be overridden. Tool ordering is fixed tools → shared tools → role-specific tools, with per-list deduplication by name. Root tools are callable only inside the REPL; sub-LLM tools use standard tool-calling.
diff --git a/docs/reference.md b/docs/reference.md
@@ -628,7 +628,7 @@ async def early_cleanup(self, state: State) -> None:
     ...
 ```
 
-Mark a method as a rollout cleanup handler.
+Mark a method as a rollout cleanup handler. Cleanup methods should be **idempotent**—safe to call multiple times—and handle errors gracefully to ensure cleanup completes even when resources are in unexpected states.
 
 ### @vf.teardown
 

diff --git a/environments/AGENTS.md b/environments/AGENTS.md
@@ -633,12 +633,14 @@ class MyGameEnv(vf.MultiTurnEnv):
     @vf.cleanup
     async def save_game_log(self, state: vf.State):
         await log_game_result(state["game_id"], state["score"])
-    
+
     @vf.teardown
     async def close_connections(self):
         await self.db_connection.close()
 ```
 
+> **Important:** Cleanup methods should be **idempotent**—safe to call multiple times—and handle errors gracefully. This ensures correct behavior when rollouts are cancelled or interrupted, and that cleanup completes even when resources are in unexpected states.
+
 ### Signaling Early Termination
 
 To end a rollout from within `env_response` (e.g., when the game ends), set `state["final_env_response"]`:
@@ -795,6 +797,6 @@ These require additional dependencies installed via extras (e.g., `uv add 'verif
 Newer and more experimental environment classes include:
 
 - **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
-- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization
+- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests. Accepts sandbox configuration parameters including `docker_image`, `cpu_cores`, `memory_gb`, `disk_size_gb`, `gpu_count`, `timeout_minutes`, `environment_vars`, and `labels` for sandbox categorization. Also accepts retry tuning parameters (such as `max_retries`) and connection pooling parameters (such as `sandbox_client_max_workers`) via `SandboxMixin`
 - **`HarborEnv`** — loads Harbor-format agent benchmark tasks
 - **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Execution supports both local and sandbox backends via `execution_backend` (`"local"` default, `"sandbox"` to run the REPL inside a Prime Sandbox). Context is still filesystem-based: a provided `context_dir` is copied into the working directory, or legacy JSON-serializable `context` data is written to `context.json`/`context.txt`. The RLM scaffolding prompt (filesystem summary, REPL workflow, tool docs) is injected into the first user message wrapped in `<RLM_SCAFFOLDING>...</RLM_SCAFFOLDING>`, preserving any external system prompt. The REPL language is configurable via `repl_language` (default: `bash`); use `repl_language="python"` to retain the Python REPL. Bash mode uses `call_bash_repl` and behaves like a terminal; Python mode uses `call_python_repl`. Local Python mode uses a best-effort filesystem jail that restricts access to the working directory; sandboxes run without the jail. Sub-LLM and root-tool interception for sandboxes is routed through a Prime Tunnel unless `interception_url` is provided. Tooling can be split via `tools` (shared), `root_tools` (REPL-only), and `sub_tools` (sub-LLM tools). Fixed root tools like `llm_batch` are always present and cannot be overridden. Tool ordering is fixed tools → shared tools → role-specific tools, with per-list deduplication by name. Root tools are callable only inside the REPL; sub-LLM tools use standard tool-calling.
diff --git a/verifiers/envs/experimental/__init__.py b/verifiers/envs/experimental/__init__.py
@@ -0,0 +1,3 @@
+from verifiers.envs.experimental.sandbox_mixin import SandboxMixin
+
+__all__ = ["SandboxMixin"]
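
The docs change in this diff asks that cleanup methods be idempotent and handle errors gracefully. A minimal sketch of that pattern follows, written with plain `asyncio` so it runs without the `verifiers` package. `FakeConnection`, `cleanup`, and the `cleanup_errors` key are hypothetical names used only for illustration; they are not part of the library, and a real handler would carry the `@vf.cleanup` decorator shown in the diff.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a real resource such as a DB connection."""

    def __init__(self) -> None:
        self.closed = False
        self.close_calls = 0

    async def close(self) -> None:
        self.close_calls += 1
        if self.closed:
            # Simulate a resource that complains when closed twice.
            raise RuntimeError("already closed")
        self.closed = True


async def cleanup(state: dict) -> None:
    """Idempotent cleanup: safe to call repeatedly and never raises."""
    # Pop the resource so a second call finds nothing left to release.
    conn = state.pop("connection", None)
    if conn is None:
        return
    try:
        await conn.close()
    except Exception as exc:
        # Record the failure instead of propagating it, so any remaining
        # cleanup work can still run when a resource is in an odd state.
        state.setdefault("cleanup_errors", []).append(repr(exc))


async def main() -> dict:
    conn = FakeConnection()
    state = {"connection": conn}
    await cleanup(state)
    await cleanup(state)  # second call is a no-op
    return {"close_calls": conn.close_calls, "state": state}


result = asyncio.run(main())
```

Popping the resource from `state` before closing it is one way to make a repeated call harmless, which matters if a cancelled or interrupted rollout triggers cleanup more than once. Collecting errors in `state` rather than raising is a design choice for this sketch; logging them would serve equally well.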