Conversation
```python
print(
    f"event_loop_lag: med - {print_time(float(med_lag))}, p90 - {print_time(float(p90_lag))}, max - {print_time(float(max_lag))}"
)
```
Empty event loop lags array causes crash
The code checks if event_loop_lags is not None: but doesn't handle the case where event_loop_lags is an empty list. If the evaluation completes faster than the 0.1-second measure_interval before any lag measurements are recorded, get_lags() returns an empty list. Calling np.max(), np.median(), or np.percentile() on an empty numpy array raises a ValueError because these reduction operations cannot handle zero-size arrays. The condition needs to also check that the list is non-empty before computing statistics.
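A minimal sketch of the guard the comment asks for (`summarize_lags` is a hypothetical helper name, not the PR's actual function): a falsiness check covers both `None` and the empty list, so the numpy reductions are only reached when at least one lag sample exists.

```python
import numpy as np

def summarize_lags(event_loop_lags):
    """Hypothetical guard: skip the summary when no lag samples were
    recorded (e.g. the eval finished before the first measure_interval)."""
    if not event_loop_lags:  # covers both None and the empty list
        return None
    arr = np.asarray(event_loop_lags, dtype=float)
    return {
        "med": float(np.median(arr)),
        "p90": float(np.percentile(arr, 90)),
        "max": float(np.max(arr)),
    }

print(summarize_lags([]))  # → None (no crash, unlike np.max([]))
print(summarize_lags([0.01, 0.05, 0.2]))
```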
Per-tool metrics not tracked for subclass-added tools
The ToolMonitorRubric is created in ToolEnv.__init__ before subclasses add their tools. When SandboxEnv or PythonEnv call super().__init__(), self.tools is empty, so ToolMonitorRubric captures an empty tool_names list. Tools like bash and python are added via add_tool() only after the rubric is created, so their per-tool call counts won't be tracked. This is a regression from the old pattern where ToolRubric was explicitly created after the environment was fully initialized in math_python.py.
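The init-ordering problem can be reduced to a few lines (class and method names here mirror the comment but the bodies are illustrative, not the PR's actual code): a rubric that snapshots tool names inside the base `__init__` never sees tools a subclass registers after `super().__init__()` returns.

```python
class ToolMonitorRubric:
    def __init__(self, tool_names):
        self.tool_names = list(tool_names)  # snapshot taken at construction

class ToolEnv:
    def __init__(self):
        self.tools = {}
        # Rubric created before subclasses get a chance to add tools.
        self.rubric = ToolMonitorRubric(self.tools.keys())

    def add_tool(self, name, fn):
        self.tools[name] = fn

class PythonEnv(ToolEnv):
    def __init__(self):
        super().__init__()          # rubric snapshot is already empty here
        self.add_tool("python", lambda code: ...)  # too late to be tracked

env = PythonEnv()
print(env.rubric.tool_names)  # → [] — "python" is missing from the rubric
```

One fix, as the comment suggests, is to defer the snapshot (or have the rubric read `env.tools` lazily at scoring time) so subclass-added tools are included.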
Referenced locations (in 0d304d8):

- `verifiers/envs/tool_env.py`, lines 77 to 78
- `verifiers/envs/sandbox_env.py`, lines 149 to 150
- `verifiers/envs/python_env.py`, lines 201 to 206
Description

Logs event loop lag during `vf-eval` as an env monitoring util to detect whether an env overloads the main event loop, thereby degrading performance.

```shell
# low event loop lag
uv run vf-eval gsm8k -n1 -r1 -v

# high event loop lag (>1k parallel reqs)
uv run vf-eval gsm8k -n-1 -r1 -c-1 -v
```

Type of Change
Testing

`uv run pytest` locally.

Checklist
Additional Notes
Note
Introduces performance monitoring in evaluation and standardized metrics across environments.
- `EventLoopLagMonitor` (`async_utils`) integrated into `run_evaluation` and `print_results` to report event loop lag and detailed generation/scoring timings; adds a `print_time()` formatter
- `MultiTurnEnv`: `num_turns`
- `ToolEnv`: `total_tool_calls` and per-tool call counts (replaces the removed `ToolRubric`)
- `SandboxEnv`: `sandbox_ready_wait_time`, `sandbox_command_execution_time`
- `PythonEnv`: `python_ready_wait_time`
- `Environment.add_rubric()` stacks rubrics automatically
- Removed the `ToolRubric` export/file; `math_python` stops manually adding it, since `ToolEnv` now tracks tool metrics by default
- Added `docs/evaluation.md`, enriched `docs/environments.md` (monitor rubrics), updated navigation and RTD redirect
- `num_turns` metric in expectations

Written by Cursor Bugbot for commit 0d304d8. This will update automatically on new commits.
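The core idea behind an event-loop-lag monitor can be sketched in a few lines of asyncio (the function and task names below are illustrative, not the PR's actual `EventLoopLagMonitor` API): repeatedly `await asyncio.sleep(interval)` and record how far past `interval` the wakeup actually lands; any overshoot means something blocked the loop.

```python
import asyncio
import time

async def measure_event_loop_lag(measure_interval: float = 0.1,
                                 duration: float = 0.5) -> list[float]:
    """Sample event loop lag: the amount by which each timed sleep
    overshoots its requested interval."""
    lags: list[float] = []
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        await asyncio.sleep(measure_interval)
        lags.append(time.perf_counter() - start - measure_interval)
    return lags

async def main() -> list[float]:
    async def blocker():
        # Synchronous sleeps block the event loop, inflating measured lag.
        for _ in range(5):
            time.sleep(0.05)
            await asyncio.sleep(0)  # yield briefly between blocking chunks
    lag_task = asyncio.create_task(measure_event_loop_lag())
    await blocker()
    return await lag_task

lags = asyncio.run(main())
print(len(lags) > 0)  # → True
print(max(lags))      # nonzero when work blocked the loop
```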