add per-variant endpoint concurrency with least-loaded dispatch #895
Conversation
```python
    dispatchers[endpoint_id] = EndpointDispatcher(slots)
else:
    dispatchers[endpoint_id] = NullEndpointDispatcher(resolved)
```
Dispatcher uses wrong endpoint configs
High Severity
`_build_dispatchers` builds one dispatcher per `endpoint_id` using the first `EvalConfig` seen, then reuses it for all evals with that `endpoint_id`. If later evals have a different `client_config` (a different endpoints source, keys/URLs, or overrides that change `endpoint_configs`), `environment.evaluate` runs requests using `slot.config` from the wrong eval.
Additional Locations (2)
```python
        ),
    )
    for cfg, ep in zip(resolved, endpoint_cfgs)
]
```
Variant zip may drop configurations
Medium Severity
`_build_dispatchers` pairs `resolved = resolve_client_configs(ec.client_config)` with `endpoint_cfgs = ec.client_config.endpoint_configs` via `zip(resolved, endpoint_cfgs)`. If `resolve_client_configs` ever returns a different length or order than `endpoint_configs`, variants can be silently dropped or mispaired, producing incorrect `max_concurrent` assignments per `ClientConfig`.
…-or-nothing concurrency
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
    sampling_args,
    max_retries=max_retries,
    state_columns=state_columns,
)
```
Dispatcher path leaks HTTP clients in non-server mode
Low Severity
When the dispatcher path is used and `self.env_client` is `None` (non-server mode), each call to `_dispatched_rollout`/`_dispatched_group` passes `slot.config` (a `ClientConfig`) to `run_rollout`/`run_group`, which calls `resolve_client(slot.config)`, creating a new HTTP client per rollout. These clients are never closed. The legacy path avoids this by pre-creating clients in `local_endpoint_clients` and closing them in the `finally` block. The standard eval flow uses server mode and isn't affected, but the `generate()` public API allows this combination.
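A sketch of plugging the leak along the lines the comment suggests: resolve one client per dispatched slot up front and close them in a `finally` block, mirroring how the legacy path handles `local_endpoint_clients`. `FakeClient` and `dispatched_rollouts` are stand-ins for illustration; the real fix would call `resolve_client(slot.config)` and the actual rollout functions.

```python
import asyncio

class FakeClient:
    """Stand-in for an HTTP client with an async close method."""
    def __init__(self):
        self.closed = False
    async def aclose(self):
        self.closed = True

async def dispatched_rollouts(slots, do_rollout):
    # One client per slot, created up front (resolve_client in the PR).
    clients = [FakeClient() for _ in slots]
    try:
        return [await do_rollout(c) for c in clients]
    finally:
        # Guarantee cleanup even if a rollout raises.
        for c in clients:
            await c.aclose()
```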


Description
When multiple evals target the same endpoint (e.g., 8 vLLM nodes serving the same model), each eval creates its own semaphore independently, so there is no shared concurrency control; blind round-robin ignores node load and causes head-of-line blocking when one node is slower.
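The least-loaded idea behind the PR can be sketched in a few lines — this is assumed semantics for illustration, not the PR's actual `LeastLoadedDispatcher` API: each variant gets a shared free-slot counter, and `acquire()` picks the variant with the most free capacity, blocking when no variant can satisfy the request.

```python
import asyncio

class LeastLoadedDispatcher:
    def __init__(self, limits):
        self.free = list(limits)           # free slots per variant
        self.cond = asyncio.Condition()

    async def acquire(self, count=1):
        async with self.cond:
            while True:
                # Pick the variant with the most free slots (ties -> lowest index).
                i = max(range(len(self.free)), key=lambda j: self.free[j])
                if self.free[i] >= count:
                    self.free[i] -= count
                    return i
                await self.cond.wait()     # block until a release frees capacity

    async def release(self, variant, count=1):
        async with self.cond:
            self.free[variant] += count
            self.cond.notify_all()
```

Note that `acquire(count=len(group))` naturally lands a whole group on a single variant, matching the grouped-scoring behavior described in the overview below, and implies groups larger than every variant's limit can never be placed.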
Type of Change
Testing
`uv run pytest` locally.
Checklist
Additional Notes
Note
Medium Risk
Touches core evaluation scheduling/concurrency and changes how multi-variant endpoints are dispatched, which can impact throughput and fairness if misconfigured (e.g., incorrect `max_concurrent` or oversized rollout groups).

Overview
Adds per-variant concurrency limiting for endpoint registry variants via a new optional `max_concurrent` field, enabling least-loaded routing instead of round-robin when multiple replicas share an `endpoint_id`.

Introduces `LeastLoadedDispatcher`/`EndpointSlot` and wires them through `run_evaluations` → `_build_dispatchers()` → `Environment.generate()`/`evaluate()` so all evals targeting the same `endpoint_id` share global per-variant capacity; grouped scoring now reserves `count=len(group)` slots on a single variant and rejects groups larger than any variant.

Extends endpoint loading and CLI config (`eval.py`) to parse/validate `max_concurrent` from TOML/Python registries, enforces all-or-nothing configuration across variants, ignores `--max-concurrent` when variant limits are active, adds tests for dispatcher behavior and registry parsing, and updates docs/skill guidance accordingly.

Written by Cursor Bugbot for commit a95ddd5. This will update automatically on new commits.
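As a concrete illustration of the registry change described above, a TOML endpoint entry might carry the new field like this — the table layout and key names other than `max_concurrent` are hypothetical, so check the PR's endpoint-loading code for the actual schema:

```toml
# Two replicas sharing one endpoint_id; the dispatcher routes to the
# least-loaded one instead of round-robin.
[[endpoints.my-model]]
api_base = "http://node1:8000/v1"
max_concurrent = 64    # new optional per-variant limit

[[endpoints.my-model]]
api_base = "http://node2:8000/v1"
max_concurrent = 64    # all-or-nothing: every variant sets it, or none do
```

Per the description, when these per-variant limits are active the CLI's `--max-concurrent` flag is ignored.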