add per-variant endpoint concurrency with least-loaded dispatch #895
Conversation
```python
    dispatchers[endpoint_id] = EndpointDispatcher(slots)
else:
    dispatchers[endpoint_id] = NullEndpointDispatcher(resolved)
```
Dispatcher uses wrong endpoint configs
High Severity
`_build_dispatchers` builds one dispatcher per `endpoint_id` using the first `EvalConfig` seen, then reuses it for all evals with that `endpoint_id`. If later evals have a different `client_config` (a different endpoints source, keys/URLs, or overrides that change `endpoint_configs`), `environment.evaluate` runs requests using `slot.config` from the wrong eval.
Additional Locations (2)
```python
        ),
    )
    for cfg, ep in zip(resolved, endpoint_cfgs)
]
```
Variant zip may drop configurations
Medium Severity
`_build_dispatchers` pairs `resolved = resolve_client_configs(ec.client_config)` with `endpoint_cfgs = ec.client_config.endpoint_configs` via `zip(resolved, endpoint_cfgs)`. If `resolve_client_configs` ever returns a different length or order than `endpoint_configs`, variants can be silently dropped or mispaired, producing incorrect `max_concurrent` assignments per `ClientConfig`.
…-or-nothing concurrency
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
    sampling_args,
    max_retries=max_retries,
    state_columns=state_columns,
)
```
Dispatcher path leaks HTTP clients in non-server mode
Low Severity
When the dispatcher path is used and `self.env_client` is `None` (non-server mode), each call to `_dispatched_rollout`/`_dispatched_group` passes `slot.config` (a `ClientConfig`) to `run_rollout`/`run_group`, which calls `resolve_client(slot.config)`, creating a new HTTP client per rollout. These clients are never closed. The legacy path avoids this by pre-creating clients in `local_endpoint_clients` and closing them in the `finally` block. The standard eval flow uses server mode and isn't affected, but the `generate()` public API allows this combination.
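A sketch of plugging the leak along the lines the comment suggests: resolve one client per dispatched slot up front and close them in a `finally` block, mirroring how the legacy path handles `local_endpoint_clients`. `FakeClient` and `dispatched_rollouts` are stand-ins for illustration; the real fix would call `resolve_client(slot.config)` and the actual rollout functions.

```python
import asyncio

class FakeClient:
    """Stand-in for an HTTP client with an async close method."""
    def __init__(self):
        self.closed = False
    async def aclose(self):
        self.closed = True

async def dispatched_rollouts(slots, do_rollout):
    # One client per slot, created up front (resolve_client in the PR).
    clients = [FakeClient() for _ in slots]
    try:
        return [await do_rollout(c) for c in clients]
    finally:
        # Guarantee cleanup even if a rollout raises.
        for c in clients:
            await c.aclose()
```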


Description
When multiple evals target the same endpoint (e.g., 8 vLLM nodes serving the same model), each eval creates its own semaphore independently, so there is no shared concurrency control; blind round-robin ignores node load and causes head-of-line blocking when one node is slower.
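The least-loaded idea behind the PR can be sketched in a few lines — this is assumed semantics for illustration, not the PR's actual `LeastLoadedDispatcher` API: each variant gets a shared free-slot counter, and `acquire()` picks the variant with the most free capacity, blocking when no variant can satisfy the request.

```python
import asyncio

class LeastLoadedDispatcher:
    def __init__(self, limits):
        self.free = list(limits)           # free slots per variant
        self.cond = asyncio.Condition()

    async def acquire(self, count=1):
        async with self.cond:
            while True:
                # Pick the variant with the most free slots (ties -> lowest index).
                i = max(range(len(self.free)), key=lambda j: self.free[j])
                if self.free[i] >= count:
                    self.free[i] -= count
                    return i
                await self.cond.wait()     # block until a release frees capacity

    async def release(self, variant, count=1):
        async with self.cond:
            self.free[variant] += count
            self.cond.notify_all()
```

Note that `acquire(count=len(group))` naturally lands a whole group on a single variant, matching the grouped-scoring behavior described in the overview below, and implies groups larger than every variant's limit can never be placed.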
Type of Change
Testing
`uv run pytest` locally.
Checklist
Additional Notes
Note
Medium Risk
Touches core evaluation scheduling/concurrency and changes how multi-variant endpoints are dispatched, which can impact throughput and fairness if misconfigured (e.g., incorrect `max_concurrent` or oversized rollout groups).

Overview
Adds per-variant concurrency limiting for endpoint registry variants via a new optional `max_concurrent` field, enabling least-loaded routing instead of round-robin when multiple replicas share an `endpoint_id`.

Introduces `LeastLoadedDispatcher`/`EndpointSlot` and wires them through `run_evaluations` → `_build_dispatchers()` → `Environment.generate()`/`evaluate()` so all evals targeting the same `endpoint_id` share global per-variant capacity; grouped scoring now reserves `count=len(group)` slots on a single variant and rejects groups larger than any variant.

Extends endpoint loading and CLI config (`eval.py`) to parse/validate `max_concurrent` from TOML/Python registries, enforces all-or-nothing configuration across variants, ignores `--max-concurrent` when variant limits are active, adds tests for dispatcher behavior and registry parsing, and updates docs/skill guidance accordingly.

Written by Cursor Bugbot for commit a95ddd5. This will update automatically on new commits.
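As a concrete illustration of the registry change described above, a TOML endpoint entry might carry the new field like this — the table layout and key names other than `max_concurrent` are hypothetical, so check the PR's endpoint-loading code for the actual schema:

```toml
# Two replicas sharing one endpoint_id; the dispatcher routes to the
# least-loaded one instead of round-robin.
[[endpoints.my-model]]
api_base = "http://node1:8000/v1"
max_concurrent = 64    # new optional per-variant limit

[[endpoints.my-model]]
api_base = "http://node2:8000/v1"
max_concurrent = 64    # all-or-nothing: every variant sets it, or none do
```

Per the description, when these per-variant limits are active the CLI's `--max-concurrent` flag is ignored.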