add per-variant endpoint concurrency with least-loaded dispatch#895

Open
hallerite wants to merge 9 commits into main from hallerite/per-variant-concurrency

Conversation

@hallerite hallerite commented Feb 11, 2026

Description

When multiple evals target the same endpoint (e.g., 8 vLLM nodes serving the same model), each eval creates its own semaphore independently. This means there's no shared concurrency control, and blind round-robin ignores node load, which causes head-of-line blocking when one node is slower.
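The shared-limiter-plus-least-loaded behavior described above can be sketched roughly as follows. The class names `LeastLoadedDispatcher` and `EndpointSlot` are borrowed from the PR overview, but the body is an illustrative guess at the semantics, not the PR's actual code:

```python
import asyncio

class EndpointSlot:
    """One replica behind a shared endpoint_id, with its own concurrency cap."""
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.max_concurrent = max_concurrent
        self.in_flight = 0

class LeastLoadedDispatcher:
    """One instance shared by every eval targeting the same endpoint_id."""
    def __init__(self, slots: list[EndpointSlot]):
        self.slots = slots
        self._cond = asyncio.Condition()

    async def acquire(self, count: int = 1) -> EndpointSlot:
        # A rollout group must fit on a single variant, so reject groups
        # larger than the biggest per-variant limit up front.
        if count > max(s.max_concurrent for s in self.slots):
            raise ValueError("group exceeds every variant's max_concurrent")
        async with self._cond:
            while True:
                # Least-loaded: pick the slot with the most free capacity.
                slot = max(self.slots, key=lambda s: s.max_concurrent - s.in_flight)
                if slot.max_concurrent - slot.in_flight >= count:
                    slot.in_flight += count
                    return slot
                await self._cond.wait()

    async def release(self, slot: EndpointSlot, count: int = 1) -> None:
        async with self._cond:
            slot.in_flight -= count
            self._cond.notify_all()
```

Because every eval shares the one dispatcher per endpoint_id, a slow node simply stops winning the `max()` and traffic drains to its siblings instead of queueing behind it.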

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes.

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Touches core evaluation scheduling/concurrency and changes how multi-variant endpoints are dispatched, which can impact throughput and fairness if misconfigured (e.g., incorrect max_concurrent or oversized rollout groups).

Overview
Adds per-variant concurrency limiting for endpoint registry variants via a new optional max_concurrent field, enabling least-loaded routing instead of round-robin when multiple replicas share an endpoint_id.

Introduces LeastLoadedDispatcher/EndpointSlot and wires it through run_evaluations → _build_dispatchers() → Environment.generate()/evaluate() so all evals targeting the same endpoint_id share global per-variant capacity; grouped scoring now reserves count=len(group) slots on a single variant and rejects groups larger than any variant.

Extends endpoint loading and CLI config (eval.py) to parse/validate max_concurrent from TOML/Python registries, enforces all-or-nothing configuration across variants, ignores --max-concurrent when variant limits are active, adds tests for dispatcher behavior and registry parsing, and updates docs/skill guidance accordingly.

Written by Cursor Bugbot for commit a95ddd5. This will update automatically on new commits.

@hallerite hallerite marked this pull request as ready for review February 23, 2026 18:00
    dispatchers[endpoint_id] = EndpointDispatcher(slots)
else:
    dispatchers[endpoint_id] = NullEndpointDispatcher(resolved)


Dispatcher uses wrong endpoint configs

High Severity

_build_dispatchers builds one dispatcher per endpoint_id using the first EvalConfig seen, then reuses it for all evals with that endpoint_id. If later evals have different client_config (different endpoints source, keys/URLs, or overrides that change endpoint_configs), environment.evaluate runs requests using slot.config from the wrong eval.
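A common remedy for this class of bug (a sketch, not necessarily the right fix for this PR) is to key the dispatcher cache on both the endpoint_id and a fingerprint of the client config, so evals whose `client_config` differs never reuse a dispatcher built from another eval's config:

```python
import json

_dispatchers: dict[tuple[str, str], object] = {}

def get_dispatcher(endpoint_id: str, client_config: dict, build):
    """Return a dispatcher cached per (endpoint_id, config fingerprint).

    `build` constructs a new dispatcher from the config; the fingerprint
    is a stable serialization so differing keys/URLs/overrides get their
    own dispatcher instead of inheriting the first eval's.
    """
    fingerprint = json.dumps(client_config, sort_keys=True)
    key = (endpoint_id, fingerprint)
    if key not in _dispatchers:
        _dispatchers[key] = build(client_config)
    return _dispatchers[key]
```

The trade-off is that evals with genuinely different configs no longer share capacity, which may or may not be the intended semantics here.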

Additional Locations (2)


),
)
for cfg, ep in zip(resolved, endpoint_cfgs)
]

Variant zip may drop configurations

Medium Severity

_build_dispatchers pairs resolved = resolve_client_configs(ec.client_config) with endpoint_cfgs = ec.client_config.endpoint_configs using zip(resolved, endpoint_cfgs). If resolve_client_configs ever returns a different length/order than endpoint_configs, variants can be silently dropped or mispaired, producing incorrect max_concurrent assignment per ClientConfig.



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


sampling_args,
max_retries=max_retries,
state_columns=state_columns,
)

Dispatcher path leaks HTTP clients in non-server mode

Low Severity

When the dispatcher path is used and self.env_client is None (non-server mode), each call to _dispatched_rollout/_dispatched_group passes slot.config (a ClientConfig) to run_rollout/run_group, which calls resolve_client(slot.config) creating a new HTTP client per rollout. These clients are never closed. The legacy path avoids this by pre-creating clients in local_endpoint_clients and closing them in the finally block. The standard eval flow uses server mode and isn't affected, but the generate() public API allows this combination.
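The leak described above can be closed by tying the client's lifetime to the rollout itself, so it is released even when the rollout raises. All names below are hypothetical stand-ins for the PR's internals, and `aclose()` assumes an httpx-style async client:

```python
async def dispatched_rollout(dispatcher, run_rollout, resolve_client):
    # Acquire a slot, build the client for that slot's config, and
    # guarantee both the client and the slot are released on any exit.
    slot = await dispatcher.acquire()
    try:
        client = resolve_client(slot.config)
        try:
            return await run_rollout(client)
        finally:
            await client.aclose()  # close per-rollout client (httpx-style)
    finally:
        await dispatcher.release(slot)
```

An alternative closer to the legacy path would be pre-creating one client per slot up front and closing them all in a single finally block after the run.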

