add elastic endpoint pool for dynamic GPU scavengin #957
…-or-nothing concurrency
Cursor Bugbot has reviewed your changes and found 2 potential issues.
- Leaf endpoint configuration used inside `ClientConfig.endpoint_configs`. Has the same fields as `ClientConfig` except `endpoint_configs` itself, preventing recursive nesting.
+ Leaf endpoint configuration used inside `ClientConfig.endpoint_configs`. Has the same fields as `ClientConfig` except `endpoint_configs` itself, preventing recursive nesting. The optional `max_concurrent` field limits how many concurrent requests this variant handles; see [Per-Variant Concurrency](evaluation.md#concurrency).
EvalConfig docs missing new elastic pool fields
Low Severity
The `EvalConfig` section in `docs/reference.md` is missing the three new fields added by this PR: `elastic`, `elastic_poll_interval`, and `endpoints_path`. These are user-facing configuration options for the elastic endpoint pool feature. The documentation rule requires updating reference docs when core user-facing functionality is modified.
Triggered by project rule: BugBot Instructions
  By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.
+ When per-variant `max_concurrent` limits are configured in the endpoint registry, the endpoint dispatcher manages concurrency globally across all variants and the `--max-concurrent` flag is ignored.
Elastic mode feature undocumented in evaluation docs and skills
Low Severity
The PR adds a new user-facing elastic endpoint pool feature (with `elastic`, `elastic_poll_interval`, and `endpoints_path` config fields), but neither `docs/evaluation.md` nor `skills/evaluate-environments/SKILL.md` documents the elastic mode itself. Only per-variant `max_concurrent` is documented. Users have no documentation for how to enable or configure elastic polling.
Triggered by project rule: BugBot Instructions


Description
`elastic = true` mode: a background task polls `endpoints.toml` and updates the dispatcher's live endpoint list mid-run
Changes
- `LeastLoadedDispatcher.update_variants()`: swaps the variant list under the condition lock, keyed by `api_base_url`
- `ElasticEndpointPool` (new): asyncio background task that calls `load_endpoints()` and pushes updated slots to the dispatcher
- Retries go through `acquire()` again, so preempted-server failures re-acquire a slot on a live endpoint
- `EvalConfig` gains `elastic`, `elastic_poll_interval`, and `endpoints_path` fields
- Wired into `run_evaluations()` / `run_evaluations_tui()`
Example Elastic Eval Config
An external sidecar manages endpoints.toml – adding/removing [[endpoint]] entries as GPU servers come and go. The eval job adapts automatically.
Type of Change
Testing
`uv run pytest` locally.
Checklist
Additional Notes
Note
Medium Risk
Touches core evaluation concurrency and request routing, and adds background hot-reload behavior; misconfiguration or edge cases could change throughput or cause unexpected blocking, though changes are guarded and covered by new tests.
Overview
Adds per-variant concurrency limits to endpoint registry variants via optional `max_concurrent`, switching multi-variant dispatch from round-robin to least-loaded routing and enforcing an all-or-nothing configuration rule across variants.
Introduces an opt-in elastic endpoint pool (`elastic=true`) that polls `endpoints.toml` during a run and updates the live variant set while preserving in-flight capacity; evaluation wiring now builds shared `LeastLoadedDispatcher` instances per `endpoint_id`, disables the global `--max-concurrent` semaphore when dispatchers are active, and ensures retries re-acquire capacity so failures on removed/preempted endpoints can move to healthy replicas.
Updates config/types/docs to support `max_concurrent` on endpoints and `elastic`/`elastic_poll_interval`/`endpoints_path` on eval configs, and adds tests covering dispatcher acquisition/release semantics, dynamic variant updates, and elastic pool reload behavior.
Written by Cursor Bugbot for commit fbf26b2. This will update automatically on new commits.