add elastic endpoint pool for dynamic GPU scavenging #957

Open
hallerite wants to merge 10 commits into main from hallerite/elastic
hallerite commented Feb 24, 2026

Description

  • Adds opt-in elastic = true mode: a background task polls endpoints.toml and updates the dispatcher's live endpoint list mid-run
  • Retries on preempted servers re-acquire from the dispatcher instead of retrying the same dead endpoint
  • New endpoints are picked up, removed endpoints are drained, in-flight concurrency counts are preserved

Changes

  • LeastLoadedDispatcher.update_variants() — swaps variant list under the condition lock, keyed by api_base_url
  • ElasticEndpointPool (new) — asyncio background task that calls load_endpoints() and pushes updated slots to the dispatcher
  • Dispatched retries moved outside acquire() so preempted-server failures re-acquire a slot on a live endpoint
  • EvalConfig gains elastic, elastic_poll_interval, endpoints_path fields
  • Pool lifecycle wired into run_evaluations() / run_evaluations_tui()
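
The variant swap described in the list above can be sketched roughly as follows. This is an illustration only, not the PR's `LeastLoadedDispatcher`: the `Variant` and `SketchDispatcher` classes and the `in_flight` field are hypothetical names, and the only details taken from the description are the condition lock, the `api_base_url` key, and the `max_concurrent` limit.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical variant record; real variants carry more fields."""
    api_base_url: str
    max_concurrent: int
    in_flight: int = 0


class SketchDispatcher:
    """Illustrative least-loaded dispatcher (assumed shape, not the PR's class)."""

    def __init__(self, variants: list[Variant]) -> None:
        self._variants = {v.api_base_url: v for v in variants}
        self._cond = asyncio.Condition()

    async def acquire(self) -> Variant:
        async with self._cond:
            while True:
                free = [v for v in self._variants.values()
                        if v.in_flight < v.max_concurrent]
                if free:
                    chosen = min(free, key=lambda v: v.in_flight)
                    chosen.in_flight += 1
                    return chosen
                # Every live variant is full: wait for a release or an update.
                await self._cond.wait()

    async def release(self, variant: Variant) -> None:
        async with self._cond:
            # Works for removed variants too: the caller still holds the object.
            variant.in_flight -= 1
            self._cond.notify_all()

    async def update_variants(self, new: list[Variant]) -> None:
        async with self._cond:
            merged: dict[str, Variant] = {}
            for v in new:
                old = self._variants.get(v.api_base_url)
                if old is not None:
                    # Same URL: reuse the object so its in-flight count survives.
                    old.max_concurrent = v.max_concurrent
                    merged[v.api_base_url] = old
                else:
                    merged[v.api_base_url] = v
            # URLs missing from `new` drop out and drain as holders release.
            self._variants = merged
            self._cond.notify_all()
```

Reusing the existing object for an unchanged URL is one simple way to preserve in-flight counts. A removed variant receives no new work, and requests already running against it still release normally.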

Example Elastic Eval Config

elastic = true
elastic_poll_interval = 10
endpoints_path = "endpoints.toml"

[[eval]]
env_id = "primeintellect/math-env"
endpoint_id = "zai-org/GLM-4.7-FP8"

An external sidecar manages endpoints.toml – adding/removing [[endpoint]] entries as GPU servers come and go. The eval job adapts automatically.
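
A minimal sketch of the polling side, assuming a loader callable that stands in for `load_endpoints()` and a `push` coroutine that stands in for the dispatcher update. The `poll_endpoints` function, its parameters, and the choice to keep the previous list when a read fails are all assumptions made for illustration, not behavior confirmed by this PR.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def poll_endpoints(
    load: Callable[[], list],
    push: Callable[[list], Awaitable[None]],
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Reload the endpoint list every `interval` seconds until `stop` is set."""
    last: Optional[list] = None
    while not stop.is_set():
        try:
            current = load()
        except Exception:
            # Assumption: a failed read (for example, the sidecar is mid-write)
            # keeps the previous list instead of dropping every endpoint.
            current = None
        if current is not None and current != last:
            await push(current)
            last = current
        try:
            # Sleep for the interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

Pushing only when the loaded list differs from the last one avoids waking waiting requests on every poll. With `elastic_poll_interval = 10`, the eval job would see a sidecar edit within roughly ten seconds under this scheme.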

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Touches core evaluation concurrency and request routing, and adds background hot-reload behavior; misconfiguration or edge cases could change throughput or cause unexpected blocking, though changes are guarded and covered by new tests.

Overview
Adds per-variant concurrency limits to endpoint registry variants via optional max_concurrent, switching multi-variant dispatch from round-robin to least-loaded routing and enforcing an all-or-nothing configuration rule across variants.

Introduces an opt-in elastic endpoint pool (elastic=true) that polls endpoints.toml during a run and updates the live variant set while preserving in-flight capacity; evaluation wiring now builds shared LeastLoadedDispatcher instances per endpoint_id, disables the global --max-concurrent semaphore when dispatchers are active, and ensures retries re-acquire capacity so failures on removed/preempted endpoints can move to healthy replicas.
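
The retry behavior summarized above (each attempt re-acquires capacity rather than reusing the failed endpoint) might look like the sketch below. `TinyDispatcher` and `call_with_reacquire` are hypothetical stand-ins, and treating `ConnectionError` as the preempted-server signal is an assumption; the PR's actual retry wiring may differ.

```python
import asyncio


class TinyDispatcher:
    """Stand-in with an assumed acquire/release shape; picks the least-loaded URL."""

    def __init__(self, urls):
        self.load = {u: 0 for u in urls}

    async def acquire(self):
        url = min(self.load, key=self.load.get)
        self.load[url] += 1
        return url

    async def release(self, url):
        # The endpoint may have been removed while the request was in flight.
        if url in self.load:
            self.load[url] -= 1


async def call_with_reacquire(dispatcher, send, max_attempts=3):
    """Run `send(url)`, acquiring a fresh slot for every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        url = await dispatcher.acquire()  # inside the loop: each retry re-routes
        try:
            return await send(url)
        except ConnectionError as exc:  # assumed signal for a preempted server
            last_error = exc
        finally:
            await dispatcher.release(url)
    raise last_error
```

If the retry loop instead wrapped only `send` while holding a single acquired slot, every attempt would target the same endpoint, which is the failure mode the PR description says it avoids.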

Updates config/types/docs to support max_concurrent on endpoints and elastic/elastic_poll_interval/endpoints_path on eval configs, and adds tests covering dispatcher acquisition/release semantics, dynamic variant updates, and elastic pool reload behavior.

Written by Cursor Bugbot for commit fbf26b2. This will update automatically on new commits.

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

- Leaf endpoint configuration used inside `ClientConfig.endpoint_configs`. Has the same fields as `ClientConfig` except `endpoint_configs` itself, preventing recursive nesting.
+ Leaf endpoint configuration used inside `ClientConfig.endpoint_configs`. Has the same fields as `ClientConfig` except `endpoint_configs` itself, preventing recursive nesting. The optional `max_concurrent` field limits how many concurrent requests this variant handles; see [Per-Variant Concurrency](evaluation.md#concurrency).
EvalConfig docs missing new elastic pool fields

Low Severity

The EvalConfig section in docs/reference.md is missing the three new fields added by this PR: elastic, elastic_poll_interval, and endpoints_path. These are user-facing configuration options for the elastic endpoint pool feature. The documentation rule requires updating reference docs when core user-facing functionality is modified.

Triggered by project rule: BugBot Instructions


By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.

When per-variant `max_concurrent` limits are configured in the endpoint registry, the endpoint dispatcher manages concurrency globally across all variants and the `--max-concurrent` flag is ignored.
Elastic mode feature undocumented in evaluation docs and skills

Low Severity

The PR adds a new user-facing elastic endpoint pool feature (with elastic, elastic_poll_interval, and endpoints_path config fields), but neither docs/evaluation.md nor skills/evaluate-environments/SKILL.md documents the elastic mode itself. Only per-variant max_concurrent is documented. Users have no documentation for how to enable or configure elastic polling.

Triggered by project rule: BugBot Instructions
