Withhold a ModelEndpoint until its ModelReplica is Ready #163

Merged
dennis-upbound merged 1 commit into main from cold-shoulder on Jun 16, 2026


negz (Collaborator) commented Jun 16, 2026

Description of your changes

Fixes #102.

A ModelDeployment fans out into a ModelReplica and a ModelEndpoint per scheduled placement. compose-model-deployment composed the ModelEndpoint as soon as the placement was scheduled, from the cluster's gateway address alone, with no regard for whether the replica's model was actually serving. Once the endpoint's Service and EndpointSlice existed it advertised a backendName, ModelService picked it up, and the HTTPRoute routed traffic to it — while the destination pods were still pulling the engine image and loading weights. The workload cluster gateway returned 503s: on every deployment from scratch, and on scale-up for the share of traffic hitting each new replica until it warmed up.

This withholds the ModelEndpoint until its ModelReplica reports Ready=True. The replica's Ready tracks both the engine workloads serving and the remote Service and HTTPRoute that front them — the whole traffic path the endpoint advertises — so gating on it ensures routing only ever points at a backend that can serve. The endpoint is composed on the reconcile that first observes the replica Ready, and withdrawn again if the replica later goes not-Ready, pulling a dead backend out of rotation. This mirrors the existing handling of placements on clusters with no gateway address, which already get no endpoint.
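The gate described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in fn.py: it assumes the observed ModelReplica is available as a plain dict with Kubernetes-style status.conditions, and the helper names is_ready and should_compose_endpoint are hypothetical.

```python
from typing import Any, Mapping, Optional


def is_ready(resource: Optional[Mapping[str, Any]]) -> bool:
    """Return True only if the resource carries a Ready condition with status "True".

    A missing resource, missing status, or missing condition all count as
    not Ready, so a replica that has never been observed is never advertised.
    """
    if not resource:
        return False
    status = resource.get("status") or {}
    conditions = status.get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def should_compose_endpoint(
    gateway_address: Optional[str],
    observed_replica: Optional[Mapping[str, Any]],
) -> bool:
    """Hypothetical gate: compose the endpoint only when both checks pass.

    The gateway address check mirrors the existing rule that placements on
    clusters with no gateway address get no endpoint.
    """
    return bool(gateway_address) and is_ready(observed_replica)
```

Because the decision is recomputed from observed state on every reconcile, the same check covers both directions: the endpoint appears once the replica is first observed Ready, and is left out again if the replica later goes not-Ready.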

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

A ModelDeployment fans out into a ModelReplica and a ModelEndpoint per
scheduled placement. compose-model-deployment composed the ModelEndpoint
as soon as the placement was scheduled, from the cluster's gateway
address alone, without regard for whether the replica's model was
actually serving. As soon as the endpoint's Service and EndpointSlice
existed it advertised a backendName, ModelService picked it up, and the
HTTPRoute routed traffic to it.

The destination pods were still warming up - pulling the engine image and
loading model weights - so the workload cluster gateway returned 503s.
This happened on every deployment from scratch, and on scale-up a share
of traffic 503'd for the duration of each new replica's warm-up.

This change withholds the ModelEndpoint until its ModelReplica reports
Ready=True. The replica's Ready tracks both the engine workloads serving
and the remote Service and HTTPRoute that front them - the whole traffic
path the endpoint advertises - so gating on it ensures routing only ever
points at a backend that can serve. The endpoint is composed on the
reconcile that first observes the replica Ready, and withdrawn again if
the replica later goes not-Ready, pulling a dead backend out of rotation.
This mirrors the existing behaviour for placements on clusters with no
gateway address, which already get no endpoint.

Fixes #102.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Copilot AI review requested due to automatic review settings June 16, 2026 05:52

Copilot AI left a comment


Pull request overview

This PR prevents ModelService/HTTPRoute from routing traffic to a newly scheduled replica before it can actually serve by withholding (and, if necessary, withdrawing) the corresponding ModelEndpoint until the ModelReplica reports Ready=True. This addresses the warm-up window 503s described in #102 by ensuring only ready backends are advertised for routing.

Changes:

  • Gate ModelEndpoint composition on the observed ModelReplica Ready condition, and omit the endpoint from desired state when the replica is not-ready (triggering deletion).
  • Update composition tests to cover “withhold until ready” and “withdraw when not-ready” behaviors, and adjust expectations across existing scenarios.
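As a rough illustration of why omitting the endpoint from desired state withdraws it, the sketch below models a reconcile as a set difference between observed and desired resource names. It is a simplified model under stated assumptions, not the function's actual code: the names desired_resources and plan, and the replica-/endpoint- naming scheme, are hypothetical.

```python
from typing import Dict, List, Set


def desired_resources(placement: str, has_gateway: bool, replica_ready: bool) -> Set[str]:
    """Hypothetical desired state for one placement.

    The replica is always desired. The endpoint is desired only when the
    cluster has a gateway address and the replica is observed Ready.
    """
    desired = {f"replica-{placement}"}
    if has_gateway and replica_ready:
        desired.add(f"endpoint-{placement}")
    return desired


def plan(observed: Set[str], desired: Set[str]) -> Dict[str, List[str]]:
    """Model the reconcile outcome: anything observed but no longer desired is deleted."""
    return {
        "create": sorted(desired - observed),
        "delete": sorted(observed - desired),
    }
```

For example, with observed resources {"replica-a", "endpoint-a"} and a replica that has gone not-Ready, plan(observed, desired_resources("a", True, False)) returns {"create": [], "delete": ["endpoint-a"]}: the replica stays and only the endpoint is withdrawn.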

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| functions/compose-model-deployment/function/fn.py | Withholds endpoint composition until the per-placement replica is observed Ready=True, preventing premature routing to warming backends. |
| functions/compose-model-deployment/tests/test_fn.py | Adds/updates cases to validate endpoint withholding/withdrawal based on replica readiness and updates expected conditions/status. |


@dennis-upbound dennis-upbound merged commit 9328afb into main Jun 16, 2026
4 checks passed
@negz negz deleted the cold-shoulder branch June 16, 2026 16:55
Development

Successfully merging this pull request may close these issues.

503s during deployment and scale-up
