Withhold a ModelEndpoint until its ModelReplica is Ready #163
Merged
A ModelDeployment fans out into a ModelReplica and a ModelEndpoint per scheduled placement. compose-model-deployment composed the ModelEndpoint as soon as the placement was scheduled, from the cluster's gateway address alone, without regard for whether the replica's model was actually serving. As soon as the endpoint's Service and EndpointSlice existed it advertised a backendName, ModelService picked it up, and the HTTPRoute routed traffic to it. The destination pods were still warming up - pulling the engine image and loading model weights - so the workload cluster gateway returned 503s. This happened on every deployment from scratch, and on scale-up a share of traffic 503'd for the duration of each new replica's warm-up.

This change withholds the ModelEndpoint until its ModelReplica reports Ready=True. The replica's Ready tracks both the engine workloads serving and the remote Service and HTTPRoute that front them - the whole traffic path the endpoint advertises - so gating on it ensures routing only ever points at a backend that can serve. The endpoint is composed on the reconcile that first observes the replica Ready, and withdrawn again if the replica later goes not-Ready, pulling a dead backend out of rotation. This mirrors the existing behaviour for placements on clusters with no gateway address, which already get no endpoint.

Fixes #102.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Pull request overview
This PR prevents ModelService/HTTPRoute from routing traffic to a newly scheduled replica before it can actually serve by withholding (and, if necessary, withdrawing) the corresponding ModelEndpoint until the ModelReplica reports Ready=True. This addresses the warm-up window 503s described in #102 by ensuring only ready backends are advertised for routing.
Changes:
- Gate `ModelEndpoint` composition on the observed `ModelReplica` `Ready` condition, and omit the endpoint from desired state when the replica is not-ready (triggering deletion).
- Update composition tests to cover “withhold until ready” and “withdraw when not-ready” behaviors, and adjust expectations across existing scenarios.
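The gating described in the first bullet can be sketched roughly as follows. This is a minimal illustration over plain dictionaries, not the code in `fn.py`: the helper names `is_ready` and `compose`, the resource keys (`replica-<cluster>`, `endpoint-<cluster>`), and the `gatewayAddress` field are all hypothetical, chosen only to show the shape of the check.

```python
from typing import Any, Optional


def is_ready(resource: Optional[dict[str, Any]]) -> bool:
    """Return True only if the resource carries a Ready condition with status "True"."""
    if not resource:
        return False  # Not observed yet, so it cannot be Ready.
    conditions = resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def compose(
    placement: dict[str, Any], observed: dict[str, dict[str, Any]]
) -> dict[str, dict[str, Any]]:
    """Build desired resources for one placement (hypothetical layout)."""
    cluster = placement["cluster"]
    replica_key = f"replica-{cluster}"

    # The replica is always desired once the placement is scheduled.
    desired = {replica_key: {"kind": "ModelReplica", "spec": {"cluster": cluster}}}

    # The endpoint is desired only when the cluster has a gateway address
    # AND the observed replica reports Ready=True. Otherwise it is left out.
    gateway = placement.get("gatewayAddress")
    if gateway and is_ready(observed.get(replica_key)):
        desired[f"endpoint-{cluster}"] = {
            "kind": "ModelEndpoint",
            "spec": {"address": gateway},
        }
    return desired
```

Under these assumptions, a placement whose replica is missing, still warming, or on a cluster without a gateway address all take the same path: the endpoint key is simply absent from the returned desired state.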
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| functions/compose-model-deployment/function/fn.py | Withholds endpoint composition until the per-placement replica is observed Ready=True, preventing premature routing to warming backends. |
| functions/compose-model-deployment/tests/test_fn.py | Adds/updates cases to validate endpoint withholding/withdrawal based on replica readiness and updates expected conditions/status. |
dennis-upbound approved these changes on Jun 16, 2026
Description of your changes
Fixes #102.
A ModelDeployment fans out into a ModelReplica and a ModelEndpoint per scheduled placement.
`compose-model-deployment` composed the ModelEndpoint as soon as the placement was scheduled, from the cluster's gateway address alone, with no regard for whether the replica's model was actually serving. Once the endpoint's Service and EndpointSlice existed it advertised a `backendName`, ModelService picked it up, and the HTTPRoute routed traffic to it while the destination pods were still pulling the engine image and loading weights. The workload cluster gateway returned 503s: on every deployment from scratch, and on scale-up for the share of traffic hitting each new replica until it warmed up.

This withholds the ModelEndpoint until its ModelReplica reports `Ready=True`. The replica's `Ready` tracks both the engine workloads serving and the remote Service and HTTPRoute that front them (the whole traffic path the endpoint advertises), so gating on it ensures routing only ever points at a backend that can serve. The endpoint is composed on the reconcile that first observes the replica Ready, and withdrawn again if the replica later goes not-Ready, pulling a dead backend out of rotation. This mirrors the existing handling of placements on clusters with no gateway address, which already get no endpoint.

I have:
- `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- `git commit -s`.
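To make the compose-then-withdraw lifecycle from the description concrete, here is a small simulation over a sequence of reconciles. It is a sketch under stated assumptions rather than the function's actual logic: `desired_state`, `apply`, and `simulate` are hypothetical names, resources are reduced to name/kind pairs, and `apply` stands in for whatever applies desired state, assuming that anything previously composed but missing from desired state is deleted.

```python
def desired_state(replica_ready: bool) -> dict[str, str]:
    """Desired resources for one placement, given the observed replica readiness."""
    desired = {"replica": "ModelReplica"}
    if replica_ready:
        # Only advertise the endpoint while the replica is observed Ready.
        desired["endpoint"] = "ModelEndpoint"
    return desired


def apply(
    existing: dict[str, str], desired: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Apply desired state: create what is new, delete what is no longer desired."""
    events = [f"create {name}" for name in desired if name not in existing]
    events += [f"delete {name}" for name in existing if name not in desired]
    return dict(desired), events


def simulate(readiness: list[bool]) -> list[list[str]]:
    """Run one reconcile per observed readiness value and log what changed."""
    existing: dict[str, str] = {}
    log: list[list[str]] = []
    for ready in readiness:
        existing, events = apply(existing, desired_state(ready))
        log.append(events)
    return log


# Warming, then Ready for two reconciles, then not-Ready again.
print(simulate([False, True, True, False]))
# [['create replica'], ['create endpoint'], [], ['delete endpoint']]
```

The replica is created on the first reconcile, the endpoint appears only once readiness is observed, and it is removed on the first reconcile that sees the replica not-Ready.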