Gate serving-stack Gateway readiness on its LoadBalancer address #162

Merged: dennis-upbound merged 1 commit into main from mind-the-gate on Jun 16, 2026

Conversation

@negz (Collaborator) commented Jun 16, 2026

Description of your changes

Fixes #121

On a fresh InferenceCluster a ModelDeployment never schedules: it sits at ReplicasScheduled=False / InsufficientCapacity because the cluster's status.gateway.address is never populated, even though the live Envoy Gateway on the workload cluster has had its address the whole time.

compose-serving-stack wraps the Gateway in a provider-kubernetes Object with the default readiness.policy: SuccessfulCreate, so it's Ready the instant it's applied. provider-kubernetes only re-observes an Object's manifest on its fast (~30s) poll while the Object is not Ready; a Ready Object re-observes only on the slow (~10m) drift poll. The Gateway's address is assigned asynchronously after the first observe, so the observed manifest stays frozen at a pre-address snapshot for up to ~10m.
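
To make the timing concrete, here is a toy model of the polling behavior described above. It is only a sketch: it assumes the approximate ~30s and ~10m intervals quoted in this description, and the function name and the 45s address-assignment delay are hypothetical. It is not provider-kubernetes code.

```python
FAST_POLL_S = 30   # assumed fast poll while the Object is not Ready (~30s)
SLOW_POLL_S = 600  # assumed drift poll once the Object is Ready (~10m)


def seconds_until_address_observed(address_assigned_at: int, gate_on_address: bool) -> int:
    """Return the time of the first observe that sees the Gateway address.

    gate_on_address=False models SuccessfulCreate: the Object is Ready as soon
    as it is applied, so every later observe waits for the slow drift poll.
    gate_on_address=True models readiness gated on the address: the Object
    stays not Ready, and on the fast poll, until the address shows up.
    """
    now = 0  # the first observe happens when the Object is applied
    while True:
        if now >= address_assigned_at:
            return now
        ready = not gate_on_address
        now += SLOW_POLL_S if ready else FAST_POLL_S


if __name__ == "__main__":
    # Hypothetical: the LoadBalancer address lands 45s after the first observe.
    print(seconds_until_address_observed(45, gate_on_address=False))  # 600
    print(seconds_until_address_observed(45, gate_on_address=True))   # 60
```

Under these assumptions the ungated Object surfaces the address only at the next drift poll, while the gated Object surfaces it on the first fast poll after assignment.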

This change gives the Gateway Object a DeriveFromCelQuery readiness policy gating on the observed status.addresses. While the address is absent the Object is not Ready, so provider-kubernetes keeps re-observing on its ~30s poll and the address propagates promptly. This mirrors the pattern compose-model-replica already uses.
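
For readers unfamiliar with the policy, a minimal sketch of what such an Object could look like when built in Python follows. The helper name, the CEL expression, the API version, and the provider config name are illustrative assumptions; the actual query and wiring live in functions/compose-serving-stack/function/fn.py and may differ.

```python
# Hypothetical CEL query: Ready only once the observed Gateway reports at
# least one address. Assumes provider-kubernetes exposes the observed
# manifest to the query as `object`.
GATEWAY_READY_CEL = (
    "has(object.status) && has(object.status.addresses) "
    "&& object.status.addresses.size() > 0"
)


def gateway_object(gateway_manifest: dict, provider_config: str) -> dict:
    """Wrap a Gateway manifest in a provider-kubernetes Object (illustrative)."""
    return {
        # Assumed API version; use whichever version the cluster serves.
        "apiVersion": "kubernetes.crossplane.io/v1alpha2",
        "kind": "Object",
        "spec": {
            # Replaces the default SuccessfulCreate policy, so the Object
            # stays not Ready until the address is observed.
            "readiness": {
                "policy": "DeriveFromCelQuery",
                "celQuery": GATEWAY_READY_CEL,
            },
            "forProvider": {"manifest": gateway_manifest},
            "providerConfigRef": {"name": provider_config},
        },
    }


if __name__ == "__main__":
    # Hypothetical Gateway manifest and provider config name.
    gateway = {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": "serving", "namespace": "default"},
        "spec": {"gatewayClassName": "envoy", "listeners": []},
    }
    obj = gateway_object(gateway, provider_config="workload-cluster")
    print(obj["spec"]["readiness"]["policy"])
```

The `has(...)` guards are there because a freshly created Gateway may not have a status at all yet; without them the query could error rather than evaluate to false.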

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

Copilot AI review requested due to automatic review settings June 16, 2026 05:39

Copilot AI left a comment

Pull request overview

This PR fixes a scheduling deadlock on fresh InferenceClusters by ensuring the composed Envoy Gateway (wrapped as a provider-kubernetes Object) is not considered ready until its LoadBalancer address is actually observed, allowing the address to propagate quickly into status.gateway.address for downstream scheduling.

Changes:

  • Add a DeriveFromCelQuery readiness policy to the composed Gateway Object, gated on status.addresses being present/non-empty.
  • Extend the serving-stack unit tests to validate readiness gating and status propagation behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| functions/compose-serving-stack/function/fn.py | Adds a CEL readiness query and wires it into the composed Gateway Object to keep provider-kubernetes re-observing until the address appears. |
| functions/compose-serving-stack/tests/test_fn.py | Adds/updates tests to cover the Gateway readiness gating and address surfacing behavior. |

On a fresh InferenceCluster a ModelDeployment never schedules: it stays
at ReplicasScheduled=False / InsufficientCapacity because the cluster's
status.gateway.address is never populated, even though the live Envoy
Gateway on the workload cluster has an address. The scheduler filters
out any cluster without a gateway address.

compose-serving-stack wraps the Envoy Gateway in a provider-kubernetes
Object with the default readiness.policy: SuccessfulCreate, so the
Object is Ready the instant it's applied. provider-kubernetes only
re-observes an Object's status.atProvider.manifest on its fast (~30s)
poll while the Object is not Ready; a Ready Object re-observes only on
the slow (~10m) drift poll. The Gateway's LoadBalancer address is
assigned asynchronously after the first observe, so the observed
manifest stays frozen at a pre-address snapshot, and the address fails
to propagate up the chain, for up to ~10m.

This change gives the Gateway Object a DeriveFromCelQuery readiness
policy that gates on the observed manifest's status.addresses. While
the address is absent the Object is not Ready, so provider-kubernetes
keeps re-observing on its ~30s poll and the address propagates promptly
instead of after the full drift interval. This mirrors the
DeriveFromCelQuery pattern compose-model-replica already uses for
workload readiness, and needs no alpha watch feature gate.

Fixes #121.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@dennis-upbound merged commit 95e3b1c into main on Jun 16, 2026
3 checks passed
@negz deleted the mind-the-gate branch June 16, 2026 16:55

Development

Successfully merging this pull request may close these issues.

ModelDeployment never schedules on a fresh cluster: gateway Object's stale observed manifest blocks status.gateway.address

3 participants