Skip to content

ModelDeployment never schedules on a fresh cluster: gateway Object's stale observed manifest blocks status.gateway.address #121

Description

@negz

What happened?

On a fresh EKS InferenceCluster, a ModelDeployment never schedules. It stays at ReplicasScheduled=False with:

0 of 1 replicas scheduled (checked 1 clusters)
Reason: InsufficientCapacity

even though the cluster is Ready=True, reports a matching GPU pool, and has free nodes. The deployment never recovers on its own — no model is ever served, with no manual intervention.

The root cause is upstream of the scheduler: the cluster's status.gateway.address is never populated, so the scheduler treats the cluster as unschedulable.

The ModelDeployment scheduler filters out any cluster without a gateway address (_cluster_ready in compose-model-deployment):

def _cluster_ready(cluster):
    if not cluster.status.gateway or not cluster.status.gateway.address:
        return False
    return any(c.type == "Ready" and c.status == "True" for c in cluster.status.conditions or [])

On the affected cluster the field is empty:

$ kubectl get ic eks-us-west -o jsonpath='{.status.gateway}'
   # (empty)
$ kubectl get ic eks-us-west -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
   {"reason":"Available","status":"True","type":"Ready"}

The address propagation chain is: Envoy Gateway (on the workload cluster) gets a LoadBalancer address → compose-serving-stack reads it from the gateway provider-kubernetes Object's observed manifest and writes ServingStack.status.gateway.addresscompose-inference-cluster copies it to InferenceCluster.status.gateway.address → scheduler reads it.

The chain breaks at the first hop. The live Gateway on the workload cluster has its address:

$ kubectl --kubeconfig eks get gateway inference-gateway -n modelplane-system \
    -o jsonpath='{.status.addresses}'
[{"type":"Hostname","value":"...elb.amazonaws.com"}]

but the gateway Object's observed manifest is frozen at a pre-address snapshot, indefinitely:

$ kubectl get object.kubernetes.m.crossplane.io <gateway-obj> -n modelplane-system \
    -o jsonpath='{.status.atProvider.manifest.status}'
{"conditions":[{"lastTransitionTime":"1970-01-01T00:00:00Z","message":"Waiting for controller","reason":"Pending","status":"Unknown","type":"Accepted"},
               {"lastTransitionTime":"1970-01-01T00:00:00Z","message":"Waiting for controller","reason":"Pending","status":"Unknown","type":"Programmed"}]}

This snapshot persisted for >8 minutes while the live Gateway had its address the whole time.

Root cause

compose-serving-stack composes the Envoy Gateway as a provider-kubernetes Object with readiness.policy: SuccessfulCreate and the default watch: false (the _k8s_object helper sets neither field; the Gateway Object is built at functions/compose-serving-stack/function/fn.py compose_gateway).

provider-kubernetes (v1.2.5) refreshes an Object's status.atProvider.manifest only when its reconciler runs. Two settings in the namespaced object controller (internal/controller/namespaced/object/object.go) mean a reconcile almost never runs after the initial apply:

  1. WithEventFilter(resource.DesiredStateChanged()) — a reconcile is enqueued only when the Object's own desired state (spec/annotations) changes. The Gateway acquiring its LoadBalancer address is an external change and does not enqueue a reconcile.
  2. WithPollInterval(o.PollInterval) with WithPollIntervalHook — the periodic drift poll defaults to 10m. The hook shortens it to 30s only while the Object is not Ready=True. Because the Gateway Object uses readiness: SuccessfulCreate, it is Ready=True the moment it is created, so it polls on the full 10m interval.

Timeline on the affected cluster:

  • 06:43:45 — Gateway Object applied; first (and only timely) observe captures the manifest before Envoy programs it ("Waiting for controller").
  • 06:43:45readiness: SuccessfulCreateReady=True immediately ⇒ poll interval is the full 10m.
  • 06:44:36 — Envoy assigns the ELB address (live Gateway Programmed=True). This external change does not trigger a reconcile.
  • next re-observe is ~10 minutes later. The cluster has no gateway address, and the scheduler reports InsufficientCapacity, for that whole window.

Why this Object and not the others: the model-serving Object (a Deployment) captures live .status on the very first observe, because the Deployment controller populates .status within milliseconds of apply. The Gateway is the only serving-stack Object whose consumed status field is populated asynchronously by a controller after the apply round-trip, and whose value a downstream consumer (the scheduler) blocks on. So the staleness is invisible everywhere except here.

How can we reproduce it?

  1. Install Modelplane on a kind control plane.
  2. Create an InferenceGateway, an EKS InferenceClass, and an EKS InferenceCluster (examples/platform/inference-class-eks-l4.yaml, examples/platform/inference-cluster-eks.yaml). Wait for the cluster to reach Ready=True.
  3. Apply examples/deployment/model-deployment.yaml + model-service.yaml.
  4. Observe the ModelDeployment stuck at ReplicasScheduled=False / InsufficientCapacity, and kubectl get ic <name> -o jsonpath='{.status.gateway}' empty, despite the live Envoy Gateway on the workload cluster having an address.

Workaround (manual): force the gateway Object to reconcile, which makes provider-kubernetes re-observe the live manifest and propagate the address:

kubectl annotate object.kubernetes.m.crossplane.io <gateway-obj> -n modelplane-system \
  poke="$(date +%s)" --overwrite

Within a reconcile the address propagates up the chain, the cluster becomes schedulable, and the deployment serves. (Verified end to end: the model served an OpenAI chat completion after the poke.)

Possible fixes

  • Set spec.watch: true on the Gateway Object so provider-kubernetes reacts to the live Gateway's status changes. This is the purpose-built mechanism, but watch is an alpha field that is only honored when the provider's EnableAlphaWatches feature gate is on — which would also require a DeploymentRuntimeConfig enabling --enable-watches (and the matching RBAC) on provider-kubernetes. The provider currently runs with only EnableBetaManagementPolicies and EnableBetaServerSideApply.
  • Or change the Gateway Object's readiness.policy so it is not considered Ready until the address is observed (e.g. a DeriveFromObject-style policy that gates on status.addresses). While not Ready, the poll-interval hook drops to 30s, so the address would propagate within ~30s instead of ~10m. This keeps the deployment's time-to-serve bounded without enabling an alpha gate.

What environment did it happen in?

Modelplane version: v0.1.0-dev.1781072131.gb5e0089 (PR #101 branch)
Crossplane: v2.3.2
provider-kubernetes: v1.2.5 (default args; EnableAlphaWatches off)
Cluster source: EKS (provisioned), us-west-2, 1x NVIDIA L4 (g6.xlarge)
Kubernetes (control plane): kind v1.34.0

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions