ModelDeployment never schedules on a fresh cluster: gateway Object's stale observed manifest blocks status.gateway.address

### What happened?

On a fresh EKS `InferenceCluster`, a `ModelDeployment` never schedules. It stays at `ReplicasScheduled=False` with:

```
0 of 1 replicas scheduled (checked 1 clusters)
Reason: InsufficientCapacity
```

even though the cluster is `Ready=True`, reports a matching GPU pool, and has free nodes. The deployment never recovers on its own — no model is ever served, with no manual intervention.

The root cause is upstream of the scheduler: the cluster's `status.gateway.address` is never populated, so the scheduler treats the cluster as unschedulable.

The `ModelDeployment` scheduler filters out any cluster without a gateway address (`_cluster_ready` in `compose-model-deployment`):

```python
def _cluster_ready(cluster):
    if not cluster.status.gateway or not cluster.status.gateway.address:
        return False
    return any(c.type == "Ready" and c.status == "True" for c in cluster.status.conditions or [])
```

On the affected cluster the field is empty:

```
$ kubectl get ic eks-us-west -o jsonpath='{.status.gateway}'
   # (empty)
$ kubectl get ic eks-us-west -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
   {"reason":"Available","status":"True","type":"Ready"}
```

The address propagation chain is: Envoy `Gateway` (on the workload cluster) gets a LoadBalancer address → `compose-serving-stack` reads it from the `gateway` provider-kubernetes `Object`'s observed manifest and writes `ServingStack.status.gateway.address` → `compose-inference-cluster` copies it to `InferenceCluster.status.gateway.address` → scheduler reads it.

The chain breaks at the first hop. The live Gateway on the workload cluster has its address:

```
$ kubectl --kubeconfig eks get gateway inference-gateway -n modelplane-system \
    -o jsonpath='{.status.addresses}'
[{"type":"Hostname","value":"...elb.amazonaws.com"}]
```

but the `gateway` `Object`'s observed manifest is frozen at a pre-address snapshot, indefinitely:

```
$ kubectl get object.kubernetes.m.crossplane.io <gateway-obj> -n modelplane-system \
    -o jsonpath='{.status.atProvider.manifest.status}'
{"conditions":[{"lastTransitionTime":"1970-01-01T00:00:00Z","message":"Waiting for controller","reason":"Pending","status":"Unknown","type":"Accepted"},
               {"lastTransitionTime":"1970-01-01T00:00:00Z","message":"Waiting for controller","reason":"Pending","status":"Unknown","type":"Programmed"}]}
```

This snapshot persisted for >8 minutes while the live Gateway had its address the whole time.

### Root cause

`compose-serving-stack` composes the Envoy `Gateway` as a provider-kubernetes `Object` with `readiness.policy: SuccessfulCreate` and the default `watch: false` (the `_k8s_object` helper sets neither field; the Gateway `Object` is built at `functions/compose-serving-stack/function/fn.py` `compose_gateway`).

provider-kubernetes (v1.2.5) refreshes an `Object`'s `status.atProvider.manifest` only when its reconciler runs. Two settings in the namespaced object controller (`internal/controller/namespaced/object/object.go`) mean a reconcile almost never runs after the initial apply:

1. `WithEventFilter(resource.DesiredStateChanged())` — a reconcile is enqueued only when the `Object`'s own desired state (spec/annotations) changes. The Gateway acquiring its LoadBalancer address is an **external** change and does not enqueue a reconcile.
2. `WithPollInterval(o.PollInterval)` with `WithPollIntervalHook` — the periodic drift poll defaults to `10m`. The hook shortens it to `30s` **only while the `Object` is not `Ready=True`**. Because the Gateway `Object` uses `readiness: SuccessfulCreate`, it is `Ready=True` the moment it is created, so it polls on the full `10m` interval.

Timeline on the affected cluster:

- `06:43:45` — Gateway `Object` applied; first (and only timely) observe captures the manifest *before* Envoy programs it ("Waiting for controller").
- `06:43:45` — `readiness: SuccessfulCreate` ⇒ `Ready=True` immediately ⇒ poll interval is the full `10m`.
- `06:44:36` — Envoy assigns the ELB address (live Gateway `Programmed=True`). This external change does not trigger a reconcile.
- next re-observe is ~10 minutes later. The cluster has no gateway address, and the scheduler reports `InsufficientCapacity`, for that whole window.

Why this `Object` and not the others: the `model-serving` `Object` (a Deployment) captures live `.status` on the very first observe, because the Deployment controller populates `.status` within milliseconds of apply. The Gateway is the only serving-stack `Object` whose **consumed status field is populated asynchronously by a controller after the apply round-trip**, and whose value a downstream consumer (the scheduler) blocks on. So the staleness is invisible everywhere except here.

### How can we reproduce it?

1. Install Modelplane on a kind control plane.
2. Create an `InferenceGateway`, an EKS `InferenceClass`, and an EKS `InferenceCluster` (`examples/platform/inference-class-eks-l4.yaml`, `examples/platform/inference-cluster-eks.yaml`). Wait for the cluster to reach `Ready=True`.
3. Apply `examples/deployment/model-deployment.yaml` + `model-service.yaml`.
4. Observe the `ModelDeployment` stuck at `ReplicasScheduled=False / InsufficientCapacity`, and `kubectl get ic <name> -o jsonpath='{.status.gateway}'` empty, despite the live Envoy Gateway on the workload cluster having an address.

Workaround (manual): force the gateway `Object` to reconcile, which makes provider-kubernetes re-observe the live manifest and propagate the address:

```
kubectl annotate object.kubernetes.m.crossplane.io <gateway-obj> -n modelplane-system \
  poke="$(date +%s)" --overwrite
```

Within a reconcile the address propagates up the chain, the cluster becomes schedulable, and the deployment serves. (Verified end to end: the model served an OpenAI chat completion after the poke.)

### Possible fixes

- Set `spec.watch: true` on the Gateway `Object` so provider-kubernetes reacts to the live Gateway's status changes. This is the purpose-built mechanism, but `watch` is an alpha field that is only honored when the provider's `EnableAlphaWatches` feature gate is on — which would also require a `DeploymentRuntimeConfig` enabling `--enable-watches` (and the matching RBAC) on provider-kubernetes. The provider currently runs with only `EnableBetaManagementPolicies` and `EnableBetaServerSideApply`.
- Or change the Gateway `Object`'s `readiness.policy` so it is not considered `Ready` until the address is observed (e.g. a `DeriveFromObject`-style policy that gates on `status.addresses`). While not `Ready`, the poll-interval hook drops to `30s`, so the address would propagate within ~30s instead of ~10m. This keeps the deployment's time-to-serve bounded without enabling an alpha gate.

### What environment did it happen in?

Modelplane version: `v0.1.0-dev.1781072131.gb5e0089` (PR #101 branch)
Crossplane: v2.3.2
provider-kubernetes: v1.2.5 (default args; `EnableAlphaWatches` off)
Cluster source: EKS (provisioned), us-west-2, 1x NVIDIA L4 (g6.xlarge)
Kubernetes (control plane): kind v1.34.0


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ModelDeployment never schedules on a fresh cluster: gateway Object's stale observed manifest blocks status.gateway.address #121

What happened?

Root cause

How can we reproduce it?

Possible fixes

What environment did it happen in?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

ModelDeployment never schedules on a fresh cluster: gateway Object's stale observed manifest blocks status.gateway.address #121

Description

What happened?

Root cause

How can we reproduce it?

Possible fixes

What environment did it happen in?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions