ServingStack NVIDIA DRA driver fails on Cilium clusters: service-account name exceeds 63-char CiliumIdentity label limit #215

### What happened?

On a cluster whose CNI is **Cilium** (seen on Nebius Managed Kubernetes, but expected on any Cilium build that still copies the service-account name into a `CiliumIdentity` label), the ServingStack's **NVIDIA DRA driver kubelet-plugin pod never starts**. As a result, the InferenceCluster's `BackendReady` condition stays `False` indefinitely and no GPU is ever bound. Everything else in the serving stack (cert-manager, Envoy Gateway, LeaderWorkerSet, Prometheus, NFD) comes up fine, and the `gpu.nvidia.com` DeviceClass even registers, but the kubelet plugin is stuck `Pending` and publishes **zero ResourceSlices**.

### Root cause

The pod can't get a network sandbox:

```
Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown
desc = failed to setup network for sandbox "...": plugin type="cilium-cni" failed (add):
unable to create endpoint: Cilium API client timeout exceeded
```

The cilium agent log on the node shows why:

```
level=warning msg="Key allocation attempt failed" error="unable to allocate ID 36359 for key
[... k8s:io.cilium.k8s.policy.serviceaccount=nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu-service-account-kubeletplugin ...]:
CiliumIdentity.cilium.io \"36359\" is invalid: metadata.labels: Invalid value:
\"nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu-service-account-kubeletplugin\":
must be no more than 63 characters" subsys=allocator
```

Cilium derives a `CiliumIdentity` for every pod and copies the pod's **service-account name** into the **label value** `io.cilium.k8s.policy.serviceaccount`. Kubernetes [label values are capped at 63 characters](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set). The DRA driver's service account is named:

```
<helm-release>-dra-driver-nvidia-gpu-service-account-kubeletplugin   # 77 chars for release nebius-llama-23b5b4e31ad4
```

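where `<helm-release>` is `<inferencecluster-name>-<hash>` (here `nebius-llama-23b5b4e31ad4`). The name exceeds 63 characters, so Cilium's identity allocation fails (the `CiliumIdentity` object is rejected as invalid) → the pod never gets an endpoint → its sandbox is never created, so the pod never starts → the kubelet plugin never publishes ResourceSlices.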

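The chart's own suffix is already 52 characters including the leading hyphen (`-dra-driver-nvidia-gpu-service-account-kubeletplugin`). The release prefix adds at least 14 more (a one-character name, a hyphen, and a hash of the 12-character length seen here), for a minimum of 66. The name therefore exceeds 63 **regardless of how short the InferenceCluster name is**, so it can't be worked around from the API; it needs a `fullnameOverride` / short `serviceAccount.name` on the release.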
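The length arithmetic can be checked with a few lines of Python. This is only a sketch: `CHART_SUFFIX` is copied from the agent log above, while `HASH_LEN` assumes the release hash is always 12 characters, as in `23b5b4e31ad4`. The helper names are illustrative and do not exist in modelplane.

```python
LABEL_VALUE_MAX = 63  # Kubernetes label-value limit

# Suffix observed after the release name in the cilium agent log.
CHART_SUFFIX = "-dra-driver-nvidia-gpu-service-account-kubeletplugin"
HASH_LEN = 12  # assumption: the hash always has the length of 23b5b4e31ad4


def service_account_name(release: str) -> str:
    """Service-account name as observed: release name plus chart suffix."""
    return release + CHART_SUFFIX


def shortest_possible_length() -> int:
    """One-character cluster name + '-' + hash + chart suffix."""
    return 1 + 1 + HASH_LEN + len(CHART_SUFFIX)


observed = service_account_name("nebius-llama-23b5b4e31ad4")
print(len(CHART_SUFFIX))            # 52
print(len(observed))                # 77
print(shortest_possible_length())   # 66, still above LABEL_VALUE_MAX
```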

**Offending code:** `functions/compose-serving-stack/function/fn.py:606` (`compose_dra_driver`) builds the `dra-driver-nvidia-gpu` Helm release (`fn.py:~635`) without overriding the chart's fullname / serviceAccount, so the SA inherits the long composed release name.

On EKS/GKE this is harmless — their CNIs don't key a per-pod identity on the SA name — which is why it hasn't shown up before.

### Upstream status (Cilium)

This is a known Cilium bug: [cilium/cilium#16579](https://github.com/cilium/cilium/issues/16579). It was fixed in [cilium/cilium#39552](https://github.com/cilium/cilium/pull/39552), which **removes the `io.cilium.k8s.policy.serviceaccount` label from `CiliumIdentity` entirely** (the label wasn't actually consumed by Cilium), rather than truncating or hashing the name.

The upstream fix does **not** resolve this issue for us in practice:

- It was merged to `main` (2025-05-20) and **deliberately not backported**, so it only exists in Cilium builds cut from main after that date.
- The affected clusters run **managed Cilium** (e.g. Nebius), where we don't choose the Cilium build and can't reconfigure identity-relevant labels.

So we can't rely on the upstream fix reaching the clusters that hit this. The fix has to live on the modelplane side.

### Fix

In `compose_dra_driver`, pass the chart a short, fixed `fullnameOverride` (and/or `serviceAccount.name`, e.g. `dra-gpu`) so the kubelet-plugin SA stays well under 63 chars independent of the InferenceCluster name.
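A minimal sketch of what those values could look like, for illustration only. `dra_driver_values` and `expected_sa_name` are hypothetical helpers, not code from `fn.py`, and how the values get attached to the Helm release there is not shown. The sketch assumes the chart honors `fullnameOverride` and `serviceAccount.name` and derives the kubelet-plugin SA as `<fullname>-service-account-kubeletplugin`; both assumptions should be verified against the chart's `values.yaml` and templates.

```python
from typing import Any

LABEL_VALUE_MAX = 63  # Kubernetes label-value limit

# Assumption: the chart appends this to its fullname for the kubelet-plugin SA.
SA_SUFFIX = "-service-account-kubeletplugin"


def expected_sa_name(fullname: str) -> str:
    """SA name the chart would produce under the assumption above."""
    return fullname + SA_SUFFIX


def dra_driver_values(short_name: str = "dra-gpu") -> dict[str, Any]:
    """Hypothetical Helm values pinning a short, fixed name for the chart."""
    sa_name = expected_sa_name(short_name)
    if len(sa_name) > LABEL_VALUE_MAX:
        raise ValueError(f"{sa_name!r} is {len(sa_name)} chars, limit is 63")
    return {
        "fullnameOverride": short_name,
        # Assumed key; set explicitly in case the chart reads it instead.
        "serviceAccount": {"name": short_name},
    }


values = dra_driver_values()
print(values["fullnameOverride"], len(expected_sa_name("dra-gpu")))  # dra-gpu 37
```

Because the name is fixed, it no longer depends on the InferenceCluster name or the release hash, which is the property the fix needs.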

**Recommendation / general rule:** any name that may be reused as a Kubernetes **label value** or **DNS-1123 label** must stay **≤63 characters**, with headroom for suffixes that charts/controllers append. The ServiceAccount object itself permits up to 253 chars, which is why creation succeeds — but anything that copies the name into a 63-char field (as Cilium does) then fails downstream. This is a recurring ecosystem footgun, not Cilium-specific (e.g. [aws/aws-cdk#23643](https://github.com/aws/aws-cdk/issues/23643) — "serviceAccountName should not be used as a label value; it can exceed 63 characters"). Keeping generated names short by default avoids the whole class of failures across CNIs and controllers.
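One way to apply this rule to generated names is to truncate and append a short hash whenever a name would not fit. The sketch below uses a hypothetical `bounded_name` helper that is not part of modelplane. The 30-character headroom assumes the chart appends `-service-account-kubeletplugin` to its fullname.

```python
import hashlib

LABEL_VALUE_MAX = 63  # Kubernetes label-value / DNS-1123 label limit


def bounded_name(base: str, suffix_headroom: int = 0,
                 limit: int = LABEL_VALUE_MAX) -> str:
    """Return `base` if it fits, else a truncated, hash-suffixed form.

    `suffix_headroom` reserves room for text that a chart or controller
    appends later. Hypothetical helper for illustration.
    """
    budget = limit - suffix_headroom
    if budget < 10:
        raise ValueError("not enough room for a truncated name plus hash")
    if len(base) <= budget:
        return base
    # A short digest keeps truncated names distinct and deterministic.
    digest = hashlib.sha256(base.encode("utf-8")).hexdigest()[:8]
    head = base[: budget - len(digest) - 1].rstrip("-._")
    return f"{head}-{digest}"


release = "nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu"
appended = "-service-account-kubeletplugin"  # assumed chart suffix, 30 chars
fullname = bounded_name(release, suffix_headroom=len(appended))
print(fullname, len(fullname + appended))
```

A fixed short name, as proposed in the fix, is simpler when only one instance exists per cluster. Truncate-and-hash is the fallback when names must stay unique across instances.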

Refs:
- Kubernetes — [Labels and Selectors (63-char value limit)](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set)
- [aws/aws-cdk#23643](https://github.com/aws/aws-cdk/issues/23643)

### How can we reproduce it?

1. Have a Cilium-CNI cluster (e.g. Nebius mk8s 1.34) with a GPU node (NVIDIA driver preinstalled).
2. Put its kubeconfig in a Secret in `modelplane-system`, label the GPU node `modelplane.ai/pool=<pool>`.
3. Apply an `InferenceClass` (DRA `gpu.nvidia.com`) and an `InferenceCluster` (`source: Existing`) — using a normal-length name like `nebius-llama`:
   ```
   kubectl apply -f inference-class.yaml -f inference-cluster.yaml
   ```
4. Watch the ServingStack install on the workload cluster:
   ```
   kubectl --kubeconfig <workload> -n dra-driver-nvidia-gpu get pods
   # dra-driver-nvidia-gpu-kubelet-plugin-xxxxx   0/1   Pending  (FailedCreatePodSandBox, cilium timeout)
   kubectl --kubeconfig <workload> get resourceslices   # none
   kubectl get inferencecluster <name> -o jsonpath='{..conditions[?(@.type=="BackendReady")].status}'  # False (Installing) forever
   ```
5. Confirm in the cilium agent log: `CiliumIdentity ... must be no more than 63 characters` on the `...service-account-kubeletplugin` SA.
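To spot affected service accounts without reading agent logs, the output of `kubectl get serviceaccounts -A -o json` can be scanned for names longer than 63 characters. The sketch below is an illustration: `oversized_service_accounts` is a hypothetical helper, and the inline sample stands in for real cluster output.

```python
import json

LABEL_VALUE_MAX = 63  # Kubernetes label-value limit


def oversized_service_accounts(sa_list_json: str) -> list[tuple[str, str, int]]:
    """Return (namespace, name, length) for every SA name over the limit.

    Expects the JSON list printed by `kubectl get serviceaccounts -A -o json`.
    """
    hits = []
    for item in json.loads(sa_list_json).get("items", []):
        meta = item["metadata"]
        name = meta["name"]
        if len(name) > LABEL_VALUE_MAX:
            hits.append((meta.get("namespace", ""), name, len(name)))
    return hits


# Inline sample standing in for real kubectl output.
sample = json.dumps({"items": [
    {"metadata": {
        "namespace": "dra-driver-nvidia-gpu",
        "name": "nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu"
                "-service-account-kubeletplugin",
    }},
    {"metadata": {"namespace": "default", "name": "default"}},
]})

for namespace, name, length in oversized_service_accounts(sample):
    print(f"{namespace}/{name}: {length} chars")
```

Against a real cluster, the kubectl output can be piped to a script that passes `sys.stdin.read()` to the same function.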

### What environment did it happen in?

- Modelplane version: v0.1.0-rc.1
- Crossplane: v2.3.2
- provider-helm: v1.2.0; provider-kubernetes: v1.2.1
- Workload cluster: Nebius Managed Kubernetes, Kubernetes v1.34.8, **CNI = Cilium**
- Control plane: KIND
- Inference backend: vLLM v0.7.3 (Llama 3.1 8B), single NVIDIA H200
- NVIDIA DRA driver chart: `dra-driver-nvidia-gpu` v0.4.0 (`NVIDIA_DRIVER_ROOT=/`)

