ServingStack NVIDIA DRA driver fails on Cilium clusters: service-account name exceeds 63-char CiliumIdentity label limit #215

Description

@pluna

What happened?

On a cluster whose CNI is Cilium (seen on Nebius Managed Kubernetes, but it applies to any Cilium cluster), the ServingStack's NVIDIA DRA driver kubelet-plugin pod never starts, so the InferenceCluster's BackendReady condition stays False forever and no GPU is ever bound. Everything else in the serving stack (cert-manager, Envoy Gateway, LeaderWorkerSet, Prometheus, NFD) comes up fine; the gpu.nvidia.com DeviceClass even registers — but the kubelet plugin is stuck Pending and publishes zero ResourceSlices.

Root cause

The pod can't get a network sandbox:

Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error: code = Unknown
desc = failed to setup network for sandbox "...": plugin type="cilium-cni" failed (add):
unable to create endpoint: Cilium API client timeout exceeded

The cilium agent log on the node shows why:

level=warning msg="Key allocation attempt failed" error="unable to allocate ID 36359 for key
[... k8s:io.cilium.k8s.policy.serviceaccount=nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu-service-account-kubeletplugin ...]:
CiliumIdentity.cilium.io \"36359\" is invalid: metadata.labels: Invalid value:
\"nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu-service-account-kubeletplugin\":
must be no more than 63 characters" subsys=allocator

Cilium derives a CiliumIdentity for every pod and copies the pod's service-account name into the label value io.cilium.k8s.policy.serviceaccount. Kubernetes label values are capped at 63 characters. The DRA driver's service account is named:

<helm-release>-dra-driver-nvidia-gpu-service-account-kubeletplugin   # 77 chars for the release in the log above

where <helm-release> is <inferencecluster-name>-<hash> (here nebius-llama-23b5b4e31ad4). The name exceeds 63 chars, so Cilium rejects the identity → the pod never gets an endpoint → its sandbox is never created, so it never starts → DRA never functions.

The chart's own suffix is already 52 chars including the leading hyphen (-dra-driver-nvidia-gpu-service-account-kubeletplugin); with the <inferencecluster-name>-<hash> release prefix the total exceeds 63 regardless of how short the InferenceCluster name is, so it can't be worked around from the API. It needs a fullnameOverride / short serviceAccount.name on the release.
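The length arithmetic can be checked directly. This is a minimal Python sketch using the names from the agent log; the helper name is hypothetical, and the 12-character hash length is an assumption taken from the single release name shown (23b5b4e31ad4), not a confirmed property of the composition function.

```python
# Kubernetes caps label values at 63 characters.
LABEL_VALUE_MAX = 63

# Suffix appended to the Helm release name, as seen in the Cilium agent log.
CHART_SUFFIX = "-dra-driver-nvidia-gpu-service-account-kubeletplugin"


def service_account_name(helm_release: str) -> str:
    """Compose the kubelet-plugin service-account name for a release (hypothetical helper)."""
    return helm_release + CHART_SUFFIX


observed = service_account_name("nebius-llama-23b5b4e31ad4")
print(len(CHART_SUFFIX))  # 52
print(len(observed))      # 77

# Assumption: the hash is always 12 characters, as in the observed release.
# Even a one-character InferenceCluster name is then too long.
shortest = service_account_name("x-" + "0" * 12)
print(len(shortest))                    # 66
print(len(shortest) > LABEL_VALUE_MAX)  # True
```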

Offending code: functions/compose-serving-stack/function/fn.py:606 (compose_dra_driver) builds the dra-driver-nvidia-gpu Helm release (fn.py:~635) without overriding the chart's fullname / serviceAccount, so the SA inherits the long composed release name.

On EKS/GKE this is harmless — their CNIs don't key a per-pod identity on the SA name — which is why it hasn't shown up before.

Upstream status (Cilium)

This is a known Cilium bug: cilium/cilium#16579. It was fixed in cilium/cilium#39552, which removes the io.cilium.k8s.policy.serviceaccount label from CiliumIdentity entirely (the label wasn't actually consumed by Cilium), rather than truncating or hashing the name.

The upstream fix does not resolve this issue for us in practice:

  • It was merged to main (2025-05-20) and deliberately not backported, so it only exists in Cilium builds cut from main after that date.
  • The affected clusters run managed Cilium (e.g. Nebius), where we don't choose the Cilium build and can't reconfigure identity-relevant labels.

So we can't rely on the upstream fix reaching the clusters that hit this. The fix has to live on the modelplane side.

Fix

In compose_dra_driver, pass the chart a short, fixed fullnameOverride (and/or serviceAccount.name, e.g. dra-gpu) so the kubelet-plugin SA stays well under 63 chars independent of the InferenceCluster name.
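A rough sketch of what that override could look like in Python. This is not the actual code in fn.py: the helper name with_short_names and the constant names are hypothetical, and it assumes the chart honors fullnameOverride and derives the kubelet-plugin SA as <fullname>-service-account-kubeletplugin, which is inferred from the name in the agent log rather than confirmed against the chart templates.

```python
from typing import Any

LABEL_VALUE_MAX = 63

# Hypothetical fixed name, independent of the InferenceCluster name.
DRA_FULLNAME_OVERRIDE = "dra-gpu"

# Assumption: suffix the chart appends to its fullname for the kubelet-plugin SA.
SA_SUFFIX = "-service-account-kubeletplugin"


def with_short_names(values: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the Helm values with a short, fixed fullnameOverride."""
    patched = dict(values)
    patched["fullnameOverride"] = DRA_FULLNAME_OVERRIDE
    return patched


values = with_short_names({"nvidiaDriverRoot": "/"})  # placeholder input values
expected_sa = DRA_FULLNAME_OVERRIDE + SA_SUFFIX

# Fail fast at compose time instead of at pod-sandbox creation.
assert len(expected_sa) <= LABEL_VALUE_MAX, expected_sa
print(expected_sa, len(expected_sa))  # dra-gpu-service-account-kubeletplugin 37
```

If the chart turns out not to route the SA name through its fullname helper, setting serviceAccount.name explicitly would be the alternative mentioned above.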

Recommendation / general rule: any name that may be reused as a Kubernetes label value or DNS-1123 label must stay ≤63 characters, with headroom for suffixes that charts/controllers append. The ServiceAccount object itself permits up to 253 chars, which is why creation succeeds — but anything that copies the name into a 63-char field (as Cilium does) then fails downstream. This is a recurring ecosystem footgun, not Cilium-specific (e.g. aws/aws-cdk#23643 — "serviceAccountName should not be used as a label value; it can exceed 63 characters"). Keeping generated names short by default avoids the whole class of failures across CNIs and controllers.
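One way to enforce this rule is a small guard applied to every generated name before it is handed to a chart. The sketch below is illustrative only: safe_as_label_value and its headroom parameter are hypothetical names, and the headroom value is whatever suffix length the caller expects a chart or controller to append.

```python
import re

LABEL_VALUE_MAX = 63

# Kubernetes label-value syntax: empty, or alphanumeric at both ends with
# '-', '_', '.' and alphanumerics in between.
_LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def safe_as_label_value(name: str, headroom: int = 0) -> bool:
    """True if name is a valid label value with `headroom` characters to spare."""
    if len(name) + headroom > LABEL_VALUE_MAX:
        return False
    return _LABEL_VALUE_RE.fullmatch(name) is not None


# 30 characters of headroom covers "-service-account-kubeletplugin".
print(safe_as_label_value("dra-gpu", headroom=30))  # True
print(safe_as_label_value(
    "nebius-llama-23b5b4e31ad4-dra-driver-nvidia-gpu", headroom=30))  # False
```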

How can we reproduce it?

  1. Have a Cilium-CNI cluster (e.g. Nebius mk8s 1.34) with a GPU node (NVIDIA driver preinstalled).
  2. Put its kubeconfig in a Secret in modelplane-system, label the GPU node modelplane.ai/pool=<pool>.
  3. Apply an InferenceClass (DRA gpu.nvidia.com) and an InferenceCluster (source: Existing) — using a normal-length name like nebius-llama:
    kubectl apply -f inference-class.yaml -f inference-cluster.yaml
    
  4. Watch the ServingStack install on the workload cluster:
    kubectl --kubeconfig <workload> -n dra-driver-nvidia-gpu get pods
    # dra-driver-nvidia-gpu-kubelet-plugin-xxxxx   0/1   Pending  (FailedCreatePodSandBox, cilium timeout)
    kubectl --kubeconfig <workload> get resourceslices   # none
    kubectl get inferencecluster <name> -o jsonpath='{..conditions[?(@.type=="BackendReady")].status}'  # False (Installing) forever
    
  5. Confirm in the cilium agent log: CiliumIdentity ... must be no more than 63 characters on the ...service-account-kubeletplugin SA.

What environment did it happen in?

Modelplane version: v0.1.0-rc.1

  • Crossplane: v2.3.2
  • provider-helm: v1.2.0; provider-kubernetes: v1.2.1
  • Workload cluster: Nebius Managed Kubernetes, Kubernetes v1.34.8, CNI = Cilium
  • Control plane: KIND
  • Inference backend: vLLM v0.7.3 (Llama 3.1 8B), single NVIDIA H200
  • NVIDIA DRA driver chart: dra-driver-nvidia-gpu v0.4.0 (NVIDIA_DRIVER_ROOT=/)
