Skip to content

Name each serving-stack Helm release after its chart#314

Merged
dennis-upbound merged 1 commit into
modelplaneai:mainfrom
dennis-upbound:serving-stack-sa-name-length
Jul 1, 2026
Merged

Name each serving-stack Helm release after its chart#314
dennis-upbound merged 1 commit into
modelplaneai:mainfrom
dennis-upbound:serving-stack-sa-name-length

Conversation

@dennis-upbound

Copy link
Copy Markdown
Collaborator

Description of your changes

Fixes #215.

provider-helm names each Helm release after the composed resource, whose generated name is <inferencecluster>-<hash>. That name is long, unpredictable, and length-capped, and it flows into every chart's fullname and from there into the names of the resources the chart creates. A chart truncates its own fullname to 63 chars and then appends suffixes — the NVIDIA DRA driver's kubelet-plugin ServiceAccount is <fullname>-service-account-kubeletplugin — so the final name can exceed 63 even when the fullname did not. Kubernetes label values are capped at 63 chars, so any consumer that copies such a name into a label value rejects it. On Cilium clusters the per-pod CiliumIdentity keys on the ServiceAccount name, so the DRA driver's kubelet-plugin pod never gets a network sandbox, never schedules, and no GPU is ever bound.

This isn't specific to the DRA driver. Rendering every chart at a realistic release name shows the DRA driver overflows at essentially any InferenceCluster name, while node-feature-discovery and lws overflow once the name is a little longer, and the kube-prometheus-stack node-exporter subchart overflows the same way (it ignores the chart's fullnameOverride and uses the release name directly). The shared cause is the release name, so the fix belongs there rather than per chart.

_helm_release now sets crossplane.io/external-name to mp-<chart>, which provider-helm uses verbatim as the release name. Each release is named for its chart (mp-dra-driver-nvidia-gpu) instead of the InferenceCluster, so every derived name is short and independent of the InferenceCluster name: the budget below holds for any deployment, not only the ones tested. The mp- prefix reserves a release-name namespace for the releases this function owns; without it a release named cert-manager would adopt or collide with a cert-manager release a user already runs in that namespace, and mp-cert-manager cannot.

Limit Tightest case Length
Helm release name ≤ 53 mp-node-feature-discovery 25
ServiceAccount name ≤ 63 mp-dra-driver-nvidia-gpu-service-account-kubeletplugin 54

The DRA driver is the binding constraint at 54/63: it appends the longest suffix (-service-account-kubeletplugin, 30 chars) to a release name that already contains the chart name. Every other chart is smaller. The release name is stable across chart-version upgrades, so provider-helm upgrades in place rather than treating each version as a new release, and the composed resource keeps its generated, hashed metadata.name, so managed-resource uniqueness in the control plane is unchanged.

This does change release identity. On an existing install provider-helm installs fresh releases under the new mp-<chart> names and orphans the old <inferencecluster>-<hash> releases, so migrating a running stack means a teardown and reinstall. That's acceptable pre-1.0, and the Cilium clusters this fixes are non-functional today regardless.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes.
  • Signed off every commit with git commit -s.

Comment on lines +145 to +164
The release is named "mp-<chart>" via the crossplane.io/external-name
annotation, which provider-helm uses verbatim as the Helm release name. Left
unset, the release would inherit the composed resource's generated name
(<inferencecluster>-<hash>) - long and unpredictable, and the source of a
whole class of failures: charts derive their resource names from the release
name, truncate their own fullname to 63 chars, then *append* suffixes like
"-service-account-kubeletplugin", so the final name can exceed 63. Kubernetes
label values are capped at 63 chars, so anything that copies such a name into
a label value rejects it - e.g. Cilium keys a per-pod CiliumIdentity on the
ServiceAccount name and the pod then never gets a network sandbox. See issue
#215.

A fixed "mp-<chart>" name keeps every derived name well under 63 regardless
of the InferenceCluster name: the longest is the DRA driver's kubelet-plugin
ServiceAccount, "mp-dra-driver-nvidia-gpu-service-account-kubeletplugin" at
54 chars. Each chart runs once per workload cluster in its own namespace, so
"mp-<chart>" is a unique, stable release name (stable across chart-version
upgrades, so provider-helm upgrades in place rather than reinstalling). The
composed resource keeps its own generated metadata.name, unique in the
control plane. The "mp-" prefix is reserved for releases this function owns."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This information is important but the comment feels ~3x as long as it needs to be to communicate it.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Helm chart-derived Kubernetes resource names exceeding the 63-character label-value limit (notably breaking NVIDIA DRA driver pods on Cilium clusters) by making provider-helm use short, stable Helm release names that do not depend on the generated composed-resource name.

Changes:

  • Set each Helm Release’s crossplane.io/external-name to mp-<chart> so provider-helm uses a short, predictable Helm release name.
  • Add test coverage to ensure release names remain within Helm’s 53-character limit and that worst-case ServiceAccount name suffixing stays within 63 characters even with an oversized ServingStack/cluster name.
  • Update existing test fixtures to include the new crossplane.io/external-name annotation expectations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
functions/compose-serving-stack/function/fn.py Forces stable mp-<chart> Helm release names via crossplane.io/external-name to prevent downstream 63-char overflows.
functions/compose-serving-stack/tests/test_fn.py Updates expected Release metadata and adds a regression test that validates name-length constraints independent of XR name.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

]
self.assertNotIn("nvidiaDriverRoot", dra_values)

async def test_release_names_short_regardless_of_inferencecluster_name(self) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this new test? The cases above cover the behavior and would capture a regression.

@negz negz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dennis-upbound! Few nits but looks good overall.

provider-helm names each release after the composed resource, whose
generated name is <inferencecluster>-<hash>. That name flows into every
chart's fullname and from there into the chart's resource names. A chart
truncates its fullname to 63 chars and then appends suffixes - the NVIDIA
DRA driver's kubelet-plugin ServiceAccount is
<fullname>-service-account-kubeletplugin - so the final name can exceed 63
even when the fullname did not. Kubernetes label values are capped at 63
chars, so a consumer that copies the name into a label value rejects it: on
Cilium clusters the per-pod CiliumIdentity keys on the ServiceAccount name,
so the DRA driver's kubelet-plugin pod never gets a network sandbox and no
GPU is bound. node-feature-discovery, lws, and the kube-prometheus-stack
node-exporter subchart overflow the same way once the name is long enough.

_helm_release now sets crossplane.io/external-name to mp-<chart>, which
provider-helm uses verbatim as the release name. Naming each release for
its chart makes every derived name short and independent of the
InferenceCluster name; the longest is the DRA driver's kubelet-plugin
ServiceAccount at 54 of 63 chars. The mp- prefix reserves a release-name
namespace for releases this function owns, so an mp-cert-manager release
can't adopt or collide with a cert-manager release a user already runs in
that namespace. The name is stable across chart-version upgrades, so
provider-helm upgrades in place. The composed resource keeps its generated
metadata.name, so control-plane uniqueness is unchanged.

A new test runs the function with an oversized InferenceCluster name and
asserts every release is named mp-<chart>, so the release name - and every
name a chart derives from it - stays within the 63-char label limit no
matter how long the InferenceCluster name is.

This changes release identity. On an existing install provider-helm
installs fresh releases under the new names and orphans the old
<inferencecluster>-<hash> releases, so migrating a running stack needs a
teardown and reinstall.

Fixes modelplaneai#215.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@dennis-upbound dennis-upbound force-pushed the serving-stack-sa-name-length branch from 813bc8d to ec6bb40 Compare July 1, 2026 18:40
@dennis-upbound dennis-upbound merged commit 384c1e5 into modelplaneai:main Jul 1, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ServingStack NVIDIA DRA driver fails on Cilium clusters: service-account name exceeds 63-char CiliumIdentity label limit

3 participants