Improve CEL validation across Modelplane's APIs by negz · Pull Request #160 · modelplaneai/modelplane

negz · 2026-06-16T02:53:15Z

Fixes #28.

Modelplane's XRDs use discriminated unions throughout: an enum field determines which sibling config block is required (e.g. cluster.source: GKE requires a cluster.gke block). Most of these relationships were only documented in prose, so the API server accepted resources that set a discriminator without its matching block, and the error surfaced later when a composition function couldn't find the expected config.

This enforces those unions with x-kubernetes-validations CEL rules, matching the pattern ModelCache already used. InferenceCluster, InferenceClass, InferenceGateway, GKECluster, and EKSCluster each gain rules tying their discriminator to the block it selects.

The same reasoning extends to the resources Modelplane composes rather than users author. A ModelReplica or ServingStack is machine-generated, but the API server still serves and accepts it, so without CEL the consuming function has to re-check the producer's output in code or risk crashing on a malformed hand-written resource. ModelReplica now validates its resolved engine model (the Standalone-or-Leader/Worker shape, worker only on a Worker, the PrefillDecode pairing), so the backends can trust the shape they switch on. ServingStack now requires a Kubeconfig secret, replacing an in-code check that emitted a warning and composed nothing when one was absent.

spec.engines and spec.serving follow the model in Unopinionated ModelDeployments; the ModelReplica rules encode the same invariants against the scheduler-resolved form.

These rules are enforced by the API server, so nix flake check exercises the function code but not the CEL itself. I haven't applied these XRDs to a live cluster and crafted resources to confirm the rules accept and reject as intended. E2E is expensive to run, and this is cheap to iterate on, so I'd rather merge and validate it E2E alongside other changes, then fix any rule that's too strict or too loose.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
~~Added or updated tests covering any composition function changes.~~ The CEL rules are enforced by the API server, not the function; the existing serving-stack tests cover the simplified code path.
Signed off every commit with git commit -s.

Several XRDs use discriminated unions: an enum field determines which sibling config block is required (e.g. cluster.source GKE requires a cluster.gke block). Most of these relationships were only documented in prose, so the API server accepted resources that set a discriminator without its matching config block. The error surfaced late, when a composition function couldn't find the expected config. This commit adds x-kubernetes-validations CEL rules to enforce the discriminator-to-config relationship at admission, matching the pattern already used on ModelCache: - InferenceCluster: cluster.source gates cluster.gke/eks/existing. - InferenceClass: provisioning.provider gates provisioning.gke/eks. - InferenceGateway: spec.backend gates spec.traefik, and traefik.loadBalancer gates traefik.metallb. - GKECluster and EKSCluster: nodePools[].role GPU requires the pool's gpu and zones blocks. ModelReplica is left alone. It's the scheduler-resolved form of ModelDeployment, which already validates these invariants, and whether to duplicate them onto the generated resource is a separate question. Fixes #28. Signed-off-by: Nic Cope <nicc@rk0n.org>

Copilot

Pull request overview

This PR strengthens API-level validation for Modelplane’s CRDs by adding x-kubernetes-validations (CEL) rules that enforce discriminated-union invariants and required companion blocks, reducing the chance of late failures in composition functions and backends.

Changes:

Add CEL rules to multiple XRDs to require the correct sibling config block based on a discriminator field (e.g., backend/provider/source).
Add CEL rules to ModelReplica to enforce scheduler-resolved engine shape invariants (Prefill/Decode phases, member roles, deviceRequests).
Simplify compose-serving-stack by relying on ServingStack CEL validation for required kubeconfig secrets (with a noted compatibility risk for pre-existing resources).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
functions/compose-serving-stack/function/fn.py	Removes in-code kubeconfig secret guarding and assumes schema validation guarantees presence.
apis/servingstacks/definition.yaml	Adds CEL rule requiring a Kubeconfig secret entry in `spec.secrets`.
apis/modelreplicas/definition.yaml	Adds CEL rules enforcing resolved engine/member invariants and PrefillDecode phase requirements.
apis/inferencegateways/definition.yaml	Adds CEL rules enforcing backend/config block coupling (Traefik + MetalLB config).
apis/inferenceclusters/definition.yaml	Adds CEL rules enforcing `cluster.source` discriminated-union requirements (Existing/GKE/EKS blocks).
apis/inferenceclasses/definition.yaml	Adds CEL rules enforcing provisioning provider discriminated-union requirements (GKE/EKS blocks).
apis/gkeclusters/definition.yaml	Adds CEL rules requiring GPU-specific fields when node pool role is GPU.
apis/eksclusters/definition.yaml	Adds CEL rules requiring GPU-specific fields when node pool role is GPU.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ModelReplica is machine-generated: the ModelDeployment compose function fans one deployment into replicas, resolving the scheduler's placement onto each. Users never write one. But the XRD is served and namespaced, so the API server still accepts a hand-written or buggy-producer replica, and its schema carried only a single CEL rule (at least one member per engine claims a device). That left the consuming function trusting a shape it couldn't rely on. The backends already assume the engine model holds: select_backend switches on a Standalone member being present and otherwise dereferences a Leader and Worker, the llm-d backend reads worker.nodes to size the gang, and the disaggregated routing path looks up the Prefill and Decode engines by phase. None of these are checked; a malformed replica makes them crash or compose the wrong workload rather than fail at admission. This commit adds the engine-model invariants to the ModelReplica schema, written against its resolved shape, so the function can trust its input the same way it trusts any observed resource read back from the API server: - An engine is one Standalone member, or one Leader and one or more Workers. - worker may only be set on a Worker member. - PrefillDecode serving requires exactly one Prefill and one Decode engine, and engine phase may only be set under PrefillDecode. These mirror the ModelDeployment rules because both describe the same engine model - ModelDeployment validates the user's input against it, ModelReplica validates the resolved form. Towards #28. Signed-off-by: Nic Cope <nicc@rk0n.org>

ServingStack's spec.secrets needs a Kubeconfig entry: the kubeconfig provides the cluster endpoint and CA cert that every composed ProviderConfig authenticates with. The schema only required the array to be non-empty (minItems: 1), so a ServingStack carrying just a GCPServiceAccountKey passed admission, and the compose function checked for the Kubeconfig in code, emitting a warning and composing nothing when it was absent - a soft failure on a machine-generated resource. This commit adds an x-kubernetes-validations rule requiring at least one secret of type Kubeconfig, so the API server rejects a ServingStack that lacks one and the compose function can trust one is present. The in-code check and its early return are removed; build_provider_configs now reads the Kubeconfig secret directly. Towards #28. Signed-off-by: Nic Cope <nicc@rk0n.org>

dennis-upbound

nice!

Copilot AI review requested due to automatic review settings June 16, 2026 02:53

Copilot started reviewing on behalf of negz June 16, 2026 02:53 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Comment thread functions/compose-serving-stack/function/fn.py

Comment thread functions/compose-serving-stack/function/fn.py

Comment thread functions/compose-serving-stack/function/fn.py

Comment thread apis/modelreplicas/definition.yaml Outdated

negz added 2 commits June 15, 2026 20:15

negz force-pushed the cel-block branch from bb8f53c to b2ae233 Compare June 16, 2026 03:24

dennis-upbound approved these changes Jun 16, 2026

View reviewed changes

dennis-upbound merged commit 2cdb0a4 into main Jun 16, 2026
3 checks passed

negz mentioned this pull request Jun 16, 2026

EKS has no autoscaler installed #166

Closed

negz deleted the cel-block branch June 16, 2026 16:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve CEL validation across Modelplane's APIs#160

Improve CEL validation across Modelplane's APIs#160
dennis-upbound merged 3 commits into
mainfrom
cel-block

negz commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dennis-upbound left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

negz commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dennis-upbound left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants