Skip to content

Improve CEL validation across Modelplane's APIs#160

Merged
dennis-upbound merged 3 commits into
mainfrom
cel-block
Jun 16, 2026
Merged

Improve CEL validation across Modelplane's APIs#160
dennis-upbound merged 3 commits into
mainfrom
cel-block

Conversation

@negz

@negz negz commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Fixes #28.

Modelplane's XRDs use discriminated unions throughout: an enum field determines which sibling config block is required (e.g. cluster.source: GKE requires a cluster.gke block). Most of these relationships were only documented in prose, so the API server accepted resources that set a discriminator without its matching block, and the error surfaced later when a composition function couldn't find the expected config.

This enforces those unions with x-kubernetes-validations CEL rules, matching the pattern ModelCache already used. InferenceCluster, InferenceClass, InferenceGateway, GKECluster, and EKSCluster each gain rules tying their discriminator to the block it selects.

The same reasoning extends to the resources Modelplane composes rather than users author. A ModelReplica or ServingStack is machine-generated, but the API server still serves and accepts it, so without CEL the consuming function has to re-check the producer's output in code or risk crashing on a malformed hand-written resource. ModelReplica now validates its resolved engine model (the Standalone-or-Leader/Worker shape, worker only on a Worker, the PrefillDecode pairing), so the backends can trust the shape they switch on. ServingStack now requires a Kubeconfig secret, replacing an in-code check that emitted a warning and composed nothing when one was absent.

spec.engines and spec.serving follow the model in Unopinionated ModelDeployments; the ModelReplica rules encode the same invariants against the scheduler-resolved form.

These rules are enforced by the API server, so nix flake check exercises the function code but not the CEL itself. I haven't applied these XRDs to a live cluster and crafted resources to confirm the rules accept and reject as intended. E2E is expensive to run, and this is cheap to iterate on, so I'd rather merge and validate it E2E alongside other changes, then fix any rule that's too strict or too loose.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check (or ./nix.sh flake check) and made sure it passes.
  • Added or updated tests covering any composition function changes. The CEL rules are enforced by the API server, not the function; the existing serving-stack tests cover the simplified code path.
  • Signed off every commit with git commit -s.

Several XRDs use discriminated unions: an enum field determines which
sibling config block is required (e.g. cluster.source GKE requires a
cluster.gke block). Most of these relationships were only documented in
prose, so the API server accepted resources that set a discriminator
without its matching config block. The error surfaced late, when a
composition function couldn't find the expected config.

This commit adds x-kubernetes-validations CEL rules to enforce the
discriminator-to-config relationship at admission, matching the pattern
already used on ModelCache:

- InferenceCluster: cluster.source gates cluster.gke/eks/existing.
- InferenceClass: provisioning.provider gates provisioning.gke/eks.
- InferenceGateway: spec.backend gates spec.traefik, and
  traefik.loadBalancer gates traefik.metallb.
- GKECluster and EKSCluster: nodePools[].role GPU requires the pool's
  gpu and zones blocks.

ModelReplica is left alone. It's the scheduler-resolved form of
ModelDeployment, which already validates these invariants, and whether
to duplicate them onto the generated resource is a separate question.

Fixes #28.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Copilot AI review requested due to automatic review settings June 16, 2026 02:53

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens API-level validation for Modelplane’s CRDs by adding x-kubernetes-validations (CEL) rules that enforce discriminated-union invariants and required companion blocks, reducing the chance of late failures in composition functions and backends.

Changes:

  • Add CEL rules to multiple XRDs to require the correct sibling config block based on a discriminator field (e.g., backend/provider/source).
  • Add CEL rules to ModelReplica to enforce scheduler-resolved engine shape invariants (Prefill/Decode phases, member roles, deviceRequests).
  • Simplify compose-serving-stack by relying on ServingStack CEL validation for required kubeconfig secrets (with a noted compatibility risk for pre-existing resources).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
functions/compose-serving-stack/function/fn.py Removes in-code kubeconfig secret guarding and assumes schema validation guarantees presence.
apis/servingstacks/definition.yaml Adds CEL rule requiring a Kubeconfig secret entry in spec.secrets.
apis/modelreplicas/definition.yaml Adds CEL rules enforcing resolved engine/member invariants and PrefillDecode phase requirements.
apis/inferencegateways/definition.yaml Adds CEL rules enforcing backend/config block coupling (Traefik + MetalLB config).
apis/inferenceclusters/definition.yaml Adds CEL rules enforcing cluster.source discriminated-union requirements (Existing/GKE/EKS blocks).
apis/inferenceclasses/definition.yaml Adds CEL rules enforcing provisioning provider discriminated-union requirements (GKE/EKS blocks).
apis/gkeclusters/definition.yaml Adds CEL rules requiring GPU-specific fields when node pool role is GPU.
apis/eksclusters/definition.yaml Adds CEL rules requiring GPU-specific fields when node pool role is GPU.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread functions/compose-serving-stack/function/fn.py
Comment thread functions/compose-serving-stack/function/fn.py
Comment thread functions/compose-serving-stack/function/fn.py
Comment thread apis/modelreplicas/definition.yaml Outdated
negz added 2 commits June 15, 2026 20:15
ModelReplica is machine-generated: the ModelDeployment compose function
fans one deployment into replicas, resolving the scheduler's placement
onto each. Users never write one. But the XRD is served and namespaced,
so the API server still accepts a hand-written or buggy-producer replica,
and its schema carried only a single CEL rule (at least one member per
engine claims a device).

That left the consuming function trusting a shape it couldn't rely on.
The backends already assume the engine model holds: select_backend
switches on a Standalone member being present and otherwise dereferences
a Leader and Worker, the llm-d backend reads worker.nodes to size the
gang, and the disaggregated routing path looks up the Prefill and Decode
engines by phase. None of these are checked; a malformed replica makes
them crash or compose the wrong workload rather than fail at admission.

This commit adds the engine-model invariants to the ModelReplica schema,
written against its resolved shape, so the function can trust its input
the same way it trusts any observed resource read back from the API
server:

- An engine is one Standalone member, or one Leader and one or more
  Workers.
- worker may only be set on a Worker member.
- PrefillDecode serving requires exactly one Prefill and one Decode
  engine, and engine phase may only be set under PrefillDecode.

These mirror the ModelDeployment rules because both describe the same
engine model - ModelDeployment validates the user's input against it,
ModelReplica validates the resolved form.

Towards #28.

Signed-off-by: Nic Cope <nicc@rk0n.org>
ServingStack's spec.secrets needs a Kubeconfig entry: the kubeconfig
provides the cluster endpoint and CA cert that every composed
ProviderConfig authenticates with. The schema only required the array to
be non-empty (minItems: 1), so a ServingStack carrying just a
GCPServiceAccountKey passed admission, and the compose function checked
for the Kubeconfig in code, emitting a warning and composing nothing when
it was absent - a soft failure on a machine-generated resource.

This commit adds an x-kubernetes-validations rule requiring at least one
secret of type Kubeconfig, so the API server rejects a ServingStack that
lacks one and the compose function can trust one is present. The in-code
check and its early return are removed; build_provider_configs now reads
the Kubeconfig secret directly.

Towards #28.

Signed-off-by: Nic Cope <nicc@rk0n.org>

@dennis-upbound dennis-upbound left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@dennis-upbound dennis-upbound merged commit 2cdb0a4 into main Jun 16, 2026
3 checks passed
@negz negz deleted the cel-block branch June 16, 2026 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add CEL validation rules for discriminated unions

3 participants