Improve CEL validation across Modelplane's APIs#160
Merged
Conversation
Several XRDs use discriminated unions: an enum field determines which sibling config block is required (e.g. cluster.source GKE requires a cluster.gke block). Most of these relationships were only documented in prose, so the API server accepted resources that set a discriminator without its matching config block. The error surfaced late, when a composition function couldn't find the expected config. This commit adds x-kubernetes-validations CEL rules to enforce the discriminator-to-config relationship at admission, matching the pattern already used on ModelCache: - InferenceCluster: cluster.source gates cluster.gke/eks/existing. - InferenceClass: provisioning.provider gates provisioning.gke/eks. - InferenceGateway: spec.backend gates spec.traefik, and traefik.loadBalancer gates traefik.metallb. - GKECluster and EKSCluster: nodePools[].role GPU requires the pool's gpu and zones blocks. ModelReplica is left alone. It's the scheduler-resolved form of ModelDeployment, which already validates these invariants, and whether to duplicate them onto the generated resource is a separate question. Fixes #28. Signed-off-by: Nic Cope <nicc@rk0n.org>
There was a problem hiding this comment.
Pull request overview
This PR strengthens API-level validation for Modelplane’s CRDs by adding x-kubernetes-validations (CEL) rules that enforce discriminated-union invariants and required companion blocks, reducing the chance of late failures in composition functions and backends.
Changes:
- Add CEL rules to multiple XRDs to require the correct sibling config block based on a discriminator field (e.g., backend/provider/source).
- Add CEL rules to ModelReplica to enforce scheduler-resolved engine shape invariants (Prefill/Decode phases, member roles, deviceRequests).
- Simplify
compose-serving-stackby relying on ServingStack CEL validation for required kubeconfig secrets (with a noted compatibility risk for pre-existing resources).
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| functions/compose-serving-stack/function/fn.py | Removes in-code kubeconfig secret guarding and assumes schema validation guarantees presence. |
| apis/servingstacks/definition.yaml | Adds CEL rule requiring a Kubeconfig secret entry in spec.secrets. |
| apis/modelreplicas/definition.yaml | Adds CEL rules enforcing resolved engine/member invariants and PrefillDecode phase requirements. |
| apis/inferencegateways/definition.yaml | Adds CEL rules enforcing backend/config block coupling (Traefik + MetalLB config). |
| apis/inferenceclusters/definition.yaml | Adds CEL rules enforcing cluster.source discriminated-union requirements (Existing/GKE/EKS blocks). |
| apis/inferenceclasses/definition.yaml | Adds CEL rules enforcing provisioning provider discriminated-union requirements (GKE/EKS blocks). |
| apis/gkeclusters/definition.yaml | Adds CEL rules requiring GPU-specific fields when node pool role is GPU. |
| apis/eksclusters/definition.yaml | Adds CEL rules requiring GPU-specific fields when node pool role is GPU. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ModelReplica is machine-generated: the ModelDeployment compose function fans one deployment into replicas, resolving the scheduler's placement onto each. Users never write one. But the XRD is served and namespaced, so the API server still accepts a hand-written or buggy-producer replica, and its schema carried only a single CEL rule (at least one member per engine claims a device). That left the consuming function trusting a shape it couldn't rely on. The backends already assume the engine model holds: select_backend switches on a Standalone member being present and otherwise dereferences a Leader and Worker, the llm-d backend reads worker.nodes to size the gang, and the disaggregated routing path looks up the Prefill and Decode engines by phase. None of these are checked; a malformed replica makes them crash or compose the wrong workload rather than fail at admission. This commit adds the engine-model invariants to the ModelReplica schema, written against its resolved shape, so the function can trust its input the same way it trusts any observed resource read back from the API server: - An engine is one Standalone member, or one Leader and one or more Workers. - worker may only be set on a Worker member. - PrefillDecode serving requires exactly one Prefill and one Decode engine, and engine phase may only be set under PrefillDecode. These mirror the ModelDeployment rules because both describe the same engine model - ModelDeployment validates the user's input against it, ModelReplica validates the resolved form. Towards #28. Signed-off-by: Nic Cope <nicc@rk0n.org>
ServingStack's spec.secrets needs a Kubeconfig entry: the kubeconfig provides the cluster endpoint and CA cert that every composed ProviderConfig authenticates with. The schema only required the array to be non-empty (minItems: 1), so a ServingStack carrying just a GCPServiceAccountKey passed admission, and the compose function checked for the Kubeconfig in code, emitting a warning and composing nothing when it was absent - a soft failure on a machine-generated resource. This commit adds an x-kubernetes-validations rule requiring at least one secret of type Kubeconfig, so the API server rejects a ServingStack that lacks one and the compose function can trust one is present. The in-code check and its early return are removed; build_provider_configs now reads the Kubeconfig secret directly. Towards #28. Signed-off-by: Nic Cope <nicc@rk0n.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #28.
Modelplane's XRDs use discriminated unions throughout: an enum field determines which sibling config block is required (e.g.
cluster.source: GKErequires acluster.gkeblock). Most of these relationships were only documented in prose, so the API server accepted resources that set a discriminator without its matching block, and the error surfaced later when a composition function couldn't find the expected config.This enforces those unions with
x-kubernetes-validationsCEL rules, matching the pattern ModelCache already used. InferenceCluster, InferenceClass, InferenceGateway, GKECluster, and EKSCluster each gain rules tying their discriminator to the block it selects.The same reasoning extends to the resources Modelplane composes rather than users author. A ModelReplica or ServingStack is machine-generated, but the API server still serves and accepts it, so without CEL the consuming function has to re-check the producer's output in code or risk crashing on a malformed hand-written resource. ModelReplica now validates its resolved engine model (the Standalone-or-Leader/Worker shape,
workeronly on a Worker, the PrefillDecode pairing), so the backends can trust the shape they switch on. ServingStack now requires a Kubeconfig secret, replacing an in-code check that emitted a warning and composed nothing when one was absent.spec.enginesandspec.servingfollow the model in Unopinionated ModelDeployments; the ModelReplica rules encode the same invariants against the scheduler-resolved form.These rules are enforced by the API server, so
nix flake checkexercises the function code but not the CEL itself. I haven't applied these XRDs to a live cluster and crafted resources to confirm the rules accept and reject as intended. E2E is expensive to run, and this is cheap to iterate on, so I'd rather merge and validate it E2E alongside other changes, then fix any rule that's too strict or too loose.I have:
nix flake check(or./nix.sh flake check) and made sure it passes.Added or updated tests covering any composition function changes.The CEL rules are enforced by the API server, not the function; the existing serving-stack tests cover the simplified code path.git commit -s.