Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenAPI Validation across API types #6398

Open
killianmuldoon opened this issue Apr 8, 2022 · 30 comments
Open

Add OpenAPI Validation across API types #6398

killianmuldoon opened this issue Apr 8, 2022 · 30 comments
Assignees
Labels
area/api Issues or PRs related to the APIs kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon
Copy link
Contributor

Currently validation for our APIs is done, for the most part, in our webhooks. OpenAPI validation could be added (possibly using kubebuilder tags) to make this validation part of our OpenAPI definition. Example validations are max / min, regex matching and enum enforcement - but there's a lot of possibilities enabled by kubebuilder tags.

We currently don't use this type of validation across our API types, but we do have some examples in MachineHealthCheck, MachinePool, ClusterResourceSet and our Bootstrap types.

Putting validation in our OpenAPI specs would move implementation of validation from our webhooks to the Kubernetes tooling (kubectl and/or kube-apiserver). We would still require webhooks for a number of validations which cannot be done with OpenAPI.

Impact
From a UX perspective the change would be, in an ideal scenario, small. We wouldn't be able to customize error messages, but error messages would be more consistent. Users would be able to understand most of our validation by reading the OpenAPI spec instead of reading godoc comments (which may be out of sync with the actual implementation in the webhook).

In reality UX impact could be large and negative because of how defaulting currently works. In our current implementation webhook Defaulting happens before Validation. Users are able to leave fields blank, even though they are required in our validation, and have sane, valid defaults applied.

Currently we have the DefaultValidate test used across our API types which validates this process:

func DefaultValidateTest(object DefaultingValidator) func(*testing.T) {

This is used to test: Machines, MachineSets, MachineHealthChecks, MachineDeployments, ClusterResourceSets, KubeadmConfigs, KubeadmConfigTemplates, MachinePools, KubeadmControlPlane, KubeadmControlPlaneTemplate.

If we move validation to OpenAPI validation will happen before webhooks are called which would require changes in how we do defaulting and which fields we expect to be defined by users.

Defaulting could alternatively, in some cases, be done in the OpenAPI schema, but the process doesn't have a good UX right now. There's an issue open about improving this in Kubernetes at kubernetes/kubernetes#108768

This can also cause rollouts in fields that previously weren't defaulted (#6095 h/t @sbueringer ) meaning that a large implementation and testing effort could be needed to ensure the changes aren't causing new rollouts.

If we end up moving only some validation and defaulting to our OpenAPI spec we end up with validation in multiple places. Which doesn't really improve the explicitness of our API but, in the worst case scenario, could be a large implementation effort and have a negative impact on UX.

Original discussion here: #6383 (review)

As the discussion is on an experimetal API we could try to implement OpenAPI validation on that API only to assess the impact and rolling it out to more parts of the overall CAPI API.

Should we try to move more validation to the OpenAPI spec?
@fabriziopandini @JoelSpeed

/kind feature
/area api

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/api Issues or PRs related to the APIs labels Apr 8, 2022
@fabriziopandini
Copy link
Member

I like the idea of being more explicit in the OpenAPI spec, but TBG I have some concerns about embracing kubebuilder tags for validation because

  • Our current test about defaulting, validation and conversion doesn't trigger open API validation, it this could expose us to errors to slip in as already happened in the past
  • We will end up managing two set of rules (the one implemented in the OpenAPI spec, the one implemented in webhooks)

Given the above reasons, I think we should not put this discussion in the critical path for #6383 and take some more time to figure out to preserve current test coverage while improving out OpenAPI spec

@enxebre
Copy link
Member

enxebre commented Apr 11, 2022

Immutability is one important lack in current CRD OpenAPI validation.
Also the fact that any conceptually required field that you want to default needs to be declared optional for client side validation to not complain e.g kubectl, is suboptimal and confusing.

@vincepri
Copy link
Member

Also the fact that any conceptually required field that you want to default needs to be declared optional for client side validation to not complain e.g kubectl, is suboptimal and confusing.

A defaulted field is optional from a user perspective as they don't have to specify it, with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

@JoelSpeed
Copy link
Contributor

A defaulted field is optional from a user perspective as they don't have to specify it, with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

I just tested this on K8s 1.23 and I don't think that's correct. I added a required field with a default, left it out of the resource and got an error saying the field was required, if I make it optional, the default applies correctly. Seems the defaulting happens after the validation

@fabriziopandini fabriziopandini added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Apr 12, 2022
@JoelSpeed
Copy link
Contributor

We would still require webhooks for a number of validations which cannot be done with OpenAPI.

Is it possible to enumerate a list of the type of validations we expect to carry over in webhooks, were we to make this change?

If we move validation to OpenAPI validation will happen before webhooks are called which would require changes in how we do defaulting and which fields we expect to be defined by users.

Can you elaborate on this point? Is there conditional defaulting (or other defaulting) happening that cannot be handled by OpenAPI? Keen to know where the gaps are in the use case for Cluster API

This can also cause rollouts in fields that previously weren't defaulted (#6095 h/t @sbueringer ) meaning that a large implementation and testing effort could be needed to ensure the changes aren't causing new rollouts.

Adding or changing the behaviour of a default for a field breaks the assumptions of clients writing and consuming this API, so it's a breaking change. As far as I understand, that doesn't change whether we have OpenAPI defaults/validation vs Webhook defaults/validation, does it?

@killianmuldoon
Copy link
Contributor Author

Is it possible to enumerate a list of the type of validations we expect to carry over in webhooks, were we to make this change?

I think so, but it's non-trivial IMO. We'd definitely want to do this well before we'd be updating the API.

Can you elaborate on this point? Is there conditional defaulting (or other defaulting) happening that cannot be handled by OpenAPI? Keen to know where the gaps are in the use case for Cluster API

One of the ones that comes to mind is where we copy a value during defaulting:
e.g

m.Labels[ClusterLabelName] = m.Spec.ClusterName

There's also some places where we e.g. normalize versions:

// tolerate version strings without a "v" prefix: prepend it if it's not there.
if m.Spec.Template.Spec.Version != nil && !strings.HasPrefix(*m.Spec.Template.Spec.Version, "v") {
normalizedVersion := "v" + *m.Spec.Template.Spec.Version
m.Spec.Template.Spec.Version = &normalizedVersion
}

Would we be able to accomplish these with kubebuilder defaulting?

@JoelSpeed
Copy link
Contributor

Would we be able to accomplish these with kubebuilder defaulting?

Conditional stuff like this is out of scope of openapi defaulting

@enxebre
Copy link
Member

enxebre commented Apr 13, 2022

with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

@vincepri That's actually not true afaik, structural schema CRDs don't publish defaults so client-side can't know what it would be defaulted on the server. This results in a surprising UX for fields that are both required and defaulted failing client side validation if they are not set explicitly.

@sbueringer
Copy link
Member

sbueringer commented Apr 26, 2022

A defaulted field is optional from a user perspective as they don't have to specify it, with open api a required field can be defaulted and no error occurs because defaulting happens client-side before hitting the api server

I just tested this on K8s 1.23 and I don't think that's correct. I added a required field with a default, left it out of the resource and got an error saying the field was required, if I make it optional, the default applies correctly. Seems the defaulting happens after the validation

Joel and further down Alberto are correct. kubectl only does OpenAPI validation but no defaulting. I opened an issue for this in k/k: kubernetes/kubernetes#108768.

But in my opinion this is a mostly separate discussion. Independent of if we add more OpenAPI validation or not we will have optional and required fields like today and this will be an issue until it is resolved in k/k. Of course by adding more OpenAPI validation we could theoretically hit this case more often, but I think only in rare edge cases (where the defaulting webhook "fixes up" field values, e.g. adding the v prefix on versions). The most common case of this issue is optional vs required and for those cases we already have OpenAPI validation today on required fields.

So for me it comes down to:

  • adding more OpenAPI validation to the API types to make restrictions/validations more obvious when looking at the CRD / OpenAPI schema vs.
  • keeping it as is and basically doing almost all validations in validation webhooks (except some basic validations we get out of the box like optional/required)

From a maintainer perspective I definitely prefer the validation in code as:

  • it's just a lot more straightforward to unit test (I'm aware we could unit test OpenAPI validation via envtest, but it is slower)
  • it keeps the entire validation logic in one place and thus the entire validation logic is easier to comprehend (while of course the simple OpenAPI validations we could add are less obvious as folks have to look at the code)

From a user perspective it might be better to surface the validations we could express via OpenAPI schema in the OpenAPI schema of the CRDs directly, although this doesn't provide a full picture so it might be misleading. I wonder if having a good description would be better.

Is it possible to enumerate a list of the type of validations we expect to carry over in webhooks, were we to make this change?

A few examples of validations we cannot move to OpenAPI:

I'm aware that there is some ongoing work on an expression language for validation in CRDs which is really nice, but this would only work on newer Kubernetes versions which we cannot assume for a very long time (link: kubernetes/enhancements#2876)

I think there are a lot more cases and looking through our webhook code I'm not sure we could replace a significant percentage through OpenAPI validation.

@JoelSpeed
Copy link
Contributor

@sbueringer thanks for taking the time to think about this, that's a very thorough write up

I just wanted to add some extra context for completeness to your list of things we cannot move.

  • Immutability is on the backlog for CRDs WIP: Introduce support for Immutable fields in CRDs kubernetes/kubernetes#83743. I've spoken to a few of the API SMEs about this and the upshot was to wait for the SSA stuff to stabilise to avoid too many changes at once, but, in theory this feature is coming at some point

  • Validations which check if the combination of values for multiple fields are correct

    I believe this one can be solved with the CEL when that eventually lands right? (I know you've mentioned that this is a long time away but I wanted to add that context assuming I'm correct)

  • Similar case: validations which ensure that the namespace in ownerRefs is the same as in the top-level resource

    I don't think this is correct, there is no namespace field in an owner reference, that was a deliberate API design decision when the feature was introduced

@sbueringer
Copy link
Member

sbueringer commented Apr 27, 2022

I believe this one can be solved with the CEL when that eventually lands right? (I know you've mentioned that this is a long time away but I wanted to add that context assuming I'm correct)

I think you're correct, a lot of of those should be implementable with CEL, not sure if we have something which is too complex for that.

I don't think this is correct, there is no namespace field in an owner reference, that was a deliberate API design decision when the feature was introduced

My bad, I meant ObjectReference instead of OwnerRefs (e.g. https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/clusterclass_types.go#L445-L449 but also at least in MD, MS, Cluster). This might be solved by stop using the not-recommended ObjectReference upstream type and using our own one with only the useful fields instead. In cases were namespace must be always the same as the one of the top-level API type it doesn't make sense to have it configurable. (xref: #6024)

@JoelSpeed
Copy link
Contributor

This might be solved by stop using the not-recommended ObjectReference upstream type and using our own one with only the useful fields instead.

Ahh this makes a lot more sense now 😅 We should definitely do that in the next API version

@vincepri
Copy link
Member

I stand corrected :) Thanks folks for digging through these and getting all the information together!

@sbueringer
Copy link
Member

We should take a look at how/if ServerSideValidation changes the situation: kubernetes/kubernetes#108889

@fabriziopandini fabriziopandini added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 29, 2022
@apelisse
Copy link

Hey there, wg-api-expression lead here, we're generally interested in building more expressive APIs,

Just read this thread, a few comments:

  • The whole discussion about required/defaulted fields. The default field in openapi is "informational", it tells you what value the server will use if no value is specified, that sounds entirely incompatible with a required field. Also it's not an indication that clients should default themselves to that value. It's easy to fix though, just don't make the field required if it has a default (we could forbid that in the apiserver).
  • Immutability, this will be doable entirely with CEL, we're writing a blog post to document the different use-cases. I agree it's not doable until CEL is enabled.
  • CEL for various things that are too complicated for OpenAPI. Again, not ready yet :-/

We're generally trying to make sure people don't have to maintain a validating webhook, so I'd love to hear the use-cases that you can't get rid of.

@killianmuldoon
Copy link
Contributor Author

We're generally trying to make sure people don't have to maintain a validating webhook, so I'd love to hear the use-cases that you can't get rid of.

The trickiest cases we have rely on the existence of objects, or of fields in other objects. When we create a cluster with ClusterClass field specified we run a get from the API server to ensure the ClusterClass exists, that it has some required fields and that variable definitions in the Cluster meet some of the contstraints in the ClusterClass.

Seeing as it relies on the spec of another object it seems like this would be hard to do without a webhook or something very expressive.

@sbueringer
Copy link
Member

sbueringer commented Aug 15, 2022

Also commented on the linked k/k issue, but just to mention it here. I think this statement is not correct The default field in openapi is "informational". The APIserver first runs defaulting (based on the default fields) then validation on the server-side while kubectl only runs validation.

Thus we get inconsistent results on client- and server-side

@apelisse
Copy link

Agreed, but I think it's wrong that fields are both required and defaulted, we can't fix that in the api now, but we should strongly recommend not having these.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 13, 2022
@fabriziopandini
Copy link
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2022
@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 14, 2022
@fabriziopandini
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 30, 2022
@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@fabriziopandini
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 12, 2024
@fabriziopandini fabriziopandini removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 16, 2024
@fabriziopandini
Copy link
Member

The Cluster API project currently lacks enough active contributors to adequately respond to all issues and PRs.

@sbueringer summarized our options in #7634

  1. All validation should be done in webhooks (with only very few exceptions like: optional/required, maybe enum)
  2. Validation should be done as much as possible in OpenAPI, everything else in webhooks

I think that given the current status of the project/number of maintainers, and the fact that no one is raising concerns about the expressivity of our OpenAPI schema, only 1 is viable because it is basically the current state and we can keep improving stuff incrementally in a surgical way.

I there won't be comments, I will close this issue and proceed as described above

@fabriziopandini
Copy link
Member

/close
due to lack of feedback

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
due to lack of feedback

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Copy link
Member

sbueringer commented Nov 11, 2024

/reopen

I think we'll have to settle this debate :).

Recently this came up here:

If we decide to add more OpenAPI validation including CEL:

  • Move the MDR unique validation to CEL (// +kubebuilder:validation:XValidation:rule="self.all(x, self.exists_one(y, x == y))",message="entries in machines must be unique")

I'll see if I can get around to it after my PTO (will basically come down to early next year)

@k8s-ci-robot k8s-ci-robot reopened this Nov 11, 2024
@k8s-ci-robot
Copy link
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen

I think we'll have to settle this debate :).

Recently this came up here:

I'll see if I can get around to it after my PTO (will basically come down to early next year)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer sbueringer self-assigned this Nov 11, 2024
@JoelSpeed
Copy link
Contributor

Fwiw other projects (like gateway API) are using CEL and relying on beta meaning that CEL is enabled in the cluster, so there's precedence for using CEL even though it could in theory be disabled in a supported version if the end user decided to

@sbueringer
Copy link
Member

sbueringer commented Nov 13, 2024

/triage accepted

I'll do some research once I get around to it and then bring this topic up for discussion

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Issues or PRs related to the APIs kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

9 participants