KEP-2149: Cluster id api review request #3084
Conversation
Skipping CI for Draft Pull Request.

/label api-review
E.g. …
Agreed - this seems nice and convenient. We can still claim k8s.io and other 1st party suffixes (we have to) but claiming bare for first party makes sense. Given that we had previously required suffixes for everything, I don't think this will be controversial.
Yes, I believe maxLength validation has been supported since CRDs went v1. Huge +1 on "no webhooks required" for this API group. Users can install their own if they want, but the basic API should be simple.
Reserved bare names and provided examples in ff80ff9. I confirmed … That being said, the OpenAPI …
couldn't the … ?
@liggitt yes, though I brought ASCII up only for the purposes of discussing the limitations of … I don't have very good context on whether restricting input to ASCII is generally acceptable in "arbitrary text" type fields in Kubernetes or out-of-tree APIs. I just checked live against a cluster I had up that annotation values are not restricted to ASCII, while things like names/namespaces of course are (as they must be RFC 1123 subdomains). To me the …
Agree that ASCII seems like an artificial restriction here (and in fact, we probably should have done Unicode all over the place, but that's a different argument). I am not a Unicode expert, but IIUC the max Unicode code point is 4 bytes, right? So, worst case, can we say that the payload is 128k bytes (as few as 32k Unicode code points) or 128k code points (up to 512k bytes)? Which one does OpenAPI validation guarantee?
Should we use the "bare" names for the well-known properties defined above?
Yes, 4 bytes is the max. Honestly I'm link-spelunking back and forth between the OpenAPI and JSON specs, and the language of "Unicode character" vs "Unicode code point" seems inconsistent or at best very confusing, but based on some OpenAPI client code examples and as much of the spec as I understand, I think all that OpenAPI can promise is Unicode code point count (which is not necessarily the same as perceived "characters" -- for example 🇺🇸 is 2 Unicode code points). Therefore I think it's best for us to state it in Unicode code point count. I changed the text to try and get all this across in 4f2a5b8. As for 128k code points vs 32k code points: I assume our bottleneck is etcd value limits, and both of those options times 4 bytes are very low compared to that, so I went with the higher limit, 128k code points.
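The code-point arithmetic above is easy to check. Below is an illustrative sketch (not part of the KEP), using the fact that Python's `len` on a `str` counts Unicode code points:

```python
# Illustrative sketch: OpenAPI maxLength counts Unicode code points of the
# decoded string, which is what Python's len() reports for a str.
flag = "\U0001F1FA\U0001F1F8"  # the US flag emoji: two regional-indicator code points

assert len(flag) == 2  # 2 code points, even though it renders as "one" character

# A Unicode code point needs at most 4 bytes in UTF-8, so 128k code points
# (assuming k = 1024) occupy at most 512 KiB -- far below typical etcd
# value-size limits.
max_code_points = 128 * 1024
worst_case_bytes = max_code_points * 4
print(worst_case_bytes)  # 524288
```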
Personally I've become partial to the …
I agree... I wish we'd consistently included kubernetes suffixes on existing annotations and API groups |
I don't have a very strong opinion on which we should use, bare names or … Either way, the important part is that we require users to provide a suffix, and that's been done.
I'm OK with suffix being required.
```diff
@@ -300,10 +300,12 @@ _For example, [CAPN's virtualcluster project](https://github.com/kubernetes-sigs
 The `ClusterProperty` resource provides a way to store identification related, cluster scoped information for multi-cluster tools while creating flexibility for implementations. A cluster may have multiple `ClusterProperty`s, each holding a different identification related value. Each property contains the following information:

 * **Name** - a well known or custom name to identify the property.
-* **Value** - a property-dependent string, up to 128 KB.
+* **Value** - a property-dependent string, up to 128k Unicode code points (see _Note_).
```
Uggh I hate this topic. I can't keep it straight in my head.
If we are asserting length in OpenAPI, do we need to assert an encoding (e.g. UTF-8)? What if the value is NOT valid Unicode - what will the validator do? E.g. can I craft a value of entirely invalid bytes, and will that trip up the validator?
Sorry to keep poking this, but if we get it wrong it's a trap for consumers.
IMO:
If it's a string we should validate that it is actually all unicode characters in some encoding, of some max length in code points.
If it's an array of bytes we should validate the length in bytes and nothing more.
Thoughts?
tl;dr: I think we leave the wording as is. UTF-32 is weird, but I don't think we should solve that here.
Ok, tested this empirically on a kind cluster using a CRD with `ClusterProperty.value`'s maxLength set to 3 (see describe crd), trying to insert one CR whose `value` uses 2 code points ("🇺🇸") and one CR whose `value` uses 4 code points ("🇺🇸🇺🇸"), and also trying with my YAML encoded in UTF-8, UTF-16, and UTF-32 (methodology).
> If we are asserting length in OpenAPI, do we need to assert an encoding (e.g. UTF-8)?
Empirically I confirmed that OpenAPIv3 maxLength is based on the number of Unicode code points of the decoded string, because "🇺🇸" passes validation while "🇺🇸🇺🇸" does not [1]. This holds regardless of whether my YAML was originally in UTF-8 or UTF-16 (and FYI "🇺🇸" uses 8 and 10 bytes in UTF-8 and UTF-16 respectively).

Watching from the outside, my best guess is that by the time OpenAPIv3's maxLength validation code gets involved, the input is already decoded. There is a way to specify encoding validation in OpenAPIv3, BUT it is not exposed in the k8s CRD version of OpenAPIv3 (ref), and I also empirically observed it was not allowed when I tried anyway. I don't think we need to enforce an encoding, since ultimately we are limiting based on code points -- as long as someone's bytes can be decoded into Unicode code points by the time the validator checks, OpenAPIv3 maxLength doesn't care about the encoding and neither do I.

Empirically I also observed that UTF-8 or UTF-16 encoded YAML both work fine, but something doesn't seem to be able to decode UTF-32, as I cannot apply my YAML if it's in UTF-32 [1]. Assuming my methodology to produce a UTF-32 encoded YAML file was correct, the error thrown suggests the yaml->json converter library used by kubectl can't sniff that my file was UTF-32. I'm not sure k8s generally cares whether people write their YAML in UTF-32, but if they do, this might be worth investigating more outside the context of this KEP.
[1] See the output.
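The byte counts quoted above can be reproduced with Python's standard codecs; a small sketch (note that Python's `utf-16`/`utf-32` codecs prepend a BOM, which accounts for the extra bytes):

```python
# Sketch: the same 2-code-point string has different byte lengths per
# encoding, which is why counting decoded code points is the stable metric.
flag = "\U0001F1FA\U0001F1F8"  # "🇺🇸": 2 code points

assert len(flag) == 2                      # code points, regardless of encoding
assert len(flag.encode("utf-8")) == 8      # 4 bytes per code point here
assert len(flag.encode("utf-16")) == 10    # 8 bytes + 2-byte BOM
assert len(flag.encode("utf-16-le")) == 8  # same payload without a BOM
assert len(flag.encode("utf-32")) == 12    # 8 bytes + 4-byte BOM
print("ok")
```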
> What if the value is NOT valid unicode - what will the validator do?
I observe that if I put an invalid UTF-8 byte sequence in as the `value`, aka something that cannot be interpreted into a Unicode code point when decoding from UTF-8, it gets validated against the number of resulting replacement characters ("�"), which is one per invalid byte. I think for our purposes this means it didn't blow up, and that is acceptable (as I'm sure we don't intend to support invalid bytes in UTF-* for this field). Specifically I tried cases 3.1.3 and 3.5.1 in this UTF-8 test file.

Importantly, regarding the methodology of this test: I did not copy paste between browsers / text editors but instead painstakingly (though probably not in the most efficient way) manipulated it in a Python3 interpreter (the language I know best when it comes to the quirks of the string implementation). In summary, I `wget`ted that file, read it into a Python interpreter as bytes only, concatenated the bad bytes into the `value` field of the bytes of my YAML, wrote it to file as bytes, and never decoded / re-encoded it. I even double checked, by reading it in as bytes again, that the byte sequence was not the byte sequence for the replacement character -- in other words, I feel very confident the bad bytes were not turned into "�" until k8s did so sometime during `kubectl apply`.
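The replacement-character behavior described above can be sketched in Python (the apiserver's Go decoder is what the test actually exercised, so this is only an analogous illustration):

```python
# Sketch: decoding invalid UTF-8 with a "replace" error handler yields one
# U+FFFD replacement character per invalid byte, so a code-point length
# check still sees a well-defined count instead of blowing up.
bad = b"\xff\xfe\xfa"  # three bytes that are invalid as UTF-8

decoded = bad.decode("utf-8", errors="replace")
assert decoded == "\ufffd" * 3  # one replacement character per bad byte
print(len(decoded))  # 3
```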
> IMO:
> If it's a string we should validate that it is actually all unicode characters in some encoding, of some max length in code points.
It is a string; I agree that max length is against code points, and for the reasons above I don't think we should enforce an encoding (and also we can't easily).
> If it's an array of bytes we should validate the length in bytes and nothing more.
It's not. The only openings this way seem to be the OpenAPIv3 formats "byte" (base64 encoded) or "binary" (currently not supported by the k8s CRD implementation of OpenAPIv3), which would require everyone to write CRs in a non-human-readable way like this, which we would never do:
```yaml
apiVersion: about.x-k8s.io/v1alpha1
kind: ClusterProperty
metadata:
  name: clusterproperty-fits
spec:
  value: "8J+HuvCfh7g=" # base64(flag emoji in UTF-8)
```
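The base64 string in the example above can be verified with a quick Python check:

```python
import base64

# Verify the example value: base64 of the US flag emoji encoded as UTF-8.
flag_utf8 = "\U0001F1FA\U0001F1F8".encode("utf-8")  # 8 bytes
encoded = base64.b64encode(flag_utf8).decode("ascii")
assert encoded == "8J+HuvCfh7g="
print(encoded)  # 8J+HuvCfh7g=
```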
For the literary Unicode fans in the house: https://twitter.com/FakeUnicode/status/893305253420027904?s=20&t=18SvHLy3NyffBZGWXXOq1w
Thanks! /lgtm

/assign @JeremyOT

Thanks Laura! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: JeremyOT, lauralorenz, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
This PR requests API review of the `about.k8s.io` API group, containing kind `ClusterProperty`, for KEP-2149: ClusterID for ClusterSet Identification. The names involved (the `about.k8s.io` API group; the `ClusterProperty` kind name; the `id.k8s.io` and `clusterset.k8s.io` well-known CRs) were voted on by surveys sent to the SIG-Multicluster mailing list.