Add top-level Labels and Resources structured fields to HeadGroupSpec and WorkerGroupSpec
#4106
Conversation
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Running … which isn't a top-level API object. I'm wondering if there's a preference for changing it to a …
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
kevin85421
left a comment
We should try to avoid handling the logic of overriding or merging user configurations. It’s hard to ensure correct behavior and makes Ray Autoscaler more complex. My suggestions:
Add validations in `ValidateRayClusterSpec` (a rough sketch of these checks follows the list below):
- Resources
  - If users specify both (1) `num-cpus`/`num-gpus`/`memory`/`resources` in `rayStartParams` and (2) `{Head|Worker}GroupSpec.Resources`, we should fail validation and avoid reconciling anything. Users should only use (1) or (2).
- Labels
  - If users specify `labels` in `rayStartParams`, we should fail the validation because we plan not to handle the string parsing in Ray Autoscaler, as @edoakes said. Only `{Head|Worker}GroupSpec.Labels` is allowed.
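A minimal sketch of what these checks could look like in the cluster spec validation, assuming simple map-typed fields (the function name, parameter types, and error wording are illustrative assumptions, not the final implementation):

```go
package validation

import "fmt"

// validateGroupResourceAndLabelParams is a hypothetical sketch of the checks
// proposed above: structured fields and the equivalent rayStartParams entries
// must not be mixed, and rayStartParams["labels"] is rejected outright.
func validateGroupResourceAndLabelParams(groupName string, rayStartParams, resources map[string]string) error {
	// (1) vs (2): fail validation if both rayStartParams resource flags and the
	// structured Resources field are set for the same group.
	if len(resources) > 0 {
		for _, key := range []string{"num-cpus", "num-gpus", "memory", "resources"} {
			if _, ok := rayStartParams[key]; ok {
				return fmt.Errorf("group %s: rayStartParams[%q] and the top-level Resources field are mutually exclusive", groupName, key)
			}
		}
	}
	// Labels may only come from the structured Labels field; KubeRay will not
	// parse a label string out of rayStartParams.
	if _, ok := rayStartParams["labels"]; ok {
		return fmt.Errorf("group %s: rayStartParams['labels'] is not supported; use the top-level Labels field instead", groupName)
	}
	return nil
}
```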
cc @Future-Outlier @rueian Could one of you open an issue to track updating the compatible Ray versions (because of Ray Autoscaler)? And @rueian, could you work on adding support in Ray Autoscaler for Resources / Labels?
```go
sort.Strings(keys)

for _, k := range keys {
	labels = append(labels, fmt.Sprintf("%s=%s", k, groupLabels[k]))
}
```
Do we need to validate that there is no `,` in `k` and `groupLabels[k]`?
Opened here, thank you! #4113
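For illustration, a minimal sketch of the kind of check being asked about (hypothetical; the real fix is tracked in #4113), assuming the `key=value` pairs above are eventually joined into a comma-separated string, where a stray `,` in a key or value would make the result ambiguous to parse:

```go
package validation

import (
	"fmt"
	"strings"
)

// validateNoCommaInLabels rejects label keys or values containing ',' so the
// generated comma-separated "key=value" list stays unambiguous to parse.
func validateNoCommaInLabels(groupLabels map[string]string) error {
	for k, v := range groupLabels {
		if strings.Contains(k, ",") || strings.Contains(v, ",") {
			return fmt.Errorf("label %q=%q must not contain ','", k, v)
		}
	}
	return nil
}
```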
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
To summarize my points in today's community sync, my proposals for …
Based on my memory, we should also remove labels from the env variables in Ray if we want to go with proposal 2.
We should avoid Ray reading env vars to set Ray labels, no matter which proposal we choose.
Would it be so bad for the Ray autoscaler to check labels in both places (API field and rayStartParams) for some period of time? I think even if we agree that …
No strong opinions if @rueian is willing to maintain the complexity in Ray Autoscaler. I propose removing it for now because I think most users don't use label-based scheduling, and even fewer users use label-based scheduling with the Autoscaler at this moment. It's a good time to get rid of some potential tech debt, and we decided not to support … Have we announced anything about label-based scheduling publicly?
I think leaving the old parsing code untouched in the Ray autoscaler is okay, or just putting a TODO for removing it someday :) What we really need to do is reject any label set by anything other than a direct value in the new top-level labels field. Setting labels via env variables, variable expansions, and rayStartParams["labels"] should all be rejected by KubeRay. This way, we can make sure the Ray autoscaler can get the labels directly from the new top-level labels field.
Proposal 2 sounds good to me too. This PR should contain all the required changes currently and is good to review; I'll update ray-project/ray#57260 to include TODO comments to remove the string parsing logic. I'll also update the label-setting logic so that we don't consider environment variables.
Where are labels currently being set in env vars in KubeRay? We set some default labels in Ray core based on env vars (e.g. accelerator-type for TPU is set using an env var that's set automatically by GKE), but I can remove the ones that we no longer plan to set using KubeRay (e.g. from this closed PR: #3699). Is there anywhere else that env vars are being considered that we need to remove?
Hi, I don't see that we support this in KubeRay now.
Future-Outlier
left a comment
cc @rueian and @andrewsykim for merge
kevin85421
left a comment
I did a brief one-pass review. Overall LGTM.
```go
	return fmt.Errorf("rayStartParams['labels'] is not supported for %s group. Please use the top-level Labels field instead", groupName)
}

// Validate syntax for the top-level Labels field is parsable.
```
For the label values, please make sure they are also valid K8s label values. See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set for more details. I expect there are some utils we can use.
In addition, can we validate it at the OpenAPI level if possible? This is one of the main reasons we decided to update the CRD instead of asking users to specify this in rayStartParams: #4106 (comment).
Maybe IsLabelValue? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/api/validate/content/kube.go#L85
You should check whether it is designed to be user-facing.
+1 We should do full validation for the label keys and values so that we can fail early when applying the CR.
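One possible controller-side check using the apimachinery validation helpers mentioned above (a sketch only; the helper name `validateRayNodeLabels` is made up, and the PR may validate differently):

```go
package validation

import (
	"fmt"

	utilvalidation "k8s.io/apimachinery/pkg/util/validation"
)

// validateRayNodeLabels checks every entry of the top-level Labels field
// against the standard Kubernetes label key/value syntax rules.
func validateRayNodeLabels(groupName string, labels map[string]string) error {
	for key, value := range labels {
		if errs := utilvalidation.IsQualifiedName(key); len(errs) > 0 {
			return fmt.Errorf("group %s: invalid label key %q: %v", groupName, key, errs)
		}
		if errs := utilvalidation.IsValidLabelValue(value); len(errs) > 0 {
			return fmt.Errorf("group %s: invalid value %q for label key %q: %v", groupName, value, key, errs)
		}
	}
	return nil
}
```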
I ran into some issues adding the OpenAPI-level validation, since it seems the rule cost of checking the labels is too high:
```
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "rayservices.ray.io" is invalid: [spec.versions[0].schema.openAPIV3Schema.properties[spec].properties[rayClusterConfig].properties[workerGroupSpecs].items.properties[labels].x-kubernetes-validations[0].rule: Forbidden: estimated rule cost exceeds budget by factor of 6.5x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared), spec.versions[0].schema.openAPIV3Schema.properties[spec].properties[rayClusterConfig].properties[workerGroupSpecs].items.properties[labels].x-kubernetes-validations[1].rule: Forbidden: estimated rule cost exceeds budget by factor of 6.5x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared), spec.versions[0].schema.openAPIV3Schema.properties[spec].properties[rayClusterConfig].properties[workerGroupSpecs].items.properties[labels].x-kubernetes-validations[0].rule: Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.versions[0].schema.openAPIV3Schema.properties[spec].properties[rayClusterConfig].properties[workerGroupSpecs].items.properties[labels].x-kubernetes-validations[1].rule: Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.versions[0].schema.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of 1.303381x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]
```
The above error message was for rules that looked similar to this:
```go
// +kubebuilder:validation:MaxProperties=100
// +kubebuilder:validation:XValidation:rule="self.all(key, size(key) <= 63)", message="label key must be no more than 63 characters"
// +kubebuilder:validation:XValidation:rule="self.all(value, size(value) <= 63)", message="label value must be no more than 63 characters"
// +kubebuilder:validation:XValidation:rule="self.all(key, key.matches('^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$'))", message="a valid label key must be an alphanumeric string, with optional slashes, dots, dashes, and underscores"
// +kubebuilder:validation:XValidation:rule="self.all(value, value.matches('^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$'))", message="a valid label value must be an empty string or an alphanumeric string, with optional dots, dashes, and underscores"
```
I tried setting MaxProperties even lower to values like 10 and removing the syntax rules, but this still ran into the limit. I also think it wouldn't be good to limit the number of possible labels to such a low number. I'm wondering if there's a way to increase the rule cost budget?
For now I just added the validation in the controller; if there are rules we can think of that are accepted by OpenAPI, I can add those as well. Done in: 998e2d9
This workaround also seems possible: kubernetes/kubernetes#121162, but it would involve us defining a separate Label struct with key and value string fields.
Wondering if anyone knows the best practice. cc @andrewsykim @Future-Outlier @kevin85421
I don't have a strong opinion on whether it must be at the OpenAPI level, because the validation in the controller also works pretty well.
I mentioned the OpenAPI validation because @andrewsykim highlighted this during the community sync as a benefit of using a CRD API instead of rayStartParams.
`[]RayNodeLabel` (with a `key` field) is typically preferred to a `map[string]...` in Kubernetes APIs, largely because it is better future-proofed against future changes. The patch mechanisms will understand that the slice is a logical map so long as you use `// +listType` and `// +listMapKey`.
EDIT: We are aware that it is not possible to set a maxLength limit on map keys. This is easier with future versions of OpenAPI, which natively support validations on map keys.
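For illustration, a sketch of the `[]RayNodeLabel` alternative described above, with per-field validation markers (field names, package, and the 63-character limits are assumptions mirroring the rules quoted earlier; as noted below, this PR ultimately keeps `map[string]string` to mirror how labels usually look in k8s):

```go
package rayv1 // hypothetical API types package

// RayNodeLabel is a hypothetical list-item representation of a single Ray node
// label, allowing kubebuilder validation markers on both the key and the value.
type RayNodeLabel struct {
	// +kubebuilder:validation:MaxLength=63
	Key string `json:"key"`
	// +kubebuilder:validation:MaxLength=63
	// +optional
	Value string `json:"value,omitempty"`
}

// On the group spec, the slice would be declared as a logical map so that
// strategic-merge and server-side apply treat entries as keyed by "key":
//
//   // +listType=map
//   // +listMapKey=key
//   // +optional
//   Labels []RayNodeLabel `json:"labels,omitempty"`
```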
@jpbetz agreed that list types are usually preferred, but in this specific case we are introducing an API for Ray node labels so we want to mirror the map[string]string format that is more common for labels in k8s.
I'd missed the significance of this. I've opened kubernetes/kubernetes#134684 to track the lack of key validation support, which is available in JSON Schema 2020-12 via propertyNames.
thanks @jpbetz!
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
kevin85421
left a comment
High level LGTM. Note that I didn't review the implementation details.
andrewsykim
left a comment
LGTM. @ryanaoleary please add webhook-level validation in a follow-up PR. We should follow up with @jpbetz in the future about re-introducing OpenAPI validation for maps with future versions of OpenAPI.
I'll follow up on this. https://json-schema.org/understanding-json-schema/reference/object#propertyNames was introduced in JSON Schema recently, but k8s doesn't yet support it. I'll need to check whether supporting this is compliant with the OpenAPI 3.0 spec.
…Autoscaling config (#57260)

In KubeRay v1.5, new structured `Resources` and `Labels` fields will be added to both the `HeadGroupSpec` and `WorkerGroupSpec` to enable users to explicitly define these values without relying on `rayStartParams`. This support is implemented in ray-project/kuberay#4106. In order for this change to work with the Ray autoscaler, this PR adds support for checking the `resources` and `labels` fields (with precedence over the values in `rayStartParams` and the k8s container spec) and generating the resulting autoscaling config. This change is compatible with older versions of KubeRay that do not have these new fields, since we fall back to the previous logic when the top-level fields aren't specified.

Example: For a group spec specified like this:

```
workerGroupSpecs:
- groupName: worker-group-1
  replicas: 1
  resources:
    GPU: 8
    CPU: 16
  labels:
    ray.io/zone: us-west2-a
    ray.io/region: us-west2
```

The resulting autoscaling config for worker-group-1 would be:

```
"worker-group-1": {
  "labels": {"ray.io/zone": "us-west2-a", "ray.io/region": "us-west2"},
  ...
  "resources": {
    "CPU": 16,
    "GPU": 8,
  },
},
```

## Related issue number

Contributes to #51564

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Why are these changes needed?
This PR lifts both resources and labels into explicit structured fields on the `HeadGroupSpec` and `WorkerGroupSpec`. When these optional fields are specified, they override their respective values in the `rayStartCommand` for Pods created by KubeRay for that group. Additionally, labels specified in the top-level `Labels` field are merged with the K8s labels on the Pod for observability. The discussion and rationale for this change are covered in more detail in #3699. The labels part of this change will help enable the autoscaling use case with `label_selectors` in Ray core.

Related issue number
Contributes to ray-project/ray#51564
Checks