A different API to express colocation and exclusiveness #75
Comments
The antiAffinity rule that you posted for UC1 is hiding the fact that you would insert the following (IIUC):

```yaml
key: job-name
operator: NotIn
values:
- my-job
```

It might be better to keep the "exclusive" semantics:
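For illustration, the fully expanded anti-affinity term with that implicit exclusion might look like this (a sketch; the topology key and the `my-job` value are placeholders, not values taken from the proposal):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: topology.kubernetes.io/rack  # placeholder domain
      labelSelector:
        matchExpressions:
        # Implicitly injected term: repel pods of *other* jobs,
        # so the job never excludes its own pods.
        - key: job-name
          operator: NotIn
          values:
          - my-job
```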
Correct, it is implicit.
So my thinking is that the "exclusive" semantics are supported using the AntiAffinityPods and AntiAffinityNamespaces selectors, since they offer greater flexibility to tune and define exclusiveness (e.g., exclusive against all pods, or only against other jobs, etc.).
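A rough sketch of how those selectors might surface on a replicated job (field names and placement here are only an approximation of the idea, not a settled or actual API):

```yaml
replicatedJobs:
- name: workers
  exclusive:
    topologyKey: topology.kubernetes.io/rack   # domain to be exclusive within
    antiAffinityPods:                          # which pods to be exclusive against
      matchExpressions:
      - key: job-name
        operator: Exists
    antiAffinityNamespaces:                    # which namespaces those pods may live in
      matchLabels:
        team: ml-training                      # hypothetical label
```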
This is my worry then. There is a hidden addition to the anti-affinity rules that is not obvious.
We have to have that implicitly added in all cases, though.
Yes, but IMO it's easier to explain something like "exclusive: true", compared to mutating the matchExpressions that the user provides.
Right, but I feel that we need to provide a knob (a pod selector) anyway to allow users to specify what to be exclusive against, and by definition it shouldn't be against itself. So having another parameter to say "exclusive" would be redundant in that sense, right?
In that case, I don't think the name
It is a selector on the pods, hence the name; any other suggestions?
I would be tempted to call it
Aldo's proposal is quite appealing in its simplicity:
When there is a need for more advanced anti-affinity rules, maybe that can be left to classic anti-affinity expressed on the pod template? (after we have MatchLabelKeys, as pointed out in #27). Btw., consider that there can be a hierarchy of more and more collocated topologies, e.g. a rack within a group of racks within a cluster. Do you want to support a use case of "I want the most collocated placement I can get, but at most X"? See max-distance in the GCP compact placement APIs for inspiration: https://cloud.google.com/sdk/gcloud/reference/beta/compute/resource-policies/create/group-placement.
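For reference, once MatchLabelKeys-style fields land for pod (anti-)affinity, the classic route could look roughly like this (a sketch assuming the matchLabelKeys/mismatchLabelKeys fields from KEP-3633 and a jobset.sigs.k8s.io/jobset-name pod label; not part of any proposal in this thread):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: topology.kubernetes.io/rack   # placeholder domain
      # Only consider pods that belong to some JobSet.
      labelSelector:
        matchExpressions:
        - key: jobset.sigs.k8s.io/jobset-name
          operator: Exists
      # Repel only pods whose job-name differs from this pod's, so a job
      # never excludes itself and no job name has to be hard-coded.
      mismatchLabelKeys:
      - job-name
```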
@alculquicondor quick clarifying question about your suggestion:
Is the below interpretation correct or am I misunderstanding?
We don't have a canonical way of defining hierarchy using node labels, so I don't think we can reliably expose a max-distance API unless we predefine a hierarchical topology.
No, it means the Job itself is colocated in a single domain (node pool), but it can coexist with others. If we are to go with an explicit enum, then it would probably be:
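A purely hypothetical shape for such an enum (the value names below are illustrative assumptions, not the ones actually suggested):

```yaml
# placement: how each replicated job relates to its topology domain
#   Exclusive - one and only one job per domain
#   Colocated - the job lands in a single domain, but may share it
#   None      - no colocation constraint
placement: Colocated
topologyKey: topology.kubernetes.io/rack   # placeholder domain
```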
I don't think it is doable in a reliable and non-surprising way, because we need to have a selector term that prevents the job from excluding itself. The API in the main post makes that selector term explicit, attached to the specified topology and the namespace where this applies.
But it can be done by allowing users to define a list of colocation rules, with the outermost topology being strict while the others are preferred:
If we want to support more levels, then the API should allow setting pod-affinity weights (with higher weights for the innermost ones, as you pointed out offline). At that point, the colocation API drifts closer to pod affinity, so its value becomes less obvious compared to just using pod affinity directly :)
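One way to picture that list-of-rules idea (a hypothetical sketch with made-up field names, not a concrete proposal from this thread):

```yaml
colocation:
# Outermost level: hard requirement (required pod affinity).
- topologyKey: topology.kubernetes.io/zone
  mode: Strict
# Inner levels: best effort, implemented as weighted preferred pod
# affinities, with higher weights for the innermost levels.
- topologyKey: topology.kubernetes.io/rack
  mode: BestEffort
- topologyKey: kubernetes.io/hostname
  mode: BestEffort
```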
Yes, but at that point it's better to enhance the pod spec to support what we need. In the meantime, JobSet can support the minimal requirement: probably just Exclusive and Strict.
Exactly. Btw., since BestEffort/Strict is only a choice for the outermost level (and the inner ones are always BestEffort), the mode could live outside the list:
I'm not sure how the list should support exclusive, though. I guess exclusive makes the most sense when there is only one level. Btw., why would jobs actually want exclusive? Because of noisy neighbors? I suspect what we really want is an all-or-nothing mechanism to avoid deadlocks, and exclusive is just a workaround for that: "I would be fine sharing the rack with someone if we both fit, but I want to avoid deadlocks when two of us fit only partially, so I will keep the rack exclusive as a workaround."
I didn't follow this. The value would be a much simpler API for the user. The implementation would set pod affinities and weights, but the user would only deal with the "colocation" API, not with pod affinities, right? That said, it also makes a lot of sense if you decide to focus on something simpler in the beginning.
Does this mean one pod per machine, or exclusive use of some machine? There are many of our workflows for which we would absolutely not want to share a machine, as it would impact performance.
We didn't actually change the API (it's still an annotation), just its implementation, to improve scheduling throughput for large-scale training. Feel free to re-open this if you want to explore it further.
This issue is about having a proper API, not the optimizations. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its staleness rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its staleness rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The existing `spec.replicatedJobs[*].exclusive` API allows users to express that jobs have a 1:1 assignment to a domain. For example, one and only one job per rack. This is currently expressed as:
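Along these lines (an illustrative sketch; the exact field shape may differ, but the key input is a topology domain such as a rack or node pool):

```yaml
spec:
  replicatedJobs:
  - name: workers
    exclusive:
      topologyKey: topology.kubernetes.io/rack   # one and only one job per rack
```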
This API is simple compared to the complex pod affinity and anti-affinity rules that a user would have needed to add to achieve the same semantics (see #27).
However, one issue with this API is that it is limited to one use case (exclusive 1:1 placement); I would like to discuss making it a bit more generic to support the following use cases without losing simplicity:
Use Cases
UC1: Exclusive 1:1 job-to-domain assignment (what the current API offers)
UC2: Each job is colocated in one domain, but not necessarily exclusive to it (i.e., more than one job could land on the same domain)
UC3: Allow users to express either preferred or required assignment?
What other use cases?
API options
UC3 can be supported by extending the existing API; however, the same can't be said with regard to UC2, since the type name "Exclusive" doesn't lend itself to that use case, even if we do #40.
Option 1
UC1
UC2
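As a strawman for how Option 1 could read for both use cases (hypothetical field names; the original snippets for UC1 and UC2 may differ):

```yaml
# UC1: exclusive 1:1 job-to-domain assignment
colocation:
  topologyKey: topology.kubernetes.io/rack
  exclusive: true
---
# UC2: colocate each job in one domain, but allow sharing it
colocation:
  topologyKey: topology.kubernetes.io/rack
  exclusive: false
```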
What other options?