KEP-4671, KEP-5598: Introducing Workload Aware Scheduling blog post #53012

---
layout: blog
draft: true
title: "Kubernetes v1.35: Introducing Workload Aware Scheduling"
date: 2025-XX-XX
slug: introducing-workload-aware-scheduling
author: >
  Maciej Skoczeń (Google),
  Dominik Marciński (Google)
---

Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod,
as it often requires considering all Pods together instead of scheduling each one independently.
For example, when scheduling a machine learning batch job, you often need to place each worker strategically,
such as on the same rack, to make the entire process as efficient as possible.
At the same time, the Pods that are part of such a workload are very often identical
from the scheduling perspective, which fundamentally changes how this process should look.

There are many custom schedulers adapted to perform workload scheduling efficiently,
but considering how common and important workload scheduling is to Kubernetes users,
especially in the AI era with a growing number of use cases,
it is high time to make workloads first-class citizens for `kube-scheduler` and support them natively.

## Workload aware scheduling

The recent Kubernetes v1.35 release delivered the first tranche of *workload aware scheduling* improvements.
These are part of a wider effort aimed at improving the scheduling and management of workloads.
The effort will span many SIGs and releases, and will gradually expand
the capabilities of the system toward the north star goal:
seamless workload scheduling and management in Kubernetes, including,
but not limited to, preemption and autoscaling.

Kubernetes v1.35 introduces the Workload API, which you can use to describe the desired shape
as well as the scheduling-oriented requirements of a workload. It comes with an initial implementation
of *gang scheduling*, which instructs `kube-scheduler` to schedule a gang's Pods in an *all-or-nothing* fashion.
Finally, we improved scheduling of identical Pods (which typically make up a gang) to speed up the process,
thanks to the *opportunistic batching* feature.

## Workload API

The new Workload API resource is part of the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}}.
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource
determines how a group of Pods should be scheduled and how its placement should be managed
throughout its lifecycle.

A Workload allows you to define a group of Pods and apply a scheduling policy to them.
Here is what a gang scheduling configuration looks like. You can define a `podGroup` named `workers`
and apply the `gang` policy with a `minCount` of 4.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```

When you create your Pods, you link them to this Workload using the new `workloadRef` field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...
```
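
If you want to check the setup in a test cluster, you could apply the manifests above and list the new
resources. This is only a minimal sketch: it assumes the relevant feature gates and the
`scheduling.k8s.io/v1alpha1` API group are enabled, the file names are made up, and the plural resource
name used with `kubectl get` is an assumption.

```bash
# Create the Workload. It does not have to exist before the Pods; Pods that
# reference a missing Workload simply wait in the scheduling queue until it appears.
kubectl apply -f training-job-workload.yaml

# Create the worker Pods that reference the Workload via spec.workloadRef.
kubectl apply -f workers.yaml

# List Workload objects in the namespace (fully qualified to avoid ambiguity).
kubectl get workloads.scheduling.k8s.io -n some-ns
```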

## How gang scheduling works

The `gang` policy enforces *all-or-nothing* placement. Without gang scheduling,
a Job might be partially scheduled, consuming resources without being able to run,
leading to resource wastage and potential deadlocks.

When you create Pods that are part of a gang-scheduled pod group, the scheduler's `GangScheduling`
plugin manages the lifecycle independently for each pod group (or replica key):

1. When you create your Pods (or a controller creates them for you),
   the scheduler blocks them from scheduling until:
   * The referenced Workload object is created.
   * The referenced pod group exists in the Workload.
   * The number of pending Pods in that group meets your `minCount`.

2. Once enough Pods arrive, the scheduler tries to place them. However,
   instead of binding them to nodes immediately, the Pods wait at a `Permit` gate.

3. The scheduler checks whether it has found valid assignments for the entire group (at least `minCount` Pods).
   * If there is room for the group, the gate opens and all Pods are bound to nodes.
   * If only a subset of the group's Pods was successfully scheduled within a timeout
     (set to 5 minutes and not configurable in v1.35), the scheduler rejects **all** of the Pods in the group.
     They go back to the queue, freeing up the reserved resources for other workloads.
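
As an illustration of the all-or-nothing behavior, here is roughly what you might observe with
`minCount: 4` while only three worker Pods referencing the pod group exist. The output below is
hypothetical; the exact columns and status values depend on your cluster and client version.

```bash
# The three existing Pods stay unscheduled until the fourth Pod of the gang
# is created; once it is, all four are placed and bound together.
kubectl get pods -n some-ns
# NAME       READY   STATUS    RESTARTS   AGE
# worker-0   0/1     Pending   0          30s
# worker-1   0/1     Pending   0          30s
# worker-2   0/1     Pending   0          30s
```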

We'd like to point out that while this is a first implementation, the Kubernetes project firmly
intends to improve and expand the gang scheduling algorithm in future releases.
Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang,
workload-level preemption, and more, moving towards the north star goal.

## Opportunistic batching

In addition to explicit gang scheduling, v1.35 introduces *opportunistic batching*.
This is a Beta feature that improves scheduling latency for identical Pods.

Unlike gang scheduling, this feature does not require the Workload API
or any explicit opt-in on the user's part. It works opportunistically within the scheduler
by identifying Pods that have identical scheduling requirements (container images, resource requests,
affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations
for subsequent identical Pods in the queue, significantly speeding up the process.
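
To make that concrete, here is an illustrative manifest (the names and values are made up) of a Job
whose Pods are identical from the scheduling perspective and would therefore be candidates for this
optimization.

```yaml
# All Pods created by this Job share the same image, resource requests, and
# (absence of) affinities, so the scheduler can reuse its feasibility checks.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-workers
  namespace: some-ns
spec:
  completions: 100
  parallelism: 100
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/worker:1.0
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```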

Most users will benefit from this optimization automatically, without taking any special steps,
provided their Pods meet the following criteria.

### Restrictions

Opportunistic batching works under specific conditions. All fields used by the `kube-scheduler`
to find a placement must be identical between Pods. Additionally, using some features
disables the batching mechanism for those Pods to ensure correctness.

Note that you may need to review your `kube-scheduler` configuration
to ensure it is not implicitly disabling batching for your workloads.

See the [docs](TODO) for more details about restrictions.

## The north star vision

The project has a broad ambition to deliver workload aware scheduling.
These new APIs and scheduling enhancements are just the first steps.
In the near future, the effort aims to tackle:

* Introducing a workload scheduling phase
* Improved support for multi-node [DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
  and topology aware scheduling
* Workload-level preemption
* Improved integration between scheduling and autoscaling
* Improved interaction with external workload schedulers
* Managing placement of workloads throughout their entire lifecycle
* Multi-workload scheduling simulations

And more. The priority and implementation order of these focus areas
are subject to change. Stay tuned for further updates.

## Getting started

To try the workload aware scheduling improvements:

* Workload API: Enable the
  [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
  feature gate on both `kube-apiserver` and `kube-scheduler`, and ensure the `scheduling.k8s.io/v1alpha1`
  {{< glossary_tooltip text="API group" term_id="api-group" >}} is enabled.
* Gang scheduling: Enable the
  [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling)
  feature gate on `kube-scheduler` (requires the Workload API to be enabled).
* Opportunistic batching: As a Beta feature, it is enabled by default in v1.35.
  You can disable it using the
  [`OpportunisticBatching`](/docs/reference/command-line-tools-reference/feature-gates/#OpportunisticBatching)
  feature gate on `kube-scheduler` if needed.
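
For clusters where you control the control-plane flags directly, enabling the alpha features might look
roughly like the sketch below. The exact mechanism depends on how your cluster is provisioned
(kubeadm, managed offerings, and so on), so treat this only as an illustration of which flags are involved,
shown in isolation from the rest of your configuration.

```bash
# kube-apiserver: serve the alpha API group and enable the Workload API feature gate
# (add these to your existing kube-apiserver flags).
kube-apiserver \
  --runtime-config=scheduling.k8s.io/v1alpha1=true \
  --feature-gates=GenericWorkload=true

# kube-scheduler: enable the Workload API and gang scheduling feature gates.
kube-scheduler \
  --feature-gates=GenericWorkload=true,GangScheduling=true
```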

We encourage you to try out workload aware scheduling in your test clusters
and share your experiences to help shape the future of Kubernetes scheduling.

You can send your feedback by:

* Reaching out via [Slack (#sig-scheduling)](https://kubernetes.slack.com/archives/C09TP78DV).
* Commenting on the [workload aware scheduling tracking issue](https://github.com/kubernetes/kubernetes/issues/132192).
* Filing a new [issue](https://github.com/kubernetes/enhancements/issues) in the Kubernetes enhancements repository.

## Learn more

* Read the KEPs for
  [Workload API and gang scheduling](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling) and
  [Opportunistic batching](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5598-opportunistic-batching).
* Track the [Workload aware scheduling issue](https://github.com/kubernetes/kubernetes/issues/132192)
  for recent updates.