---
layout: blog
draft: true
title: "Kubernetes v1.35: Introducing Workload Aware Scheduling"
date: 2025-XX-XX
slug: introducing-workload-aware-scheduling
author: >
  Maciej Skoczeń (Google),
  Dominik Marciński (Google)
---

Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod,
as it often requires considering all Pods together instead of scheduling each one independently.
For example, when scheduling a machine learning batch job, you often need to place each worker strategically,
such as on the same rack, to make the entire process as efficient as possible.
At the same time, the Pods that are part of such a workload are very often identical
from the scheduling perspective, which fundamentally changes how this process should look.

Many custom schedulers have been built to perform workload scheduling efficiently,
but considering how common and important workload scheduling is to Kubernetes users,
especially in the AI era with its growing number of use cases,
it is high time to make workloads a first-class citizen for `kube-scheduler` and support them natively.

## Workload aware scheduling

The recent 1.35 release of Kubernetes delivered the first tranche of *workload aware scheduling* improvements.
These are part of a wider effort to improve the scheduling and management of workloads.
The effort will span many SIGs and releases, gradually expanding
the capabilities of the system toward the north star goal:
seamless workload scheduling and management in Kubernetes, including,
but not limited to, preemption and autoscaling.

Kubernetes v1.35 introduces the Workload API that you can use to describe the desired shape
of a workload, as well as its scheduling-oriented requirements. It comes with an initial implementation
of *gang scheduling*, which instructs the `kube-scheduler` to schedule a gang's Pods in an *all-or-nothing* fashion.
Finally, we improved scheduling of identical Pods (which typically make up a gang) to speed up the process,
thanks to the *opportunistic batching* feature.

## Workload API

The new Workload API resource is part of the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}}.
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource
determines how a group of Pods should be scheduled and how its placement should be managed
throughout its lifecycle.

A Workload allows you to define a group of Pods and apply a scheduling policy to them.
Here is what a gang scheduling configuration looks like. You can define a `podGroup` named `workers`
and apply the `gang` policy with a `minCount` of 4.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```

When you create your Pods, you link them to this Workload using the new `workloadRef` field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...
```

## How gang scheduling works

The `gang` policy enforces *all-or-nothing* placement. Without gang scheduling,
a Job might be partially scheduled, consuming resources without being able to run,
leading to resource wastage and potential deadlocks.

When you create Pods that are part of a gang-scheduled pod group, the scheduler's `GangScheduling`
plugin manages the lifecycle independently for each pod group (or replica key):

1. When you create your Pods (or a controller creates them for you),
   the scheduler blocks them from scheduling until:
   * The referenced Workload object is created.
   * The referenced pod group exists in the Workload.
   * The number of pending Pods in that group meets your `minCount`.

2. Once enough Pods arrive, the scheduler tries to place them. However,
   instead of binding them to nodes immediately, the Pods wait at a `Permit` gate.

3. The scheduler checks whether it has found valid assignments for the entire group
   (at least `minCount` Pods).
   * If there is room for the group, the gate opens and all Pods are bound to nodes.
   * If only a subset of the group's Pods was successfully scheduled within a timeout
     (currently fixed at 5 minutes),
     the scheduler rejects **all** of the Pods in the group.
     They go back to the queue, freeing up the reserved resources for other workloads.
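
As a sketch of how this fits together, you could pair the Workload from the earlier example
with an Indexed Job whose Pod template carries the `workloadRef`. This is a hypothetical
configuration: the Job name and container image are placeholders, and it assumes your
controller passes the Pod template's `workloadRef` through to the Pods unchanged:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job        # hypothetical name
  namespace: some-ns
spec:
  completions: 4            # matches the gang's minCount
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      workloadRef:
        name: training-job-workload
        podGroup: workers
      containers:
      - name: worker
        image: example.com/trainer:latest   # placeholder image
```

With `parallelism: 4`, all four Pods become pending together, so the group can clear
the `minCount` check and reach the `Permit` gate in a single wave.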

We'd like to point out that while this is a first implementation, the Kubernetes project firmly
intends to improve and expand the gang scheduling algorithm in future releases.
Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang,
workload-level preemption, and more, moving toward the north star goal.

## Opportunistic batching

In addition to explicit gang scheduling, v1.35 introduces *opportunistic batching*.
This is a Beta feature that improves scheduling latency for identical Pods.

Unlike gang scheduling, this feature does not require the Workload API
or any explicit opt-in on the user's part. It works opportunistically within the scheduler
by identifying Pods that have identical scheduling requirements (container images, resource requests,
affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations
for subsequent identical Pods in the queue, significantly speeding up the process.
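
For illustration, Pods stamped out from a single template, such as the replicas of a
Deployment, are typically identical in exactly this sense. A minimal sketch, where the
name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: identical-workers   # hypothetical name
spec:
  replicas: 50
  selector:
    matchLabels:
      app: identical-workers
  template:
    metadata:
      labels:
        app: identical-workers
    spec:
      containers:
      - name: worker
        image: example.com/worker:latest   # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
```

Because every replica shares the same image, requests, and (absent) affinities, the
scheduler's feasibility work for the first Pod can be reused for the rest of the batch.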

Most users will benefit from this optimization automatically, without taking any special steps,
provided their Pods meet the conditions described in the next section.

### Restrictions

Opportunistic batching works under specific conditions. All fields used by the `kube-scheduler`
to find a placement must be identical between Pods. Additionally, using some features
disables the batching mechanism for those Pods to ensure correctness.

Note that you may need to review your `kube-scheduler` configuration
to ensure it is not implicitly disabling batching for your workloads.

See the [docs](TODO) for more details about restrictions.

## The north star vision

The project has a broad ambition to deliver workload aware scheduling.
These new APIs and scheduling enhancements are just the first steps.
In the near future, the effort aims to tackle:

* Introducing a workload scheduling phase
* Improved support for multi-node [DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
and topology aware scheduling
* Workload-level preemption
* Improved integration between scheduling and autoscaling
* Improved interaction with external workload schedulers
* Managing placement of workloads throughout their entire lifecycle
* Multi-workload scheduling simulations

And more. The priority and implementation order of these focus areas
are subject to change. Stay tuned for further updates.

## Getting started

To try the workload aware scheduling improvements:

* Workload API: Enable the
[`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate on both `kube-apiserver` and `kube-scheduler`, and ensure the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}} is enabled.
* Gang scheduling: Enable the
[`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling)
feature gate on `kube-scheduler` (requires the Workload API to be enabled).
* Opportunistic batching: As a Beta feature, it is enabled by default in v1.35.
You can disable it using the
[`OpportunisticBatching`](/docs/reference/command-line-tools-reference/feature-gates/#OpportunisticBatching)
feature gate on `kube-scheduler` if needed.
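
If you bootstrap clusters with kubeadm, the flags above can be set in one place.
A sketch assuming the `kubeadm.k8s.io/v1beta3` configuration format (adjust to match
your kubeadm version):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Enable the Workload API resource and its alpha API group
    feature-gates: "GenericWorkload=true"
    runtime-config: "scheduling.k8s.io/v1alpha1=true"
scheduler:
  extraArgs:
    # GangScheduling requires the Workload API feature gate as well
    feature-gates: "GenericWorkload=true,GangScheduling=true"
```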

We encourage you to try out workload aware scheduling in your test clusters
and share your experiences to help shape the future of Kubernetes scheduling.

You can send your feedback by:

* Reaching out via [Slack (#sig-scheduling)](https://kubernetes.slack.com/archives/C09TP78DV).
* Commenting on the [workload aware scheduling tracking issue](https://github.com/kubernetes/kubernetes/issues/132192).
* Filing a new [issue](https://github.com/kubernetes/enhancements/issues) in the Kubernetes enhancements repository.

## Learn more

* Read the KEPs for
[Workload API and gang scheduling](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling) and
[Opportunistic batching](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5598-opportunistic-batching).
* Track the [Workload aware scheduling issue](https://github.com/kubernetes/kubernetes/issues/132192)
for recent updates.