1 change: 1 addition & 0 deletions content/en/docs/concepts/scheduling-eviction/_index.md
@@ -24,6 +24,7 @@ of terminating one or more Pods on Nodes.
* [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
* [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
* [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
* [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/)
* [Descheduler](https://github.com/kubernetes-sigs/descheduler#descheduler-for-kubernetes)

## Pod Disruption
50 changes: 50 additions & 0 deletions content/en/docs/concepts/scheduling-eviction/gang-scheduling.md
@@ -0,0 +1,50 @@
---
title: Gang Scheduling
content_type: concept
weight: 70
---

<!-- overview -->
{{< feature-state feature_gate_name="GangScheduling" >}}

Gang scheduling ensures that a group of Pods are scheduled on an "all-or-nothing" basis.
If the cluster cannot accommodate the entire group (or a defined minimum number of Pods),
none of the Pods are bound to a node.

This feature depends on the [Workload API](/docs/concepts/workloads/workload-api/).
Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate and the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster.
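
One hedged way to enable all of this on a kubeadm-managed cluster is sketched below. The `feature-gates` and `runtime-config` flags are standard kube-apiserver and kube-scheduler flags, but the kubeadm wrapping and the exact combination shown here are assumptions; adapt it to however your control plane is configured.

```yaml
# Sketch only: enabling the API group and both feature gates via kubeadm.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Serve the scheduling.k8s.io/v1alpha1 API group.
    runtime-config: "scheduling.k8s.io/v1alpha1=true"
    feature-gates: "GenericWorkload=true"
scheduler:
  extraArgs:
    # Assumption: the scheduler needs both gates for the gang lifecycle.
    feature-gates: "GenericWorkload=true,GangScheduling=true"
```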

<!-- body -->

## How it works

When the `GangScheduling` plugin is enabled, the scheduler alters the lifecycle of Pods that belong
to a pod group with the `gang` [policy](/docs/concepts/workloads/workload-api/policies/) within
a [Workload](/docs/concepts/workloads/workload-api/).
The process follows these steps independently for each pod group and its replica key:

1. The scheduler holds Pods in the `PreEnqueue` phase until:
* The referenced Workload object is created.
* The referenced pod group exists in a Workload.
* The number of Pods that have been created for the specific group
is at least equal to the `minCount`.

Pods do not enter the active scheduling queue until all of these conditions are met.

2. Once the quorum is met, the scheduler attempts to find placements for all Pods in the group.
All assigned Pods wait at the `WaitOnPermit` gate during this process.
Note that in the Alpha phase of this feature, finding a placement is based on pod-by-pod scheduling,
rather than a single-cycle approach.
Review comment (Member), suggested change:
  - rather than a single-cycle approach.
  + rather than a more sophisticated logic capable of scheduling all required pods at once.

Review comment (Member), in the same thread:
  We can still update the merged docs, even after docs freeze.
  The key thing about the deadline is that we must have docs that are at least good enough ahead of the upcoming release.

3. If the scheduler finds valid placements for at least `minCount` Pods,
it allows all of them to be bound to their assigned nodes. If it cannot find placements for the entire group
within a fixed timeout of 5 minutes, none of the Pods are scheduled.
Instead, they are moved to the unschedulable queue to wait for cluster resources to free up,
allowing other workloads to be scheduled in the meantime. This lifecycle is sketched below.
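
The sketch combines a Workload and one of its Pods; it is illustrative rather than authoritative. The object names mirror the examples on the Workload API and Workload reference pages, and the container image is a placeholder.

```yaml
# A "workers" pod group that binds only when 4 Pods can be placed at once.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 4   # the quorum checked in steps 1 and 3
---
# One of the (at least 4) worker Pods. It is held in PreEnqueue until the
# Workload exists and enough Pods of the group have been created.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  containers:
  - name: worker
    image: example.com/trainer:latest   # placeholder image
```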

## {{% heading "whatsnext" %}}

* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
12 changes: 12 additions & 0 deletions content/en/docs/concepts/workloads/_index.md
@@ -66,6 +66,18 @@ of Kubernetes' core. For example, if you wanted to run a group of Pods for your
stop work unless _all_ the Pods are available (perhaps for some high-throughput distributed task),
then you can implement or install an extension that does provide that feature.

## Workload placement

{{< feature-state feature_gate_name="GenericWorkload" >}}

While standard workload resources (like Deployments and Jobs) manage the lifecycle of Pods,
you may have complex scheduling requirements where groups of Pods must be treated as a single unit.

The [Workload API](/docs/concepts/workloads/workload-api/) allows you to define a group of Pods
and apply advanced scheduling policies to them, such as [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/).
This is particularly useful for batch processing and machine learning workloads
where "all-or-nothing" placement is required.
Review comment (Member), suggested change:
  - where "all-or-nothing" placement is required.
  + where "all-or-nothing" scheduling is required.

Review comment (Member), in the same thread:
  I think placement is OK, TBH.

## {{% heading "whatsnext" %}}

As well as reading about each API kind for workload management, you can read how to
12 changes: 12 additions & 0 deletions content/en/docs/concepts/workloads/pods/_index.md
@@ -155,6 +155,18 @@ Here are some examples of workload resources that manage one or more Pods:
* {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
* {{< glossary_tooltip text="DaemonSet" term_id="daemonset" >}}

### Specifying a Workload reference

{{< feature-state feature_gate_name="GenericWorkload" >}}

By default, Kubernetes schedules every Pod individually. However, some tightly coupled applications
need a group of Pods to be scheduled simultaneously to function correctly.

You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object
using a [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
This tells the `kube-scheduler` that the Pod is part of a specific group,
enabling it to make coordinated placement decisions for the entire group at once.

### Pod templates

Controllers for {{< glossary_tooltip text="workload" term_id="workload" >}} resources create Pods
80 changes: 80 additions & 0 deletions content/en/docs/concepts/workloads/pods/workload-reference.md
@@ -0,0 +1,80 @@
---
title: Workload Reference
content_type: concept
weight: 90
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object
to indicate that the Pod belongs to a larger application or group. This enables the scheduler to make decisions
based on the group's requirements rather than treating the Pod as an independent entity.

<!-- body -->

## Specifying a Workload reference

When the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate is enabled, you can use the `spec.workloadRef` field in your Pod manifest.
This field establishes a link to a specific pod group defined within a Workload resource
in the same namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    # The name of the Workload object in the same namespace
    name: training-job-workload
    # The name of the specific pod group inside that Workload
    podGroup: workers
```

### Pod group replicas

For more complex scenarios, you can replicate a single pod group into multiple, independent scheduling units.
You achieve this using the `podGroupReplicaKey` field within a Pod's `workloadRef`. This key acts as a label
to create logical subgroups.

For example, if you have a pod group with `minCount: 2` and you create four Pods: two with `podGroupReplicaKey: "0"`
and two with `podGroupReplicaKey: "1"`, they will be treated as two independent groups of two Pods.

```yaml
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # All workers with the replica key "0" will be scheduled together as one group.
    podGroupReplicaKey: "0"
```
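
As a hedged sketch of the scenario above, the four Pods would differ only in name and replica key, with two Pods per key forming one gang each. The Pod names are illustrative, and unrelated fields are omitted.

```yaml
# Gang "0": worker-0-a plus a second Pod (worker-0-b) with the same key.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0-a
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    podGroupReplicaKey: "0"
---
# Gang "1": worker-1-a plus worker-1-b, scheduled independently of gang "0".
apiVersion: v1
kind: Pod
metadata:
  name: worker-1-a
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    podGroupReplicaKey: "1"
```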

### Behavior

When you define a `workloadRef`, the Pod behaves differently depending on the
[policy](/docs/concepts/workloads/workload-api/policies/) defined in the referenced pod group.

* If the referenced group uses the `basic` policy, the workload reference acts primarily as a grouping label.
* If the referenced group uses the `gang` policy
(and the [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gate is enabled),
the Pod enters a gang scheduling lifecycle. It will wait for other Pods in the group to be created
and scheduled before binding to a node.

### Missing references

The scheduler validates the `workloadRef` before making any placement decisions.

If a Pod references a Workload that does not exist, or a pod group that is not defined within that Workload,
the Pod will remain pending. It is not considered for placement until you create the missing Workload object
or recreate it to include the missing `PodGroup` definition.
Review comment (@lmktfy, Member, Nov 29, 2025): For beta, try for this. Suggested change:
  - or recreate it to include the missing `PodGroup` definition.
  + or recreate it to include the missing pod group definition.

Review comment (Member), in the same thread:
  We can still update the merged docs, even after docs freeze.
  The key thing about the deadline is that we must have docs that are at least good enough ahead of the upcoming release.

This behavior applies to all Pods with a `workloadRef`, regardless of whether the eventual policy will be `basic` or `gang`,
as the scheduler requires the Workload definition to determine the policy.
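
For example, if `worker-0` from the earlier manifest stays pending because its Workload was never created, creating a matching object makes the Pod eligible for scheduling again. This is a sketch: the pod group name must match the Pod's `workloadRef.podGroup`, and `controllerRef` is omitted because this page does not state whether it is required.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload   # matches workloadRef.name
  namespace: some-ns
spec:
  podGroups:
  - name: workers               # matches workloadRef.podGroup
    policy:
      gang:
        minCount: 2
```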

## {{% heading "whatsnext" %}}

* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
* Read the details of [pod group policies](/docs/concepts/workloads/workload-api/policies/).
71 changes: 71 additions & 0 deletions content/en/docs/concepts/workloads/workload-api/_index.md
@@ -0,0 +1,71 @@
---
title: "Workload API"
weight: 20
simple_list: true
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

The Workload API resource allows you to describe the scheduling requirements and structure of a multi-Pod application.
While workload controllers provide runtime behavior for workloads,
the Workload API expresses the scheduling constraints of those workloads, such as a Job.

<!-- body -->

## What is a Workload?

The Workload API resource is part of the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}}
(and your cluster must have that API group enabled, as well as the `GenericWorkload`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
before you can benefit from this API).
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like [Jobs](/docs/concepts/workloads/controllers/job/)
define what to run, the Workload resource determines how a group of Pods should be scheduled
and how its placement should be managed throughout its lifecycle.

## API structure

A Workload allows you to define a group of Pods and apply a scheduling policy to them.
It consists of two sections: a list of pod groups and a reference to a controller.

### Pod groups

The `podGroups` list defines the distinct components of your workload.
For example, a machine learning job might have a `driver` group and a `worker` group.

Each entry in `podGroups` must have:
1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`).

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```

### Referencing the controlling object

The `controllerRef` field links the Workload back to the specific high-level object defining the application,
such as a [Job](/docs/concepts/workloads/controllers/job/) or a custom resource. This is useful for observability and tooling.
This data is not used to schedule or manage the Workload.
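
As a hedged illustration, the `controllerRef` in the example above would point back at a Job like this abridged sketch; only the identifying fields matter for the reference, and the image is a placeholder.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job    # matched by controllerRef.name
  namespace: some-ns
spec:
  parallelism: 4        # illustrative; not read through the Workload
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/trainer:latest   # placeholder image
```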

## {{% heading "whatsnext" %}}

* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
* Learn about [pod group policies](/docs/concepts/workloads/workload-api/policies/).
* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
57 changes: 57 additions & 0 deletions content/en/docs/concepts/workloads/workload-api/policies.md
Review comment (Member):
  nit: should update https://kubernetes.io/docs/concepts/policy/ to hyperlink here.
@@ -0,0 +1,57 @@
---
title: Pod Group Policies
content_type: concept
weight: 10
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

Every pod group defined in a [Workload](/docs/concepts/workloads/workload-api/)
must declare a scheduling policy. This policy dictates how the scheduler treats the collection of Pods.

<!-- body -->

## Policy types

The API currently supports two policy types: `basic` and `gang`.
You must specify exactly one policy for each group.

### Basic policy

The `basic` policy instructs the scheduler to treat all Pods in the group as independent entities,
scheduling them using the standard Kubernetes behavior.

The main reason to use the `basic` policy is to organize the Pods within your Workload
for better observability and management.

This policy can be used for groups of a Workload that do not require simultaneous startup
but logically belong to the application, or to open the way for future group constraints
that do not imply "all-or-nothing" placement.

```yaml
policy:
  basic: {}
```
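
Because the policy is chosen per pod group, a single Workload can presumably mix policies. As a hedged sketch, a training job might use `basic` for its driver and `gang` (described next) for its workers; the group names are illustrative.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
spec:
  podGroups:
  - name: driver
    policy:
      basic: {}       # the driver schedules like any ordinary Pod
  - name: workers
    policy:
      gang:
        minCount: 4   # workers bind only if all 4 can be placed at once
```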

### Gang policy

The `gang` policy enforces "all-or-nothing" scheduling. This is essential for tightly coupled workloads
where partial startup results in deadlocks or wasted resources.
Review comment (Member), suggested change:
  - where partial startup results in deadlocks or wasted resources.
  + need a group of Pods to be scheduled simultaneously to function correctly. Partial startup results in resource waste and may even lead to deadlocks.

Review comment (Member), in the same thread:
  We can still update the merged docs, even after docs freeze.
  The key thing about the deadline is that we must have docs that are at least good enough ahead of the upcoming release.

This can be used for [Jobs](/docs/concepts/workloads/controllers/job/)
or any other batch process where all workers must run concurrently to make progress.

The `gang` policy requires a `minCount` parameter:

```yaml
policy:
  gang:
    # The number of Pods that must be schedulable simultaneously
    # for the group to be admitted.
    minCount: 4
```

## {{% heading "whatsnext" %}}

* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
@@ -0,0 +1,16 @@
---
title: GangScheduling
content_type: feature_gate
_build:
list: never
render: false

stages:
- stage: alpha
defaultValue: false
fromVersion: "1.35"
---

Enables the GangScheduling plugin in kube-scheduler, which implements an "all-or-nothing"
scheduling algorithm. The [Workload API](/docs/concepts/workloads/workload-api/) is used
to express the requirements.
@@ -0,0 +1,17 @@
---
title: GenericWorkload
content_type: feature_gate
_build:
list: never
render: false

stages:
- stage: alpha
defaultValue: false
fromVersion: "1.35"
---

Enables support for the [Workload API](/docs/concepts/workloads/workload-api/) to express scheduling requirements at the workload level.

When enabled, Pods can reference a specific pod group and use this to influence
the way that they are scheduled.