Commit 2a40dd2

Merge pull request #53296 from macsko/gang_scheduling_docs
KEP-4671 Add docs for Workload API and Gang scheduling
2 parents b235bf5 + 451e915 commit 2a40dd2

9 files changed

+316
-0
lines changed

content/en/docs/concepts/scheduling-eviction/_index.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ of terminating one or more Pods on Nodes.
2424
* [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
2525
* [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
2626
* [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
27+
* [Gang Scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/)
2728
* [Descheduler](https://github.com/kubernetes-sigs/descheduler#descheduler-for-kubernetes)
2829

2930
## Pod Disruption
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
---
2+
title: Gang Scheduling
3+
content_type: concept
4+
weight: 70
5+
---
6+
7+
<!-- overview -->
8+
{{< feature-state feature_gate_name="GangScheduling" >}}
9+
10+
Gang scheduling ensures that a group of Pods is scheduled on an "all-or-nothing" basis.
11+
If the cluster cannot accommodate the entire group (or a defined minimum number of Pods),
12+
none of the Pods are bound to a node.
13+
14+
This feature depends on the [Workload API](/docs/concepts/workloads/workload-api/).
15+
Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
16+
feature gate and the `scheduling.k8s.io/v1alpha1`
17+
{{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster.
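As a rough sketch of what enabling these prerequisites could look like on a local test cluster, the config below assumes you use [kind](https://kind.sigs.k8s.io/), which this page does not prescribe. The `featureGates` and `runtimeConfig` keys belong to kind's cluster config; the gate names and the API group are the ones named on this page. It also assumes a node image of a Kubernetes version that ships these alpha features.

```yaml
# Hypothetical kind cluster config for trying out gang scheduling locally.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Enables the Workload API
  GenericWorkload: true
  # Enables the GangScheduling plugin in kube-scheduler
  GangScheduling: true
runtimeConfig:
  # Serve the alpha API group that contains the Workload resource
  "scheduling.k8s.io/v1alpha1": "true"
```

On clusters managed another way, the usual equivalents are the `--feature-gates` flag on the control plane components and the `--runtime-config` flag on `kube-apiserver`.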
18+
19+
<!-- body -->
20+
21+
## How it works
22+
23+
When the `GangScheduling` plugin is enabled, the scheduler alters the lifecycle for Pods belonging
24+
to a `gang` [pod group policy](/docs/concepts/workloads/workload-api/policies/) within
25+
a [Workload](/docs/concepts/workloads/workload-api/).
26+
The process follows these steps independently for each pod group and its replica key:
27+
28+
1. The scheduler holds Pods in the `PreEnqueue` phase until:
29+
* The referenced Workload object is created.
30+
* The referenced pod group exists in a Workload.
31+
* The number of Pods that have been created for the specific group
32+
is at least equal to the `minCount`.
33+
34+
Pods do not enter the active scheduling queue until all of these conditions are met.
35+
36+
2. Once the quorum is met, the scheduler attempts to find placements for all Pods in the group.
37+
All assigned Pods wait at the `WaitOnPermit` gate during this process.
38+
Note that in the Alpha phase of this feature, finding a placement is based on pod-by-pod scheduling,
39+
rather than a single-cycle approach.
40+
41+
3. If the scheduler finds valid placements for at least `minCount` Pods,
42+
it allows all of them to be bound to their assigned nodes. If it cannot find placements for the entire group
43+
within a fixed timeout of 5 minutes, none of the Pods are scheduled.
44+
Instead, they are moved to the unschedulable queue to wait for cluster resources to free up,
45+
allowing other workloads to be scheduled in the meantime.
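The steps above can be illustrated with a minimal sketch: one Workload with a `gang` pod group and two Pods that reference it. All object names, the namespace, and the container image are hypothetical, and `controllerRef` is left out on the assumption that it is optional.

```yaml
# Hypothetical names throughout; field layout follows the Workload API docs.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: gang-demo
  namespace: default
spec:
  # controllerRef omitted (assumed optional)
  podGroups:
  - name: workers
    policy:
      gang:
        # Quorum: both Pods below must exist and be placeable
        minCount: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: gang-demo-0
  namespace: default
spec:
  workloadRef:
    name: gang-demo
    podGroup: workers
  containers:
  - name: main
    image: registry.k8s.io/pause:3.10
---
apiVersion: v1
kind: Pod
metadata:
  name: gang-demo-1
  namespace: default
spec:
  workloadRef:
    name: gang-demo
    podGroup: workers
  containers:
  - name: main
    image: registry.k8s.io/pause:3.10
```

If only `gang-demo-0` is created, it should stay held in `PreEnqueue` (step 1), because fewer than `minCount` Pods exist for the group. Once `gang-demo-1` is created, both Pods become eligible for scheduling and are bound only if placements are found for both.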
46+
47+
## {{% heading "whatsnext" %}}
48+
49+
* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
50+
* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.

content/en/docs/concepts/workloads/_index.md

Lines changed: 12 additions & 0 deletions
@@ -66,6 +66,18 @@ of Kubernetes' core. For example, if you wanted to run a group of Pods for your
6666
stop work unless _all_ the Pods are available (perhaps for some high-throughput distributed task),
6767
then you can implement or install an extension that does provide that feature.
6868

69+
## Workload placement
70+
71+
{{< feature-state feature_gate_name="GenericWorkload" >}}
72+
73+
While standard workload resources (like Deployments and Jobs) manage the lifecycle of Pods,
74+
you may have complex scheduling requirements where groups of Pods must be treated as a single unit.
75+
76+
The [Workload API](/docs/concepts/workloads/workload-api/) allows you to define a group of Pods
77+
and apply advanced scheduling policies to them, such as [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/).
78+
This is particularly useful for batch processing and machine learning workloads
79+
where "all-or-nothing" placement is required.
80+
6981
## {{% heading "whatsnext" %}}
7082

7183
As well as reading about each API kind for workload management, you can read how to

content/en/docs/concepts/workloads/pods/_index.md

Lines changed: 12 additions & 0 deletions
@@ -155,6 +155,18 @@ Here are some examples of workload resources that manage one or more Pods:
155155
* {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
156156
* {{< glossary_tooltip text="DaemonSet" term_id="daemonset" >}}
157157

158+
### Specifying a Workload reference
159+
160+
{{< feature-state feature_gate_name="GenericWorkload" >}}
161+
162+
By default, Kubernetes schedules every Pod individually. However, some tightly-coupled applications
163+
need a group of Pods to be scheduled simultaneously to function correctly.
164+
165+
You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object
166+
using a [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
167+
This tells the `kube-scheduler` that the Pod is part of a specific group,
168+
enabling it to make coordinated placement decisions for the entire group at once.
169+
158170
### Pod templates
159171

160172
Controllers for {{< glossary_tooltip text="workload" term_id="workload" >}} resources create Pods
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
---
2+
title: Workload Reference
3+
content_type: concept
4+
weight: 90
5+
---
6+
7+
<!-- overview -->
8+
{{< feature-state feature_gate_name="GenericWorkload" >}}
9+
10+
You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object
11+
to indicate that the Pod belongs to a larger application or group. This enables the scheduler to make decisions
12+
based on the group's requirements rather than treating the Pod as an independent entity.
13+
14+
<!-- body -->
15+
16+
## Specifying a Workload reference
17+
18+
When the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
19+
feature gate is enabled, you can use the `spec.workloadRef` field in your Pod manifest.
20+
This field establishes a link to a specific pod group defined within a Workload resource
21+
in the same namespace.
22+
23+
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    # The name of the Workload object in the same namespace
    name: training-job-workload
    # The name of the specific pod group inside that Workload
    podGroup: workers
```
36+
37+
### Pod group replicas
38+
39+
For more complex scenarios, you can replicate a single pod group into multiple, independent scheduling units.
40+
You achieve this using the `podGroupReplicaKey` field within a Pod's `workloadRef`. This key acts as a label
41+
to create logical subgroups.
42+
43+
For example, if you have a pod group with `minCount: 2` and you create four Pods: two with `podGroupReplicaKey: "0"`
44+
and two with `podGroupReplicaKey: "1"`, they will be treated as two independent groups of two Pods.
45+
46+
```yaml
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # All workers with the replica key "0" will be scheduled together as one group.
    podGroupReplicaKey: "0"
```
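To make the four-Pod scenario concrete, here is a sketch of two Pods that point at the same pod group but carry different replica keys. The Pod names and the container image are hypothetical; in the scenario described above, each key would have a second Pod of the same shape.

```yaml
# One Pod from replica "0" (hypothetical name and image)
apiVersion: v1
kind: Pod
metadata:
  name: worker-r0-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    podGroupReplicaKey: "0"
  containers:
  - name: main
    image: registry.k8s.io/pause:3.10
---
# One Pod from replica "1": same Workload and pod group, separate gang
apiVersion: v1
kind: Pod
metadata:
  name: worker-r1-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    podGroupReplicaKey: "1"
  containers:
  - name: main
    image: registry.k8s.io/pause:3.10
```

Because the group's `minCount` is evaluated per replica key, Pods with key `"0"` do not count toward the quorum of the Pods with key `"1"`, and the reverse.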
54+
55+
### Behavior
56+
57+
When you define a `workloadRef`, the Pod behaves differently depending on the
58+
[policy](/docs/concepts/workloads/workload-api/policies/) defined in the referenced pod group.
59+
60+
* If the referenced group uses the `basic` policy, the workload reference acts primarily as a grouping label.
61+
* If the referenced group uses the `gang` policy
62+
(and the [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gate is enabled),
63+
the Pod enters a gang scheduling lifecycle. It will wait for other Pods in the group to be created
64+
and scheduled before binding to a node.
65+
66+
### Missing references
67+
68+
The scheduler validates the `workloadRef` before making any placement decisions.
69+
70+
If a Pod references a Workload that does not exist, or a pod group that is not defined within that Workload,
71+
the Pod will remain pending. It is not considered for placement until you create the missing Workload object
72+
or recreate it to include the missing `PodGroup` definition.
73+
74+
This behavior applies to all Pods with a `workloadRef`, regardless of whether the eventual policy will be `basic` or `gang`,
75+
as the scheduler requires the Workload definition to determine the policy.
76+
77+
## {{% heading "whatsnext" %}}
78+
79+
* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
80+
* Read the details of [pod group policies](/docs/concepts/workloads/workload-api/policies/).
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
---
2+
title: "Workload API"
3+
weight: 20
4+
simple_list: true
5+
---
6+
7+
<!-- overview -->
8+
{{< feature-state feature_gate_name="GenericWorkload" >}}
9+
10+
The Workload API resource allows you to describe the scheduling requirements and structure of a multi-Pod application.
11+
While workload controllers provide runtime behavior for the workloads,
12+
the Workload API is intended to provide scheduling constraints for the "true" workloads, such as Jobs.
13+
14+
<!-- body -->
15+
16+
## What is a Workload?
17+
18+
The Workload API resource is part of the `scheduling.k8s.io/v1alpha1`
19+
{{< glossary_tooltip text="API group" term_id="api-group" >}}
20+
(and your cluster must have that API group enabled, as well as the `GenericWorkload`
21+
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
22+
before you can benefit from this API).
23+
This resource acts as a structured, machine-readable definition of the scheduling requirements
24+
of a multi-Pod application. While user-facing workloads like [Jobs](/docs/concepts/workloads/controllers/job/)
25+
define what to run, the Workload resource determines how a group of Pods should be scheduled
26+
and how its placement should be managed throughout its lifecycle.
27+
28+
## API structure
29+
30+
A Workload allows you to define a group of Pods and apply a scheduling policy to them.
31+
It consists of two sections: a list of pod groups and a reference to a controller.
32+
33+
### Pod groups
34+
35+
The `podGroups` list defines the distinct components of your workload.
36+
For example, a machine learning job might have a `driver` group and a `worker` group.
37+
38+
Each entry in `podGroups` must have:
39+
1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
40+
2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`).
41+
42+
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```
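For the machine learning case mentioned earlier, a Workload with two pod groups might look like the sketch below. The Workload name and the `minCount` value are hypothetical, and `controllerRef` is left out on the assumption that it is optional.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: ml-job-workload   # hypothetical name
  namespace: some-ns
spec:
  podGroups:
  # The driver is scheduled like an ordinary Pod
  - name: driver
    policy:
      basic: {}
  # The workers are only useful if all 4 can run together
  - name: worker
    policy:
      gang:
        minCount: 4
```

Each Pod then selects its group by setting `podGroup: driver` or `podGroup: worker` in its `workloadRef`.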
60+
61+
### Referencing a workload controlling object
62+
63+
The `controllerRef` field links the Workload back to the specific high-level object defining the application,
64+
such as a [Job](/docs/concepts/workloads/controllers/job/) or a custom resource. This is useful for observability and tooling.
65+
This data is not used to schedule or manage the Workload.
66+
67+
## {{% heading "whatsnext" %}}
68+
69+
* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
70+
* Learn about [pod group policies](/docs/concepts/workloads/workload-api/policies/).
71+
* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
1+
---
2+
title: Pod Group Policies
3+
content_type: concept
4+
weight: 10
5+
---
6+
7+
<!-- overview -->
8+
{{< feature-state feature_gate_name="GenericWorkload" >}}
9+
10+
Every pod group defined in a [Workload](/docs/concepts/workloads/workload-api/)
11+
must declare a scheduling policy. This policy dictates how the scheduler treats the collection of Pods.
12+
13+
<!-- body -->
14+
15+
## Policy types
16+
17+
The API currently supports two policy types: `basic` and `gang`.
18+
You must specify exactly one policy for each group.
19+
20+
### Basic policy
21+
22+
The `basic` policy instructs the scheduler to treat all Pods in the group as independent entities,
23+
scheduling them using the standard Kubernetes behavior.
24+
25+
The main reason to use the `basic` policy is to organize the Pods within your Workload
26+
for better observability and management.
27+
28+
This policy can be used for groups of a Workload that do not require simultaneous startup
29+
but logically belong to the application, or to open the way for future group constraints
30+
that do not imply "all-or-nothing" placement.
31+
32+
```yaml
policy:
  basic: {}
```
36+
37+
### Gang policy
38+
39+
The `gang` policy enforces "all-or-nothing" scheduling. This is essential for tightly-coupled workloads
40+
where partial startup results in deadlocks or wasted resources.
41+
42+
This can be used for [Jobs](/docs/concepts/workloads/controllers/job/)
43+
or any other batch process where all workers must run concurrently to make progress.
44+
45+
The `gang` policy requires a `minCount` parameter:
46+
47+
```yaml
policy:
  gang:
    # The number of Pods that must be schedulable simultaneously
    # for the group to be admitted.
    minCount: 4
```
54+
55+
## {{% heading "whatsnext" %}}
56+
57+
* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
---
title: GangScheduling
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---
13+
14+
Enables the GangScheduling plugin in kube-scheduler, which implements an "all-or-nothing"
15+
scheduling algorithm. The [Workload API](/docs/concepts/workloads/workload-api/) is used
16+
to express the requirements.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
---
title: GenericWorkload
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---
13+
14+
Enables support for the [Workload API](/docs/concepts/workloads/workload-api/) to express scheduling requirements at the workload level.
15+
16+
When enabled, Pods can reference a specific pod group and use this to influence
17+
the way that they are scheduled.
