
Commit bb4f3f2

KEP-4671 Add docs for Workload API and Gang scheduling
1 parent 786e670 commit bb4f3f2

6 files changed: +288 -0 lines changed

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: Gang Scheduling
content_type: concept
weight: 70
---

<!-- overview -->
{{< feature-state feature_gate_name="GangScheduling" >}}

Gang scheduling ensures that a group of Pods is scheduled on an "all-or-nothing" basis.
If the cluster cannot accommodate the entire group (or a defined minimum number of Pods),
none of the Pods are bound to a node.

This feature depends on the [Workload API](/docs/concepts/workloads/workload-api/).
Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate and the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster.

<!-- body -->

## How it works

When the `GangScheduling` plugin is enabled, the scheduler alters the lifecycle for Pods belonging
to a `gang` [pod group policy](/docs/concepts/workloads/workload-api/policies/) within
a [Workload](/docs/concepts/workloads/workload-api/).
The process follows these steps independently for each pod group and its replica key:

1. The scheduler holds Pods in the `PreEnqueue` phase until:
   * The referenced Workload object is created.
   * The referenced pod group exists in that Workload.
   * The number of Pods that have been created for the specific group
     is at least equal to the `minCount`.

   Pods do not enter the active scheduling queue until all of these conditions are met.

2. Once the quorum is met, the scheduler attempts to find placements for all Pods in the group.
   All assigned Pods wait at the `WaitOnPermit` gate during this process.
   Note that in the Alpha phase of this feature, finding a placement is based on pod-by-pod scheduling,
   rather than a single-cycle approach.

3. If the scheduler finds valid placements for at least `minCount` Pods,
   it allows all of them to be bound to their assigned nodes. If it cannot find placements for the entire group
   within a fixed timeout of 5 minutes, none of the Pods are scheduled.
   Instead, they are moved to the unschedulable queue to wait for cluster resources to free up,
   allowing other workloads to be scheduled in the meantime. A minimal example of a Workload and
   a Pod forming such a gang is shown below.
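
The following is a minimal sketch of a Workload and one of its Pods that together form a gang.
All names (`demo-workload`, `workers`, `worker-0`, the namespace, and the container image) are
illustrative, and the `controllerRef` field described on the
[Workload API](/docs/concepts/workloads/workload-api/) page is omitted for brevity:

```yaml
# A Workload whose "workers" group is schedulable only when
# three worker Pods can be placed at the same time.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: demo-workload
  namespace: demo-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 3
---
# One of the worker Pods. The scheduler holds it in PreEnqueue until
# the Workload above exists and at least minCount worker Pods are created.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: demo-ns
spec:
  workloadRef:
    name: demo-workload
    podGroup: workers
  containers:
  - name: worker
    image: example.com/worker:latest # illustrative image
```

With `minCount: 3`, creating only one or two such Pods keeps them held before scheduling;
once three worker Pods exist and can all be placed, they are bound together.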

## {{% heading "whatsnext" %}}

* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
---
title: Workload Reference
content_type: concept
weight: 90
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object
to indicate that the Pod belongs to a larger application or group. This enables the scheduler to make decisions
based on the group's requirements rather than treating the Pod as an independent entity.

<!-- body -->

## Specifying a Workload reference

When the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate is enabled, you can use the `spec.workloadRef` field in your Pod manifest.
This field establishes a link to a specific pod group defined within a Workload resource
in the same namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    # The name of the Workload object in the same namespace
    name: training-job-workload
    # The name of the specific pod group inside that Workload
    podGroup: workers
```

### Pod group replicas

For more complex scenarios, you can partition a single pod group into replicated, independent scheduling units.
You achieve this using the `podGroupReplicaKey` field within a Pod's `workloadRef`. This key acts as a label
to create logical subgroups.

For example, if you have a pod group with `minCount: 2` and you create four Pods: two with `podGroupReplicaKey: "0"`
and two with `podGroupReplicaKey: "1"`, they will be treated as two independent groups of two Pods.

```yaml
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # All workers with the replica key "0" will be scheduled together as one group.
    podGroupReplicaKey: "0"
```
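
For the four-Pod example above, the matching pod group in the Workload might look like the
following sketch (same illustrative names as above; only the relevant fields are shown):

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # Each replica key ("0" and "1") must independently reach
        # two schedulable Pods before its subgroup is admitted.
        minCount: 2
```

Because each replica key forms an independent scheduling unit, the pair of Pods with key `"0"`
can be admitted even while the pair with key `"1"` is still waiting for resources.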

### Behavior

When you define a `workloadRef`, the Pod behaves differently depending on the
[policy](/docs/concepts/workloads/workload-api/policies/) defined in the referenced pod group.

* If the referenced group uses the `basic` policy, the workload reference acts primarily as a grouping label.
* If the referenced group uses the `gang` policy
  (and the [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gate is enabled),
  the Pod enters a gang scheduling lifecycle. It will wait for other Pods in the group to be created
  and scheduled before binding to a node.

### Missing references

The scheduler validates the `workloadRef` before making any placement decisions.

If a Pod references a Workload that does not exist, or a pod group that is not defined within that Workload,
the Pod will remain pending. It is not considered for placement until you create the missing Workload object
or recreate it to include the missing `PodGroup` definition.

This behavior applies to all Pods with a `workloadRef`, regardless of whether the eventual policy will be `basic` or `gang`,
as the scheduler requires the Workload definition to determine the policy.

## {{% heading "whatsnext" %}}

* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
* Read the details of [pod group policies](/docs/concepts/workloads/workload-api/policies/).
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
---
title: "Workload API"
weight: 20
simple_list: true
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

The Workload API resource allows you to describe the scheduling requirements and structure of a multi-Pod application.
While workload controllers provide runtime behavior for workloads,
the Workload API expresses scheduling constraints for those workloads, such as Jobs.

<!-- body -->

## What is a Workload?

The Workload API resource is part of the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}}.
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like [Jobs](/docs/concepts/workloads/controllers/job/)
define what to run, the Workload resource determines how a group of Pods should be scheduled
and how its placement should be managed throughout its lifecycle.

## API structure

A Workload allows you to define a group of Pods and apply a scheduling policy to them.
It consists of two sections: a list of pod groups and a reference to a controller.

### Pod groups

The `podGroups` list defines the distinct components of your workload.
For example, a machine learning job might have a `driver` group and a `worker` group.

Each entry in `podGroups` must have:

1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`).

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```

### Referencing a workload controlling object

The `controllerRef` field links the Workload back to the specific high-level object defining the application,
such as a [Job](/docs/concepts/workloads/controllers/job/) or a custom CRD. This is useful for observability and tooling.
This data is not used to schedule or manage the Workload.
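
For the example above, the referenced Job could carry the matching `workloadRef` in its Pod template
so that the Pods it creates join the `workers` group. This is only a sketch: the field values mirror
the earlier example, the container image is a placeholder, and it assumes the reference is wired up
manually in the Pod template rather than by any controller integration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: some-ns
spec:
  completions: 4
  parallelism: 4
  template:
    spec:
      # Pods created from this template reference the Workload's "workers" group.
      workloadRef:
        name: training-job-workload
        podGroup: workers
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/training-image:latest # illustrative image
```

Here `parallelism: 4` matches the group's `minCount: 4`, so the Job's Pods are either all placed
together or all kept waiting.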

## {{% heading "whatsnext" %}}

* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
* Learn about [pod group policies](/docs/concepts/workloads/workload-api/policies/).
* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
title: Pod Group Policies
content_type: concept
weight: 10
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

Every pod group defined in a [Workload](/docs/concepts/workloads/workload-api/)
must declare a scheduling policy. This policy dictates how the scheduler treats the collection of Pods.

<!-- body -->

## Policy types

The API currently supports two policy types: `basic` and `gang`.
You must specify exactly one policy for each group.

### Basic policy

The `basic` policy instructs the scheduler to treat all Pods in the group as independent entities,
scheduling them using the standard Kubernetes behavior.

The main reason to use the `basic` policy is to organize the Pods within your Workload
for better observability and management.

This policy can be used for groups of a Workload that do not require simultaneous startup
but logically belong to the application, or to allow for future group constraints
that do not imply "all-or-nothing" placement.

```yaml
policy:
  basic: {}
```

### Gang policy

The `gang` policy enforces "all-or-nothing" scheduling. This is essential for tightly coupled workloads
where partial startup results in deadlocks or wasted resources.

This can be used for [Jobs](/docs/concepts/workloads/controllers/job/)
or any other batch process where all workers must run concurrently to make progress.

The `gang` policy requires a `minCount` parameter:

```yaml
policy:
  gang:
    # The number of Pods that must be schedulable simultaneously
    # for the group to be admitted.
    minCount: 4
```
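
Both policy types can coexist in a single Workload. The following sketch, based on the machine
learning example from the [Workload API](/docs/concepts/workloads/workload-api/) page, pairs a
`basic` driver group with a `gang` worker group; the names are illustrative and only the
`podGroups` section is shown:

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
spec:
  podGroups:
  - name: driver
    policy:
      # The driver Pod is scheduled on its own, with standard behavior.
      basic: {}
  - name: workers
    policy:
      gang:
        # The workers are admitted only when 4 of them can run at once.
        minCount: 4
```

Pods then select their group through the `podGroup` field of their
[Workload reference](/docs/concepts/workloads/pods/workload-reference/).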

## {{% heading "whatsnext" %}}

* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
---
title: GangScheduling
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---

Enables the GangScheduling plugin in kube-scheduler, which implements an "all-or-nothing"
scheduling algorithm. The [Workload API](/docs/concepts/workloads/workload-api/) is used
to express the requirements.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
---
title: GenericWorkload
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---

Enables support for the [Workload API](/docs/concepts/workloads/workload-api/) to express scheduling requirements
at the workload level. Pods can reference a specific Workload pod group using the `spec.workloadRef` field.
The `scheduling.k8s.io/v1alpha1` {{< glossary_tooltip text="API group" term_id="api-group" >}}
must be enabled to make the Workload API available.
