
Commit bb4f3f2

KEP-4671 Add docs for Workload API and Gang scheduling
1 parent 786e670 commit bb4f3f2

6 files changed: +288 -0 lines changed

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: Gang Scheduling
content_type: concept
weight: 70
---

<!-- overview -->
{{< feature-state feature_gate_name="GangScheduling" >}}

Gang scheduling ensures that a group of Pods is scheduled on an "all-or-nothing" basis.
If the cluster cannot accommodate the entire group (or a defined minimum number of Pods),
none of the Pods are bound to a node.

This feature depends on the [Workload API](/docs/concepts/workloads/workload-api/).
Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate and the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster.

<!-- body -->

## How it works

When the `GangScheduling` plugin is enabled, the scheduler alters the lifecycle for Pods belonging
to a `gang` [pod group policy](/docs/concepts/workloads/workload-api/policies/) within
a [Workload](/docs/concepts/workloads/workload-api/).
The process follows these steps independently for each pod group and its replica key:

1. The scheduler holds Pods in the `PreEnqueue` phase until:
   * The referenced Workload object is created.
   * The referenced pod group exists in that Workload.
   * The number of Pods that have been created for the specific group
     is at least equal to the `minCount`.

   Pods do not enter the active scheduling queue until all of these conditions are met.

2. Once the quorum is met, the scheduler attempts to find placements for all Pods in the group.
   All assigned Pods wait at the `WaitOnPermit` gate during this process.
   Note that in the Alpha phase of this feature, finding a placement is based on pod-by-pod scheduling,
   rather than a single-cycle approach.

3. If the scheduler finds valid placements for at least `minCount` Pods,
   it allows all of them to be bound to their assigned nodes. If it cannot find placements for the entire group
   within a fixed timeout of 5 minutes, none of the Pods are scheduled.
   Instead, they are moved to the unschedulable queue to wait for cluster resources to free up,
   allowing other workloads to be scheduled in the meantime. A minimal example of a Workload and
   a Pod forming such a gang is shown below.
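
The following is a minimal sketch of a Workload and one of its Pods that together form a gang.
All names (`demo-workload`, `workers`, `worker-0`, the namespace, and the container image) are
illustrative, and the `controllerRef` field described on the
[Workload API](/docs/concepts/workloads/workload-api/) page is omitted for brevity:

```yaml
# A Workload whose "workers" group is schedulable only when
# three worker Pods can be placed at the same time.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: demo-workload
  namespace: demo-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 3
---
# One of the worker Pods. The scheduler holds it in PreEnqueue until
# the Workload above exists and at least minCount worker Pods are created.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: demo-ns
spec:
  workloadRef:
    name: demo-workload
    podGroup: workers
  containers:
  - name: worker
    image: example.com/worker:latest # illustrative image
```

With `minCount: 3`, creating only one or two such Pods keeps them held before scheduling;
once three worker Pods exist and can all be placed, they are bound together.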

## {{% heading "whatsnext" %}}

* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
---
title: Workload Reference
content_type: concept
weight: 90
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object
to indicate that the Pod belongs to a larger application or group. This enables the scheduler to make decisions
based on the group's requirements rather than treating the Pod as an independent entity.

<!-- body -->

## Specifying a Workload reference

When the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload)
feature gate is enabled, you can use the `spec.workloadRef` field in your Pod manifest.
This field establishes a link to a specific pod group defined within a Workload resource
in the same namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    # The name of the Workload object in the same namespace
    name: training-job-workload
    # The name of the specific pod group inside that Workload
    podGroup: workers
```

### Pod group replicas

For more complex scenarios, you can partition a single pod group into replicated, independent scheduling units.
You achieve this using the `podGroupReplicaKey` field within a Pod's `workloadRef`. This key acts as a label
to create logical subgroups.

For example, if you have a pod group with `minCount: 2` and you create four Pods: two with `podGroupReplicaKey: "0"`
and two with `podGroupReplicaKey: "1"`, they will be treated as two independent groups of two Pods.

```yaml
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # All workers with the replica key "0" will be scheduled together as one group.
    podGroupReplicaKey: "0"
```
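
For the four-Pod example above, the matching pod group in the Workload might look like the
following sketch (same illustrative names as above; only the relevant fields are shown):

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # Each replica key ("0" and "1") must independently reach
        # two schedulable Pods before its subgroup is admitted.
        minCount: 2
```

Because each replica key forms an independent scheduling unit, the pair of Pods with key `"0"`
can be admitted even while the pair with key `"1"` is still waiting for resources.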

### Behavior

When you define a `workloadRef`, the Pod behaves differently depending on the
[policy](/docs/concepts/workloads/workload-api/policies/) defined in the referenced pod group.

* If the referenced group uses the `basic` policy, the workload reference acts primarily as a grouping label.
* If the referenced group uses the `gang` policy
  (and the [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gate is enabled),
  the Pod enters a gang scheduling lifecycle. It will wait for other Pods in the group to be created
  and scheduled before binding to a node.

### Missing references

The scheduler validates the `workloadRef` before making any placement decisions.

If a Pod references a Workload that does not exist, or a pod group that is not defined within that Workload,
the Pod will remain pending. It is not considered for placement until you create the missing Workload object
or recreate it to include the missing `PodGroup` definition.

This behavior applies to all Pods with a `workloadRef`, regardless of whether the eventual policy will be `basic` or `gang`,
as the scheduler requires the Workload definition to determine the policy.

## {{% heading "whatsnext" %}}

* Learn about the [Workload API](/docs/concepts/workloads/workload-api/).
* Read the details of [pod group policies](/docs/concepts/workloads/workload-api/policies/).
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
---
title: "Workload API"
weight: 20
simple_list: true
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

The Workload API resource allows you to describe the scheduling requirements and structure of a multi-Pod application.
While workload controllers provide runtime behavior for workloads,
the Workload API expresses scheduling constraints for those workloads, such as Jobs.

<!-- body -->

## What is a Workload?

The Workload API resource is part of the `scheduling.k8s.io/v1alpha1`
{{< glossary_tooltip text="API group" term_id="api-group" >}}.
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like [Jobs](/docs/concepts/workloads/controllers/job/)
define what to run, the Workload resource determines how a group of Pods should be scheduled
and how its placement should be managed throughout its lifecycle.

## API structure

A Workload allows you to define a group of Pods and apply a scheduling policy to them.
It consists of two sections: a list of pod groups and a reference to a controller.

### Pod groups

The `podGroups` list defines the distinct components of your workload.
For example, a machine learning job might have a `driver` group and a `worker` group.

Each entry in `podGroups` must have:

1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/).
2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`).

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```

### Referencing a workload controlling object

The `controllerRef` field links the Workload back to the specific high-level object defining the application,
such as a [Job](/docs/concepts/workloads/controllers/job/) or a custom CRD. This is useful for observability and tooling.
This data is not used to schedule or manage the Workload.
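
For the example above, the referenced Job could carry the matching `workloadRef` in its Pod template
so that the Pods it creates join the `workers` group. This is only a sketch: the field values mirror
the earlier example, the container image is a placeholder, and it assumes the reference is wired up
manually in the Pod template rather than by any controller integration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: some-ns
spec:
  completions: 4
  parallelism: 4
  template:
    spec:
      # Pods created from this template reference the Workload's "workers" group.
      workloadRef:
        name: training-job-workload
        podGroup: workers
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/training-image:latest # illustrative image
```

Here `parallelism: 4` matches the group's `minCount: 4`, so the Job's Pods are either all placed
together or all kept waiting.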

## {{% heading "whatsnext" %}}

* See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod.
* Learn about [pod group policies](/docs/concepts/workloads/workload-api/policies/).
* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
title: Pod Group Policies
content_type: concept
weight: 10
---

<!-- overview -->
{{< feature-state feature_gate_name="GenericWorkload" >}}

Every pod group defined in a [Workload](/docs/concepts/workloads/workload-api/)
must declare a scheduling policy. This policy dictates how the scheduler treats the collection of Pods.

<!-- body -->

## Policy types

The API currently supports two policy types: `basic` and `gang`.
You must specify exactly one policy for each group.

### Basic policy

The `basic` policy instructs the scheduler to treat all Pods in the group as independent entities,
scheduling them using the standard Kubernetes behavior.

The main reason to use the `basic` policy is to organize the Pods within your Workload
for better observability and management.

This policy can be used for groups of a Workload that do not require simultaneous startup
but logically belong to the application, or to allow for future group constraints
that do not imply "all-or-nothing" placement.

```yaml
policy:
  basic: {}
```

### Gang policy

The `gang` policy enforces "all-or-nothing" scheduling. This is essential for tightly coupled workloads
where partial startup results in deadlocks or wasted resources.

This can be used for [Jobs](/docs/concepts/workloads/controllers/job/)
or any other batch process where all workers must run concurrently to make progress.

The `gang` policy requires a `minCount` parameter:

```yaml
policy:
  gang:
    # The number of Pods that must be schedulable simultaneously
    # for the group to be admitted.
    minCount: 4
```
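
Both policy types can coexist in a single Workload. The following sketch, based on the machine
learning example from the [Workload API](/docs/concepts/workloads/workload-api/) page, pairs a
`basic` driver group with a `gang` worker group; the names are illustrative and only the
`podGroups` section is shown:

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
spec:
  podGroups:
  - name: driver
    policy:
      # The driver Pod is scheduled on its own, with standard behavior.
      basic: {}
  - name: workers
    policy:
      gang:
        # The workers are admitted only when 4 of them can run at once.
        minCount: 4
```

Pods then select their group through the `podGroup` field of their
[Workload reference](/docs/concepts/workloads/pods/workload-reference/).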

## {{% heading "whatsnext" %}}

* Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
---
title: GangScheduling
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---

Enables the GangScheduling plugin in kube-scheduler, which implements an "all-or-nothing"
scheduling algorithm. The [Workload API](/docs/concepts/workloads/workload-api/) is used
to express the requirements.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
---
title: GenericWorkload
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.35"
---

Enables support for the [Workload API](/docs/concepts/workloads/workload-api/) to express scheduling requirements
at the workload level. Pods can reference a specific Workload pod group using the `spec.workloadRef` field.
The `scheduling.k8s.io/v1alpha1` {{< glossary_tooltip text="API group" term_id="api-group" >}}
must be enabled to make the Workload API available.
