KEP-4671 Add docs for Workload API and Gang scheduling #53296
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| --- | ||
| title: Gang Scheduling | ||
| content_type: concept | ||
| weight: 70 | ||
| --- | ||
|
|
||
| <!-- overview --> | ||
| {{< feature-state feature_gate_name="GangScheduling" >}} | ||
|
|
||
| Gang scheduling ensures that a group of Pods is scheduled on an "all-or-nothing" basis. | ||
| If the cluster cannot accommodate the entire group (or a defined minimum number of Pods), | ||
| none of the Pods are bound to a node. | ||
|
|
||
| This feature depends on the [Workload API](/docs/concepts/workloads/workload-api/). | ||
| Ensure the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload) | ||
| feature gate and the `scheduling.k8s.io/v1alpha1` | ||
| {{< glossary_tooltip text="API group" term_id="api-group" >}} are enabled in the cluster. | ||
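
As a rough illustration of the prerequisites above, the fragments below show where the feature gates and the API group could be switched on for a control plane whose components run as static Pods. That setup is an assumption; managed clusters may expose these settings differently. `--feature-gates` and `--runtime-config` are standard kube-apiserver flags, and kube-scheduler also accepts `--feature-gates`. Which component needs which gate is not spelled out here, so setting both gates on both components is a conservative guess rather than a documented requirement.

```yaml
# Fragment of a kube-apiserver static Pod manifest (other flags omitted).
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Assumption: both gates are set on the API server.
    - --feature-gates=GenericWorkload=true,GangScheduling=true
    # Serve the alpha API group that contains the Workload resource.
    - --runtime-config=scheduling.k8s.io/v1alpha1=true
---
# Fragment of a kube-scheduler static Pod manifest (other flags omitted).
spec:
  containers:
  - name: kube-scheduler
    command:
    - kube-scheduler
    # Assumption: both gates are set on the scheduler as well.
    - --feature-gates=GenericWorkload=true,GangScheduling=true
```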
|
|
||
| <!-- body --> | ||
|
|
||
| ## How it works | ||
|
|
||
| When the `GangScheduling` plugin is enabled, the scheduler alters the lifecycle for Pods belonging | ||
| to a `gang` [pod group policy](/docs/concepts/workloads/workload-api/policies/) within | ||
| a [Workload](/docs/concepts/workloads/workload-api/). | ||
| The process follows these steps independently for each pod group and its replica key: | ||
|
|
||
| 1. The scheduler holds Pods in the `PreEnqueue` phase until: | ||
| * The referenced Workload object is created. | ||
| * The referenced pod group exists in a Workload. | ||
| * The number of Pods that have been created for the specific group | ||
| is at least equal to the `minCount`. | ||
|
|
||
| Pods do not enter the active scheduling queue until all of these conditions are met. | ||
|
|
||
| 2. Once the quorum is met, the scheduler attempts to find placements for all Pods in the group. | ||
| All assigned Pods wait at the `WaitOnPermit` gate during this process. | ||
| Note that in the Alpha phase of this feature, finding a placement is based on pod-by-pod scheduling, | ||
| rather than a single-cycle approach. | ||
|
|
||
| 3. If the scheduler finds valid placements for at least `minCount` Pods, | ||
| it allows all of them to be bound to their assigned nodes. If it cannot find placements for the entire group | ||
| within a fixed timeout of 5 minutes, none of the Pods are scheduled. | ||
| Instead, they are moved to the unschedulable queue to wait for cluster resources to free up, | ||
| allowing other workloads to be scheduled in the meantime. | ||
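
To make the steps above concrete, here is a minimal sketch of a gang of three Pods. The object names, the namespace, and the container image are made up for illustration, and `controllerRef` is left out on the assumption that it is optional. The comments trace what the scheduler is expected to do at each stage, based only on the description above.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: example-gang            # hypothetical name
  namespace: default
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 3
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                # worker-1 and worker-2 differ only by name
  namespace: default
spec:
  workloadRef:
    name: example-gang
    podGroup: workers
  containers:
  - name: worker
    image: registry.example/worker:latest   # placeholder image
# Step 1: with only worker-0 and worker-1 created, both are held at PreEnqueue,
#         because fewer than minCount (3) Pods exist for the group.
# Step 2: once worker-2 is created, all three enter the active scheduling queue;
#         each Pod that is assigned a node waits at WaitOnPermit.
# Step 3: if all three are placed within the fixed 5-minute timeout, they are bound
#         together; otherwise none is bound and they move to the unschedulable queue.
```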
|
|
||
| ## {{% heading "whatsnext" %}} | ||
|
|
||
| * Learn about the [Workload API](/docs/concepts/workloads/workload-api/). | ||
| * See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod. | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -66,6 +66,18 @@ of Kubernetes' core. For example, if you wanted to run a group of Pods for your | |||||
| stop work unless _all_ the Pods are available (perhaps for some high-throughput distributed task), | ||||||
| then you can implement or install an extension that does provide that feature. | ||||||
|
|
||||||
| ## Workload placement | ||||||
|
|
||||||
| {{< feature-state feature_gate_name="GenericWorkload" >}} | ||||||
|
|
||||||
| While standard workload resources (like Deployments and Jobs) manage the lifecycle of Pods, | ||||||
| you may have complex scheduling requirements where groups of Pods must be treated as a single unit. | ||||||
|
|
||||||
| The [Workload API](/docs/concepts/workloads/workload-api/) allows you to define a group of Pods | ||||||
| and apply advanced scheduling policies to them, such as [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/). | ||||||
| This is particularly useful for batch processing and machine learning workloads | ||||||
| where "all-or-nothing" placement is required. | ||||||
|
Member
I think placement is OK, TBH.
||||||
|
|
||||||
| ## {{% heading "whatsnext" %}} | ||||||
|
|
||||||
| As well as reading about each API kind for workload management, you can read how to | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,80 @@ | ||||||
| --- | ||||||
| title: Workload Reference | ||||||
| content_type: concept | ||||||
| weight: 90 | ||||||
| --- | ||||||
|
|
||||||
| <!-- overview --> | ||||||
| {{< feature-state feature_gate_name="GenericWorkload" >}} | ||||||
|
|
||||||
| You can link a Pod to a [Workload](/docs/concepts/workloads/workload-api/) object | ||||||
| to indicate that the Pod belongs to a larger application or group. This enables the scheduler to make decisions | ||||||
| based on the group's requirements rather than treating the Pod as an independent entity. | ||||||
|
|
||||||
| <!-- body --> | ||||||
|
|
||||||
| ## Specifying a Workload reference | ||||||
|
|
||||||
| When the [`GenericWorkload`](/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload) | ||||||
| feature gate is enabled, you can use the `spec.workloadRef` field in your Pod manifest. | ||||||
| This field establishes a link to a specific pod group defined within a Workload resource | ||||||
| in the same namespace. | ||||||
|
|
||||||
| ```yaml | ||||||
| apiVersion: v1 | ||||||
| kind: Pod | ||||||
| metadata: | ||||||
| name: worker-0 | ||||||
| namespace: some-ns | ||||||
| spec: | ||||||
| workloadRef: | ||||||
| # The name of the Workload object in the same namespace | ||||||
| name: training-job-workload | ||||||
| # The name of the specific pod group inside that Workload | ||||||
| podGroup: workers | ||||||
| ``` | ||||||
| ### Pod group replicas | ||||||
| For more complex scenarios, you can replicate a single pod group into multiple, independent scheduling units. | ||||||
| You achieve this using the `podGroupReplicaKey` field within a Pod's `workloadRef`. This key acts as a label | ||||||
| to create logical subgroups. | ||||||
|
|
||||||
| For example, if you have a pod group with `minCount: 2` and you create four Pods: two with `podGroupReplicaKey: "0"` | ||||||
| and two with `podGroupReplicaKey: "1"`, they will be treated as two independent groups of two Pods. | ||||||
| ```yaml | ||||||
| spec: | ||||||
| workloadRef: | ||||||
| name: training-job-workload | ||||||
| podGroup: workers | ||||||
| # All workers with the replica key "0" will be scheduled together as one group. | ||||||
| podGroupReplicaKey: "0" | ||||||
| ``` | ||||||
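
As a sketch of the four-Pod example above, the manifests below show one complete Pod from each replica. The Pod names and the container image are placeholders; the second Pod of each replica (for instance `worker-0-1` and `worker-1-1`) would be identical apart from its name.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # Scheduled together with the other Pod that carries replica key "0".
    podGroupReplicaKey: "0"
  containers:
  - name: worker
    image: registry.example/worker:latest   # placeholder image
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-1-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # A different key places this Pod in a second, independent group.
    podGroupReplicaKey: "1"
  containers:
  - name: worker
    image: registry.example/worker:latest   # placeholder image
```

With `minCount: 2` on the `workers` group, each replica key needs its own two Pods before that replica is considered for scheduling.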
| ### Behavior | ||||||
| When you define a `workloadRef`, the Pod behaves differently depending on the | ||||||
| [policy](/docs/concepts/workloads/workload-api/policies/) defined in the referenced pod group. | ||||||
|
|
||||||
| * If the referenced group uses the `basic` policy, the workload reference acts primarily as a grouping label. | ||||||
| * If the referenced group uses the `gang` policy | ||||||
| (and the [`GangScheduling`](/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling) feature gate is enabled), | ||||||
| the Pod enters a gang scheduling lifecycle. It will wait for other Pods in the group to be created | ||||||
| and scheduled before binding to a node. | ||||||
|
|
||||||
| ### Missing references | ||||||
|
|
||||||
| The scheduler validates the `workloadRef` before making any placement decisions. | ||||||
|
|
||||||
| If a Pod references a Workload that does not exist, or a pod group that is not defined within that Workload, | ||||||
| the Pod will remain pending. It is not considered for placement until you create the missing Workload object | ||||||
| or recreate it to include the missing `PodGroup` definition. | ||||||
|
Member
For beta, try for this:
Suggested change
Member
We can still update the merged docs, even after docs freeze.
The key thing about the deadline is that we must have docs that are at least good enough ahead of the upcoming release.
||||||
|
|
||||||
| This behavior applies to all Pods with a `workloadRef`, regardless of whether the eventual policy will be `basic` or `gang`, | ||||||
| as the scheduler requires the Workload definition to determine the policy. | ||||||
|
|
||||||
| ## {{% heading "whatsnext" %}} | ||||||
|
|
||||||
| * Learn about the [Workload API](/docs/concepts/workloads/workload-api/). | ||||||
| * Read the details of [pod group policies](/docs/concepts/workloads/workload-api/policies/). | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| --- | ||
| title: "Workload API" | ||
| weight: 20 | ||
| simple_list: true | ||
| --- | ||
|
|
||
| <!-- overview --> | ||
| {{< feature-state feature_gate_name="GenericWorkload" >}} | ||
|
|
||
| The Workload API resource allows you to describe the scheduling requirements and structure of a multi-Pod application. | ||
| While workload controllers provide the runtime behavior for workloads, | ||
| the Workload API provides scheduling constraints for the "true" workloads, such as Jobs. | ||
|
|
||
| <!-- body --> | ||
|
|
||
| ## What is a Workload? | ||
|
|
||
| The Workload API resource is part of the `scheduling.k8s.io/v1alpha1` | ||
| {{< glossary_tooltip text="API group" term_id="api-group" >}} | ||
| (and your cluster must have that API group enabled, as well as the `GenericWorkload` | ||
| [feature gate](/docs/reference/command-line-tools-reference/feature-gates/), | ||
| before you can benefit from this API). | ||
| This resource acts as a structured, machine-readable definition of the scheduling requirements | ||
| of a multi-Pod application. While user-facing workloads like [Jobs](/docs/concepts/workloads/controllers/job/) | ||
| define what to run, the Workload resource determines how a group of Pods should be scheduled | ||
| and how its placement should be managed throughout its lifecycle. | ||
|
|
||
| ## API structure | ||
|
|
||
| A Workload allows you to define a group of Pods and apply a scheduling policy to them. | ||
| It consists of two sections: a list of pod groups and a reference to a controller. | ||
|
|
||
| ### Pod groups | ||
|
|
||
| The `podGroups` list defines the distinct components of your workload. | ||
| For example, a machine learning job might have a `driver` group and a `worker` group. | ||
|
|
||
| Each entry in `podGroups` must have: | ||
| 1. A unique `name` that can be used in the Pod's [Workload reference](/docs/concepts/workloads/pods/workload-reference/). | ||
| 2. A [scheduling policy](/docs/concepts/workloads/workload-api/policies/) (`basic` or `gang`). | ||
|
|
||
| ```yaml | ||
| apiVersion: scheduling.k8s.io/v1alpha1 | ||
| kind: Workload | ||
| metadata: | ||
| name: training-job-workload | ||
| namespace: some-ns | ||
| spec: | ||
| controllerRef: | ||
| apiGroup: batch | ||
| kind: Job | ||
| name: training-job | ||
| podGroups: | ||
| - name: workers | ||
| policy: | ||
| gang: | ||
| # The gang is schedulable only if 4 pods can run at once | ||
| minCount: 4 | ||
| ``` | ||
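
The driver and worker split mentioned earlier could look like the sketch below. The object names, the choice of `basic` for the driver, and the `minCount` value are illustrative assumptions, not a recommendation.

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: ml-job-workload         # hypothetical name
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: ml-job                # hypothetical Job that owns the Pods
  podGroups:
  # The driver Pod is scheduled with the standard per-Pod behavior.
  - name: driver
    policy:
      basic: {}
  # Workers are bound only if at least 8 of them can be placed together.
  - name: worker
    policy:
      gang:
        minCount: 8
```

A Pod joins one of these groups by setting `podGroup: driver` or `podGroup: worker` in its `workloadRef`.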
| ### Referencing a workload controlling object | ||
| The `controllerRef` field links the Workload back to the specific high-level object defining the application, | ||
| such as a [Job](/docs/concepts/workloads/controllers/job/) or a custom resource. This is useful for observability and tooling. | ||
| This data is not used to schedule or manage the Workload. | ||
|
|
||
| ## {{% heading "whatsnext" %}} | ||
|
|
||
| * See how to [reference a Workload](/docs/concepts/workloads/pods/workload-reference/) in a Pod. | ||
| * Learn about [pod group policies](/docs/concepts/workloads/workload-api/policies/). | ||
| * Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm. |
|
Member
nit: should update https://kubernetes.io/docs/concepts/policy/ to hyperlink here
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,57 @@ | ||||||
| --- | ||||||
| title: Pod Group Policies | ||||||
| content_type: concept | ||||||
| weight: 10 | ||||||
| --- | ||||||
|
|
||||||
| <!-- overview --> | ||||||
| {{< feature-state feature_gate_name="GenericWorkload" >}} | ||||||
|
|
||||||
| Every pod group defined in a [Workload](/docs/concepts/workloads/workload-api/) | ||||||
| must declare a scheduling policy. This policy dictates how the scheduler treats the collection of Pods. | ||||||
|
|
||||||
| <!-- body --> | ||||||
|
|
||||||
| ## Policy types | ||||||
|
|
||||||
| The API currently supports two policy types: `basic` and `gang`. | ||||||
| You must specify exactly one policy for each group. | ||||||
|
|
||||||
| ### Basic policy | ||||||
|
|
||||||
| The `basic` policy instructs the scheduler to treat all Pods in the group as independent entities, | ||||||
| scheduling them using the standard Kubernetes behavior. | ||||||
|
|
||||||
| The main reason to use the `basic` policy is to organize the Pods within your Workload | ||||||
| for better observability and management. | ||||||
|
|
||||||
| This policy can be used for groups of a Workload that do not require simultaneous startup | ||||||
| but logically belong to the application, or to open the way for future group constraints | ||||||
| that do not imply "all-or-nothing" placement. | ||||||
|
|
||||||
| ```yaml | ||||||
| policy: | ||||||
| basic: {} | ||||||
| ``` | ||||||
| ### Gang policy | ||||||
| The `gang` policy enforces "all-or-nothing" scheduling. This is essential for tightly coupled workloads | ||||||
| where partial startup results in deadlocks or wasted resources. | ||||||
|
||||||
|
|
||||||
| This can be used for [Jobs](/docs/concepts/workloads/controllers/job/) | ||||||
| or any other batch process where all workers must run concurrently to make progress. | ||||||
|
|
||||||
| The `gang` policy requires a `minCount` parameter: | ||||||
|
|
||||||
| ```yaml | ||||||
| policy: | ||||||
| gang: | ||||||
| # The number of Pods that must be schedulable simultaneously | ||||||
| # for the group to be admitted. | ||||||
| minCount: 4 | ||||||
| ``` | ||||||
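
As an illustration of pairing a Job with a `gang` group, the sketch below sets `workloadRef` in the Job's Pod template so that every Pod the Job creates refers to the `workers` group. It assumes that the field can be set through a Pod template like any other Pod `spec` field, and that you create the matching Workload yourself. The image is a placeholder.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: some-ns
spec:
  # Run four workers at once, matching minCount: 4 in the gang policy.
  parallelism: 4
  completions: 4
  template:
    spec:
      workloadRef:
        # Workload in the same namespace that defines the "workers" group.
        name: training-job-workload
        podGroup: workers
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/trainer:latest   # placeholder image
```

Keeping `parallelism` at or above `minCount` matters here: if the Job creates fewer Pods than `minCount`, the group never reaches quorum and its Pods stay held before the scheduling queue.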
|
|
||||||
| ## {{% heading "whatsnext" %}} | ||||||
|
|
||||||
| * Read about the [gang scheduling](/docs/concepts/scheduling-eviction/gang-scheduling/) algorithm. | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| --- | ||
| title: GangScheduling | ||
| content_type: feature_gate | ||
| _build: | ||
| list: never | ||
| render: false | ||
|
|
||
| stages: | ||
| - stage: alpha | ||
| defaultValue: false | ||
| fromVersion: "1.35" | ||
| --- | ||
|
|
||
| Enables the GangScheduling plugin in kube-scheduler, which implements an "all-or-nothing" | ||
| scheduling algorithm. The [Workload API](/docs/concepts/workloads/workload-api/) is used | ||
| to express the requirements. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| --- | ||
| title: GenericWorkload | ||
| content_type: feature_gate | ||
| _build: | ||
| list: never | ||
| render: false | ||
|
|
||
| stages: | ||
| - stage: alpha | ||
| defaultValue: false | ||
| fromVersion: "1.35" | ||
| --- | ||
|
|
||
| Enables support for the [Workload API](/docs/concepts/workloads/workload-api/) to express scheduling requirements at the workload level. | ||
|
|
||
| When enabled, Pods can reference a specific pod group and use this to influence | ||
| the way that they are scheduled. |