Commit 0a515a1

Update the PodGroup API proposal
1 parent 48430c2 commit 0a515a1

1 file changed

+85
-90
lines changed
  • keps/sig-scheduling/4671-gang-scheduling

keps/sig-scheduling/4671-gang-scheduling/README.md

Lines changed: 85 additions & 90 deletions
@@ -106,7 +106,8 @@ The `Workload` object will allow kube-scheduler to be aware that pods are part o
106106
- Implement the first version of `Workload` API necessary for defining a Gang
107107
- Ensuring that we can extend the `Workload` API in a backward-compatible way toward the north-star API
108108
- Ensuring that `Workload` API will be usable for both built-in and third-party workload controllers and APIs
109-
- Implement first version of gang-scheduling in kube-scheduler
109+
- Implement the first version of gang-scheduling in kube-scheduler, supporting (potentially in a non-optimal way)
110+
all existing scheduling features.
110111
- Provide full backward compatibility for all existing scheduling features
111112

112113
### Non-Goals
@@ -117,6 +118,7 @@ The `Workload` object will allow kube-scheduler to be aware that pods are part o
117118

118119
The following are non-goals for this KEP but will probably soon appear to be goals for follow-up KEPs:
119120

121+
- Integrate cluster autoscaling with gang scheduling.
120122
- Introduce a concept of `Reservation` that can be later consumed by pods.
121123
- Workload-level preemption.
122124
- Address resource contention between different schedulers (including possible deadlocks).
@@ -177,12 +179,11 @@ metadata:
177179
namespace: ns-1
178180
name: job-1
179181
spec:
180-
podGroups: # or gangGroups -- TBD
182+
podGroups:
181183
- name: "pg1"
182-
gangMode: Single
183-
gangSchedulingPolicy:
184-
minCount: 100
185-
schedulingTimeoutSeconds: 60
184+
policy:
185+
gang:
186+
minCount: 100
186187
```
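As a rough illustration of how the reshaped `policy.gang.minCount` field could be read, here is a minimal Go sketch. `GangSchedulingPolicy` and `MinCount` come from the diff; the wrapper type `PodGroupPolicy`, the plain `string` type for `Name`, and the helper `hasQuorum` are hypothetical names introduced only for this example, and treating a missing gang policy as "no quorum required" is an assumption, not something the KEP states.

```go
package main

import "fmt"

// GangSchedulingPolicy matches the struct shown in this diff.
type GangSchedulingPolicy struct {
	MinCount *int
}

// PodGroupPolicy is a hypothetical wrapper mirroring the YAML key `policy`.
// Only the `gang` branch is modelled here.
type PodGroupPolicy struct {
	Gang *GangSchedulingPolicy
}

// PodGroup is simplified for illustration (the diff declares Name as *string).
type PodGroup struct {
	Name   string
	Policy PodGroupPolicy
}

// hasQuorum is a hypothetical helper: it reports whether enough pods of the
// group have been observed to attempt scheduling them together.
// Assumption: a group without a gang policy (or without MinCount) has no
// quorum requirement.
func hasQuorum(pg PodGroup, observedPods int) bool {
	gang := pg.Policy.Gang
	if gang == nil || gang.MinCount == nil {
		return true
	}
	return observedPods >= *gang.MinCount
}

func main() {
	minCount := 100
	pg := PodGroup{
		Name:   "pg1",
		Policy: PodGroupPolicy{Gang: &GangSchedulingPolicy{MinCount: &minCount}},
	}
	fmt.Println(hasQuorum(pg, 99))
	fmt.Println(hasQuorum(pg, 100))
}
```

Nesting the gang options under a `policy` key, rather than a separate `gangMode` enum, means the presence of the `gang` entry itself signals that gang scheduling is requested.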
187188

188189

@@ -223,12 +224,9 @@ usecases. You can read more about it in the [extended proposal] document.
223224
* `Workload` is the resource Kind.
224225
* `scheduling` is the ApiGroup.
225226
* `spec.workload` is the name of the new field in pod.
226-
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named
227-
<<[UNRESOLVED community feedback requested]>> `PodGroup` or `GangGroup` for the top level. <<[/UNRESOLVED]>>.
228-
* In a future , we expect that this group can optionally specify further subdivision into sub groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These subgroups are called
229-
<<[UNRESOLVED depending on previous unresolved item]>> `PodSubGroup` if `PodGroup` is chosen, or else `RankedGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
230-
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called
231-
<<[UNRESOLVED depending on a previous unresolved item]>> `PodSet` if `PodGroup` is chosen, or else `EqGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
227+
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named `PodGroup`.
228+
* In the future, we expect that this group can optionally specify further subdivision into sub-groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These sub-groups are called `PodSubGroup`.
229+
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called `PodSet`.
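
The "from 0 to N, without repeats or gaps" rule for sub-group indexes can be checked mechanically. The sketch below is only an illustration: the function name `validateDenseIndexes` is hypothetical, and the KEP does not define any such validation routine at this stage.

```go
package main

import "fmt"

// validateDenseIndexes is a hypothetical check that the given indexes are
// exactly 0..len(indexes)-1 in some order. With n values that are all in
// [0, n) and all distinct, no value in that range can be missing, so
// checking range and uniqueness also rules out gaps.
func validateDenseIndexes(indexes []int) error {
	seen := make([]bool, len(indexes))
	for _, idx := range indexes {
		if idx < 0 || idx >= len(indexes) {
			return fmt.Errorf("index %d leaves a gap or is out of range [0, %d)", idx, len(indexes))
		}
		if seen[idx] {
			return fmt.Errorf("index %d is repeated", idx)
		}
		seen[idx] = true
	}
	return nil
}

func main() {
	fmt.Println(validateDenseIndexes([]int{2, 0, 1})) // dense, any order
	fmt.Println(validateDenseIndexes([]int{0, 2}))    // gap at 1
	fmt.Println(validateDenseIndexes([]int{0, 0}))    // repeat
}
```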
232230

233231
### Associating Pod into PodGroups
234232

@@ -244,8 +242,8 @@ and is defined as following:
244242
// that a Pod belongs to. The scheduler uses this information to enforce
245243
// gang scheduling semantics.
246244
type WorkloadReference struct {
247-
// Workload defines the name of the Workload object this pod belongs to.
248-
Workload string
245+
// Name defines the name of the Workload object this pod belongs to.
246+
Name string
249247
250248
// PodGroup defines the name of the PodGroup within a Workload this pod belongs to.
251249
PodGroup string
@@ -272,13 +270,12 @@ kind: Workload
272270
metadata:
273271
name: jobset
274272
spec:
275-
podGroups: # or gangGroups -- TBD
273+
podGroups:
276274
- name: "job-1"
277-
gangMode: Replicated
278275
replicas: 4
279-
gangSchedulingPolicy:
280-
minCount: 100
281-
schedulingTimeoutSeconds: 60
276+
policy:
277+
gang:
278+
minCount: 100
282279
```
283280

284281
```yaml
@@ -291,7 +288,7 @@ spec:
291288
workload:
292289
name: jobset
293290
podGroup: job-1
294-
podGroupReplica: 2
291+
podGroupReplicaIndex: 2
295292
...
296293
297294
```
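Because the KEP treats individual `PodGroups` and `PodGroup` replicas as independent gangs, a pod's gang can be identified by its namespace plus the three reference fields shown above. The following Go sketch illustrates that grouping. The `pod` struct, the `gangKey` type, the `groupByGang` helper, and the plain `int` type for `PodGroupReplicaIndex` are assumptions made for this example and are not taken from the KEP.

```go
package main

import "fmt"

// WorkloadReference mirrors the fields named in this diff.
// Assumption: PodGroupReplicaIndex is modelled as a plain int.
type WorkloadReference struct {
	Name                 string
	PodGroup             string
	PodGroupReplicaIndex int
}

// pod is a hypothetical, minimal stand-in for a real Pod object.
type pod struct {
	Namespace string
	Name      string
	Workload  *WorkloadReference // nil when the pod belongs to no Workload
}

// gangKey identifies one independent gang. All fields are comparable,
// so the struct can be used directly as a map key.
type gangKey struct {
	Namespace string
	Ref       WorkloadReference
}

// groupByGang buckets pod names by gang; pods without a workload
// reference are skipped.
func groupByGang(pods []pod) map[gangKey][]string {
	gangs := make(map[gangKey][]string)
	for _, p := range pods {
		if p.Workload == nil {
			continue
		}
		key := gangKey{Namespace: p.Namespace, Ref: *p.Workload}
		gangs[key] = append(gangs[key], p.Name)
	}
	return gangs
}

func main() {
	ref := func(replica int) *WorkloadReference {
		return &WorkloadReference{Name: "jobset", PodGroup: "job-1", PodGroupReplicaIndex: replica}
	}
	pods := []pod{
		{Namespace: "ns-1", Name: "a", Workload: ref(2)},
		{Namespace: "ns-1", Name: "b", Workload: ref(2)},
		{Namespace: "ns-1", Name: "c", Workload: ref(3)},
		{Namespace: "ns-1", Name: "standalone"},
	}
	gangs := groupByGang(pods)
	fmt.Println(len(gangs))
	fmt.Println(gangs[gangKey{Namespace: "ns-1", Ref: *ref(2)}])
}
```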
@@ -335,60 +332,8 @@ type WorkloadSpec struct {
335332
PodGroups []PodGroup
336333
}
337334
338-
type GangMode string
339-
const (
340-
// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
341-
GangModeOff GangMode = "Off"
342-
343-
// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
344-
GangModeSingle GangMode = "Single"
345-
346-
// GangModeReplicatedGang means that there is a variable number of identical copies of this PodGroup,
347-
// as specified in Replicas, and each copy needs to be independently gang scheduled.
348-
GangModeReplicated GangMode = "Replicated"
349-
)
350-
351-
// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
352-
type GangSchedulingPolicy struct {
353-
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
354-
// Namely it's timeout from the moment when `minCount` pods show up in
355-
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
356-
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
357-
// If the timeout is hit, we reject all the waiting pods, free the resources
358-
// they were reserving and put all of them back to scheduling queue.
359-
SchedulingTimeoutSeconds *int
360-
MinCount *int
361-
}
362-
363-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
364-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
365-
// number of identical copies.
366-
//
367-
// TODO: Decide on the naming: PodGroup vs GangGroup.
368-
type PodGroup struct {
369-
Name *string
370-
GangMode *GangMode // default is "Off"
371-
372-
// Optional when GangMode = "ReplicatedGang".
373-
// Forbidden otherwise.
374-
Replicas int
375-
376-
// GangSchedulingPolicy defines the options applying to all pods in this gang.
377-
// Forbidden if GangMode is set to "Off".
378-
GangSchedulingPolicy GangSchedulingPolicy
379-
}
380-
381-
382-
type WorkloadStatus struct {
383-
// Necessary status fields TBD.
384-
}
385-
```
386-
387-
We also consider an alternative API design for PodGroup as following:
388-
389-
```go
390-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
391-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
335+
// PodGroup is a group of pods that may contain multiple shapes (PodSets) and may contain
336+
// multiple dense indexes (PodSubGroups) and which can optionally be replicated in a variable
392337
// number of identical copies.
393338
type PodGroup struct {
394339
Name *string
@@ -421,16 +366,12 @@ type DefaultSchedulingPolicy struct {
421366
// GangSchedulingPolicy represents options for how gang scheduling of one
422367
// PodGroup should be handled.
423368
type GangSchedulingPolicy struct {
424-
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
425-
// Namely it's timeout from the moment when `minCount` pods show up in
426-
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
427-
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
428-
// If the timeout is hit, we reject all the waiting pods, free the resources
429-
// they were reserving and put all of them back to scheduling queue.
430-
SchedulingTimeoutSeconds *int
431-
432369
MinCount *int
433370
}
371+
372+
type WorkloadStatus struct {
373+
// Necessary status fields TBD.
374+
}
434375
```
435376

436377
The individual `PodGroups` and `PodGroup` replicas are treated as independent gangs. As an example, if one of
@@ -557,7 +498,6 @@ N/A
557498
- `k8s.io/kubernetes/pkg/scheduler`: `2025-10-02` - 81.7%
558499
- `k8s.io/kubernetes/pkg/scheduler/backend/queue`: `2025-10-02` - 91.4%
559500
- `k8s.io/kubernetes/pkg/scheduler/framework`: `2025-10-02` - 81.7%
560-
- `k8s.io/kubernetes/pkg/scheduler/framework`: `2025-10-02` - 81.7%
561501
- `k8s.io/kubernetes/pkg/scheduler/framework/preemption`: `2025-10-02` - 64.2%
562502
- `k8s.io/kubernetes/pkg/scheduler/framework/util/assumecache`: `2025-10-02` - 86.2%
563503

@@ -573,6 +513,8 @@ This can be done with:
573513
- permalinks to the GitHub source code
574514
- links to the periodic job (typically https://testgrid.k8s.io/sig-release-master-blocking#integration-master), filtered by the test name
575515
- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
516+
517+
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
576518
-->
577519

578520
We will create integration test(s) to ensure basic functionalities of gang-scheduling including:
@@ -582,8 +524,6 @@ We will create integration test(s) to ensure basic functionalities of gang-sched
582524

583525
In Beta, we will add tests to verify that deadlocks are not happening.
584526

585-
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
586-
587527
##### e2e tests
588528

589529
<!--
@@ -599,13 +539,13 @@ This can be done with:
599539

600540
We expect no non-infra related flakes in the last month as a GA graduation criteria.
601541
If e2e tests are not necessary or useful, explain why.
542+
543+
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
602544
-->
603545

604546
We will add basic API tests for the new `Workload` API, that will later be
605547
promoted to conformance.
606548

607-
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
608-
609549
### Graduation Criteria
610550

611551
#### Alpha
@@ -959,10 +899,12 @@ Major milestones might include:
959899
## Drawbacks
960900

961901
There are already multiple implementations of gang scheduling in the ecosystem.
902+
However:
903+
- the other implementations don't address all the issues (e.g. different kinds of
904+
races/deadlocks) that this proposal paves the way for addressing
905+
- the introduced concepts are fundamental enough in the AI era that we believe that
906+
our users shouldn't need to install any extensions to have them addressed
962907

963-
<!--
964-
Why should this KEP _not_ be implemented?
965-
-->
966908

967909
## Alternatives
968910

@@ -971,6 +913,59 @@ above described approach can be found in the [extended proposal] document.
971913

972914
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
973915

916+
It is worth noting that we started the KEP with a different API definition of
917+
`PodGroup`, but based on the community discussions and feedback decided to change it.
918+
The original API definition for `PodGroup` was as follows:
919+
920+
```go
921+
type GangMode string
922+
const (
923+
// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
924+
GangModeOff GangMode = "Off"
925+
926+
// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
927+
GangModeSingle GangMode = "Single"
928+
929+
// GangModeReplicated means that there is a variable number of identical copies of this PodGroup,
930+
// as specified in Replicas, and each copy needs to be independently gang scheduled.
931+
GangModeReplicated GangMode = "Replicated"
932+
)
933+
934+
// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
935+
type GangSchedulingPolicy struct {
936+
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
937+
// Namely, it's the timeout from the moment when the first pod shows up in
938+
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
939+
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
940+
// If the timeout is hit, we reject all the waiting pods, free the resources
941+
// they were reserving and put all of them back to scheduling queue.
942+
//
943+
// We decided to drop the field for Alpha because:
944+
// 1) it won't be obvious for the majority of users how to set it
945+
// 2) its usefulness after Beta is unclear - see:
946+
// https://github.com/kubernetes/enhancements/pull/5558#discussion_r2400876903
947+
SchedulingTimeoutSeconds *int
948+
MinCount *int
949+
}
950+
951+
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
952+
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
953+
// number of identical copies.
954+
//
955+
// TODO: Decide on the naming: PodGroup vs GangGroup.
956+
type PodGroup struct {
957+
Name *string
958+
GangMode *GangMode // default is "Off"
959+
960+
// Optional when GangMode = "Replicated".
961+
// Forbidden otherwise.
962+
Replicas int
963+
964+
// GangSchedulingPolicy defines the options applying to all pods in this gang.
965+
// Forbidden if GangMode is set to "Off".
966+
GangSchedulingPolicy GangSchedulingPolicy
967+
}
968+
```
974969

975970
## Infrastructure Needed (Optional)
976971
