Commit 2dc00c6

Update the PodGroup API proposal
1 parent 48430c2 commit 2dc00c6

1 file changed: +77 -79 lines changed
keps/sig-scheduling/4671-gang-scheduling/README.md

Lines changed: 77 additions & 79 deletions
@@ -177,12 +177,12 @@ metadata:
177177
namespace: ns-1
178178
name: job-1
179179
spec:
180-
podGroups: # or gangGroups -- TBD
180+
podGroups:
181181
- name: "pg1"
182-
gangMode: Single
183-
gangSchedulingPolicy:
184-
minCount: 100
185-
schedulingTimeoutSeconds: 60
182+
policy:
183+
gang:
184+
minCount: 100
185+
schedulingTimeoutSeconds: 60
186186
```
187187

188188

@@ -224,11 +224,8 @@ usecases. You can read more about it in the [extended proposal] document.
224224
* `scheduling` is the ApiGroup.
225225
* `spec.workload` is the name of the new field in pod.
226226
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named
227-
<<[UNRESOLVED community feedback requested]>> `PodGroup` or `GangGroup` for the top level. <<[/UNRESOLVED]>>.
228227
* In the future, we expect that this group can optionally specify further subdivision into sub-groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These sub-groups are called
229-
<<[UNRESOLVED depending on previous unresolved item]>> `PodSubGroup` if `PodGroup` is chosen, or else `RankedGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
230228
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called
231-
<<[UNRESOLVED depending on a previous unresolved item]>> `PodSet` if `PodGroup` is chosen, or else `EqGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
232229

233230
### Associating Pod into PodGroups
234231

@@ -244,8 +241,8 @@ and is defined as following:
244241
// that a Pod belongs to. The scheduler uses this information to enforce
245242
// gang scheduling semantics.
246243
type WorkloadReference struct {
247-
// Workload defines the name of the Workload object this pod belongs to.
248-
Workload string
244+
// Name defines the name of the Workload object this pod belongs to.
245+
Name string
249246
250247
// PodGroup defines the name of the PodGroup within a Workload this pod belongs to.
251248
PodGroup string
@@ -272,13 +269,13 @@ kind: Workload
272269
metadata:
273270
name: jobset
274271
spec:
275-
podGroups: # or gangGroups -- TBD
272+
podGroups:
276273
- name: "job-1"
277-
gangMode: Replicated
278274
replicas: 4
279-
gangSchedulingPolicy:
280-
minCount: 100
281-
schedulingTimeoutSeconds: 60
275+
policy:
276+
gang:
277+
minCount: 100
278+
schedulingTimeoutSeconds: 60
282279
```
283280

284281
```yaml
@@ -291,7 +288,7 @@ spec:
291288
workload:
292289
name: jobset
293290
podGroup: job-1
294-
podGroupReplica: 2
291+
podGroupReplicaIndex: 2
295292
...
296293
297294
```
@@ -335,60 +332,8 @@ type WorkloadSpec struct {
335332
PodGroups []PodGroup
336333
}
337334
338-
type GangMode string
339-
const (
340-
// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
341-
GangModeOff GangMode = "Off"
342-
343-
// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
344-
GangModeSingle GangMode = "Single"
345-
346-
// GangModeReplicatedGang means that there is a variable number of identical copies of this PodGroup,
347-
// as specified in Replicas, and each copy needs to be independently gang scheduled.
348-
GangModeReplicated GangMode = "Replicated"
349-
)
350-
351-
// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
352-
type GangSchedulingPolicy struct {
353-
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
354-
// Namely it's timeout from the moment when `minCount` pods show up in
355-
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
356-
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
357-
// If the timeout is hit, we reject all the waiting pods, free the resources
358-
// they were reserving and put all of them back to scheduling queue.
359-
SchedulingTimeoutSeconds *int
360-
MinCount *int
361-
}
362-
363-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
364-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
365-
// number of identical copies.
366-
//
367-
// TODO: Decide on the naming: PodGroup vs GangGroup.
368-
type PodGroup struct {
369-
Name *string
370-
GangMode *GangMode // default is "Off"
371-
372-
// Optional when GangMode = "ReplicatedGang".
373-
// Forbidden otherwise.
374-
Replicas int
375-
376-
// GangSchedulingPolicy defines the options applying to all pods in this gang.
377-
// Forbidden if GangMode is set to "Off".
378-
GangSchedulingPolicy GangSchedulingPolicy
379-
}
380-
381-
382-
type WorkloadStatus struct {
383-
// Necessary status fields TBD.
384-
}
385-
```
386-
387-
We also consider an alternative API design for PodGroup as following:
388-
389-
```go
390-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
391-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
335+
// PodGroup is a group of pods that may contain multiple shapes (PodSets) and may contain
336+
// multiple dense indexes (PodSubGroups) and which can optionally be replicated in a variable
392337
// number of identical copies.
393338
type PodGroup struct {
394339
Name *string
@@ -422,7 +367,7 @@ type DefaultSchedulingPolicy struct {
422367
// PodGroup should be handled.
423368
type GangSchedulingPolicy struct {
424369
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
425-
// Namely it's timeout from the moment when `minCount` pods show up in
370+
// Namely, it is the timeout from the moment when the first pod shows up in
426371
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
427372
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
428373
// If the timeout is hit, we reject all the waiting pods, free the resources
@@ -431,6 +376,10 @@ type GangSchedulingPolicy struct {
431376
432377
MinCount *int
433378
}
379+
380+
type WorkloadStatus struct {
381+
// Necessary status fields TBD.
382+
}
434383
```
435384

436385
The individual `PodGroups` and `PodGroup` replicas are treated as independent gangs. As an example, if one of
@@ -557,7 +506,6 @@ N/A
557506
- `k8s.io/kubernetes/pkg/scheduler`: `2025-10-02` - 81.7%
558507
- `k8s.io/kubernetes/pkg/scheduler/backend/queue`: `2025-10-02` - 91.4%
559508
- `k8s.io/kubernetes/pkg/scheduler/framework`: `2025-10-02` - 81.7%
560-
- `k8s.io/kubernetes/pkg/scheduler/framework`: `2025-10-02` - 81.7%
561509
- `k8s.io/kubernetes/pkg/scheduler/framework/preemption`: `2025-10-02` - 64.2%
562510
- `k8s.io/kubernetes/pkg/scheduler/framework/util/assumecache`: `2025-10-02` - 86.2%
563511

@@ -573,6 +521,8 @@ This can be done with:
573521
- permalinks to the GitHub source code
574522
- links to the periodic job (typically https://testgrid.k8s.io/sig-release-master-blocking#integration-master), filtered by the test name
575523
- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
524+
525+
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
576526
-->
577527

578528
We will create integration test(s) to ensure basic functionalities of gang-scheduling including:
@@ -582,8 +532,6 @@ We will create integration test(s) to ensure basic functionalities of gang-sched
582532

583533
In Beta, we will add tests to verify that deadlocks are not happening.
584534

585-
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
586-
587535
##### e2e tests
588536

589537
<!--
@@ -599,13 +547,13 @@ This can be done with:
599547

600548
We expect no non-infra related flakes in the last month as a GA graduation criteria.
601549
If e2e tests are not necessary or useful, explain why.
550+
551+
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
602552
-->
603553

604554
We will add basic API tests for the new `Workload` API that will later be
605555
promoted to conformance.
606556

607-
- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
608-
609557
### Graduation Criteria
610558

611559
#### Alpha
@@ -959,10 +907,12 @@ Major milestones might include:
959907
## Drawbacks
960908

961909
There are already multiple implementations of gang scheduling in the ecosystem.
910+
However:
911+
- the other implementations don't address all the issues (e.g. different kinds of
912+
races/deadlocks) that this proposal paves the way for addressing
913+
- the introduced concepts are fundamental enough in the AI era that we believe that
914+
our users shouldn't need to install any extensions to have them addressed
962915

963-
<!--
964-
Why should this KEP _not_ be implemented?
965-
-->
966916

967917
## Alternatives
968918

@@ -971,6 +921,54 @@ above described approach can be found in the [extended proposal] document.
971921

972922
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
973923

924+
It is worth noting that we started the KEP with a different API definition of
925+
`PodGroup` but, based on community discussions and feedback, decided to change it.
926+
The original API definition for `PodGroup` was as follows:
927+
928+
```go
929+
type GangMode string
930+
const (
931+
// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
932+
GangModeOff GangMode = "Off"
933+
934+
// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
935+
GangModeSingle GangMode = "Single"
936+
937+
// GangModeReplicated means that there is a variable number of identical copies of this PodGroup,
938+
// as specified in Replicas, and each copy needs to be independently gang scheduled.
939+
GangModeReplicated GangMode = "Replicated"
940+
)
941+
942+
// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
943+
type GangSchedulingPolicy struct {
944+
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
945+
// Namely, it is the timeout from the moment when the first pod shows up in
946+
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
947+
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
948+
// If the timeout is hit, we reject all the waiting pods, free the resources
949+
// they were reserving and put all of them back to scheduling queue.
950+
SchedulingTimeoutSeconds *int
951+
MinCount *int
952+
}
953+
954+
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
955+
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
956+
// number of identical copies.
957+
//
958+
// TODO: Decide on the naming: PodGroup vs GangGroup.
959+
type PodGroup struct {
960+
Name *string
961+
GangMode *GangMode // default is "Off"
962+
963+
// Optional when GangMode = "Replicated".
964+
// Forbidden otherwise.
965+
Replicas int
966+
967+
// GangSchedulingPolicy defines the options applying to all pods in this gang.
968+
// Forbidden if GangMode is set to "Off".
969+
GangSchedulingPolicy GangSchedulingPolicy
970+
}
971+
```
974972

975973
## Infrastructure Needed (Optional)
976974
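
The reshaped API in this commit (a `policy` block with a nested `gang` entry replacing `gangMode` plus `gangSchedulingPolicy`, and `podGroupReplicaIndex` on the pod side) can be sketched in Go as below. This is a minimal illustration, not the KEP's actual type definitions. The diff elides most of the new `PodGroup` struct, so `PodGroupPolicy`, its `Default` and `Gang` field names, the "exactly one policy must be set" rule, the `minCount >= 1` check, the `int` type of `PodGroupReplicaIndex`, and the helpers `validatePodGroup` and `gangKey` are all assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// GangSchedulingPolicy mirrors the two fields shown in the diff.
type GangSchedulingPolicy struct {
	SchedulingTimeoutSeconds *int
	MinCount                 *int
}

// DefaultSchedulingPolicy is named in a hunk header; its fields are not
// shown in the diff, so it is left empty here.
type DefaultSchedulingPolicy struct{}

// PodGroupPolicy is a HYPOTHETICAL wrapper inferred from the YAML
// `policy: { gang: {...} }`; the real type and field names may differ.
type PodGroupPolicy struct {
	Default *DefaultSchedulingPolicy
	Gang    *GangSchedulingPolicy
}

// PodGroup keeps only the fields needed for this sketch.
type PodGroup struct {
	Name   *string
	Policy PodGroupPolicy
}

// WorkloadReference follows the renamed fields from the diff. The type of
// PodGroupReplicaIndex is an assumption based on `podGroupReplicaIndex: 2`.
type WorkloadReference struct {
	Name                 string
	PodGroup             string
	PodGroupReplicaIndex int
}

// validatePodGroup is a hypothetical check: it ASSUMES exactly one policy
// variant must be set and that a gang needs a positive minCount.
func validatePodGroup(pg PodGroup) error {
	if pg.Name == nil || *pg.Name == "" {
		return errors.New("name is required")
	}
	p := pg.Policy
	if (p.Default == nil) == (p.Gang == nil) {
		return errors.New("exactly one of policy.default or policy.gang must be set")
	}
	if p.Gang != nil && (p.Gang.MinCount == nil || *p.Gang.MinCount < 1) {
		return errors.New("policy.gang.minCount must be at least 1")
	}
	return nil
}

// gangKey is a hypothetical helper: pods sharing a key would belong to the
// same gang, so each PodGroup replica forms an independent gang.
func gangKey(ref WorkloadReference) string {
	return fmt.Sprintf("%s/%s/%d", ref.Name, ref.PodGroup, ref.PodGroupReplicaIndex)
}

func main() {
	name, minCount, timeout := "job-1", 100, 60
	pg := PodGroup{
		Name: &name,
		Policy: PodGroupPolicy{Gang: &GangSchedulingPolicy{
			MinCount:                 &minCount,
			SchedulingTimeoutSeconds: &timeout,
		}},
	}
	fmt.Println("valid:", validatePodGroup(pg) == nil)
	fmt.Println("gang key:", gangKey(WorkloadReference{
		Name: "jobset", PodGroup: "job-1", PodGroupReplicaIndex: 2,
	}))
}
```

Modelling the policy as a struct of optional pointers, rather than a `GangMode` enum with conditionally forbidden sibling fields, keeps each variant's options next to the variant itself. That appears to be the motivation for the change, although the diff does not state it.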
