keps/sig-scheduling/4671-gang-scheduling/README.md
+77 -79 (77 additions & 79 deletions)
@@ -177,12 +177,12 @@ metadata:
177
177
namespace: ns-1
178
178
name: job-1
179
179
spec:
180
-
podGroups: # or gangGroups -- TBD
180
+
podGroups:
181
181
- name: "pg1"
182
-
gangMode: Single
183
-
gangSchedulingPolicy:
184
-
minCount: 100
185
-
schedulingTimeoutSeconds: 60
182
+
policy:
183
+
gang:
184
+
minCount: 100
185
+
schedulingTimeoutSeconds: 60
186
186
```
187
187
188
188
@@ -224,11 +224,8 @@ usecases. You can read more about it in the [extended proposal] document.
224
224
* `scheduling` is the ApiGroup.
225
225
* `spec.workload` is the name of the new field in pod.
226
226
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named
227
-
<<[UNRESOLVED community feedback requested]>> `PodGroup` or `GangGroup` for the top level. <<[/UNRESOLVED]>>.
228
227
* In the future, we expect that this group can optionally specify further subdivision into sub-groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These sub-groups are called
229
-
<<[UNRESOLVED depending on previous unresolved item]>> `PodSubGroup` if `PodGroup` is chosen, or else `RankedGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
230
228
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called
231
-
<<[UNRESOLVED depending on a previous unresolved item]>> `PodSet` if `PodGroup` is chosen, or else `EqGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
232
229
233
230
### Associating Pod into PodGroups
234
231
@@ -244,8 +241,8 @@ and is defined as following:
244
241
// that a Pod belongs to. The scheduler uses this information to enforce
245
242
// gang scheduling semantics.
246
243
type WorkloadReference struct {
247
-
// Workload defines the name of the Workload object this pod belongs to.
248
-
Workload string
244
+
// Name defines the name of the Workload object this pod belongs to.
245
+
Name string
249
246
250
247
// PodGroup defines the name of the PodGroup within a Workload this pod belongs to.
251
248
PodGroup string
@@ -272,13 +269,13 @@ kind: Workload
272
269
metadata:
273
270
name: jobset
274
271
spec:
275
-
podGroups: # or gangGroups -- TBD
272
+
podGroups:
276
273
- name: "job-1"
277
-
gangMode: Replicated
278
274
replicas: 4
279
-
gangSchedulingPolicy:
280
-
minCount: 100
281
-
schedulingTimeoutSeconds: 60
275
+
policy:
276
+
gang:
277
+
minCount: 100
278
+
schedulingTimeoutSeconds: 60
282
279
```
283
280
284
281
```yaml
@@ -291,7 +288,7 @@ spec:
291
288
workload:
292
289
name: jobset
293
290
podGroup: job-1
294
-
podGroupReplica: 2
291
+
podGroupReplicaIndex: 2
295
292
...
296
293
297
294
```
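The pod-side reference above identifies a gang by Workload name, PodGroup name and replica index. A minimal Go sketch of how a consumer might derive a per-gang grouping key from those three values follows. The `gangKey` helper, the key format, the `ns-1` namespace used in `main`, and the `*int` type of `PodGroupReplicaIndex` are all assumptions made for illustration; the KEP excerpt only shows the `Name` and `PodGroup` fields in Go and the `podGroupReplicaIndex` field in YAML.

```go
package main

import "fmt"

// WorkloadReference follows the struct shown in the KEP excerpt.
// PodGroupReplicaIndex comes from the YAML example; its Go type is assumed here.
type WorkloadReference struct {
	Name                 string
	PodGroup             string
	PodGroupReplicaIndex *int
}

// gangKey is a hypothetical helper: pods that produce the same key would be
// treated as members of the same gang. The key format is illustrative only.
func gangKey(namespace string, ref WorkloadReference) string {
	key := namespace + "/" + ref.Name + "/" + ref.PodGroup
	if ref.PodGroupReplicaIndex != nil {
		// Each replica of a PodGroup gets its own key, so it forms its own gang.
		key += fmt.Sprintf("/%d", *ref.PodGroupReplicaIndex)
	}
	return key
}

func main() {
	idx := 2
	ref := WorkloadReference{Name: "jobset", PodGroup: "job-1", PodGroupReplicaIndex: &idx}
	fmt.Println(gangKey("ns-1", ref)) // prints "ns-1/jobset/job-1/2"
}
```

Including the replica index in the key is what would keep replica 2 of `job-1` separate from the other three replicas in the example Workload.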
@@ -335,60 +332,8 @@ type WorkloadSpec struct {
335
332
PodGroups []PodGroup
336
333
}
337
334
338
-
type GangMode string
339
-
const (
340
-
// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
341
-
GangModeOff GangMode = "Off"
342
-
343
-
// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
344
-
GangModeSingle GangMode = "Single"
345
-
346
-
// GangModeReplicatedGang means that there is a variable number of identical copies of this PodGroup,
347
-
// as specified in Replicas, and each copy needs to be independently gang scheduled.
348
-
GangModeReplicated GangMode = "Replicated"
349
-
)
350
-
351
-
// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
352
-
type GangSchedulingPolicy struct {
353
-
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
354
-
// Namely it's timeout from the moment when `minCount` pods show up in
355
-
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
356
-
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
357
-
// If the timeout is hit, we reject all the waiting pods, free the resources
358
-
// they were reserving and put all of them back to scheduling queue.
359
-
SchedulingTimeoutSeconds *int
360
-
MinCount *int
361
-
}
362
-
363
-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
364
-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
365
-
// number of identical copies.
366
-
//
367
-
// TODO: Decide on the naming: PodGroup vs GangGroup.
368
-
type PodGroup struct {
369
-
Name *string
370
-
GangMode *GangMode // default is "Off"
371
-
372
-
// Optional when GangMode = "ReplicatedGang".
373
-
// Forbidden otherwise.
374
-
Replicas int
375
-
376
-
// GangSchedulingPolicy defines the options applying to all pods in this gang.
377
-
// Forbidden if GangMode is set to "Off".
378
-
GangSchedulingPolicy GangSchedulingPolicy
379
-
}
380
-
381
-
382
-
type WorkloadStatus struct {
383
-
// Necessary status fields TBD.
384
-
}
385
-
```
386
-
387
-
We also consider an alternative API design for PodGroup as following:
388
-
389
-
```go
390
-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
391
-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
335
+
// PodGroup is a group of pods that may contain multiple shapes (PodSets) and may contain
336
+
// multiple dense indexes (PodSubGroups) and which can optionally be replicated in a variable
392
337
// number of identical copies.
393
338
type PodGroup struct {
394
339
Name *string
@@ -422,7 +367,7 @@ type DefaultSchedulingPolicy struct {
422
367
// PodGroup should be handled.
423
368
type GangSchedulingPolicy struct {
424
369
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
425
-
// Namely it's timeout from the moment when `minCount` pods show up in
370
+
// Namely, it is the timeout from the moment when the first pod shows up in
426
371
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
427
372
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
428
373
// If the timeout is hit, we reject all the waiting pods, free the resources
@@ -431,6 +376,10 @@ type GangSchedulingPolicy struct {
431
376
432
377
MinCount *int
433
378
}
379
+
380
+
type WorkloadStatus struct {
381
+
// Necessary status fields TBD.
382
+
}
434
383
```
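To make the interaction between `MinCount` and `SchedulingTimeoutSeconds` concrete, here is a rough Go sketch of the decision the comments above describe: admit the gang once enough pods are waiting, reject all waiting pods once the timeout has elapsed, and otherwise keep waiting. The `decide` function, the `gangDecision` type, the treatment of a nil `MinCount` as "no minimum", and the precedence of admission over timeout are assumptions for illustration, not the scheduler's actual implementation.

```go
package main

import "fmt"

// GangSchedulingPolicy mirrors the two fields shown in the KEP excerpt.
type GangSchedulingPolicy struct {
	SchedulingTimeoutSeconds *int
	MinCount                 *int
}

// gangDecision is a hypothetical result type for this sketch.
type gangDecision string

const (
	gangAdmit  gangDecision = "Admit"  // enough pods are waiting; let the gang proceed
	gangWait   gangDecision = "Wait"   // keep waiting for more pods
	gangReject gangDecision = "Reject" // timeout hit; waiting pods go back to the queue
)

// decide takes the number of pods currently waiting for the gang and the
// seconds elapsed since the first pod of the gang was seen.
func decide(p GangSchedulingPolicy, waiting, elapsedSeconds int) gangDecision {
	minCount := 0 // assumption: nil MinCount means no minimum
	if p.MinCount != nil {
		minCount = *p.MinCount
	}
	if waiting >= minCount {
		return gangAdmit
	}
	if p.SchedulingTimeoutSeconds != nil && elapsedSeconds >= *p.SchedulingTimeoutSeconds {
		return gangReject
	}
	return gangWait
}

func main() {
	minCount, timeout := 100, 60
	p := GangSchedulingPolicy{SchedulingTimeoutSeconds: &timeout, MinCount: &minCount}
	fmt.Println(decide(p, 100, 5)) // prints "Admit"
	fmt.Println(decide(p, 40, 5))  // prints "Wait"
	fmt.Println(decide(p, 40, 60)) // prints "Reject"
}
```

The values 100 and 60 match the `minCount` and `schedulingTimeoutSeconds` used in the YAML examples earlier in the document.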
435
384
436
385
The individual `PodGroups` and `PodGroup` replicas are treated as independent gangs. As an example, if one of