keps/sig-scheduling/4671-gang-scheduling/README.md
Lines changed: 85 additions & 90 deletions
@@ -106,7 +106,8 @@ The `Workload` object will allow kube-scheduler to be aware that pods are part o
106
106
- Implement the first version of `Workload` API necessary for defining a Gang
107
107
- Ensuring that we can extend the `Workload` API in a backward-compatible way toward the north-star API
108
108
- Ensuring that `Workload` API will be usable for both built-in and third-party workload controllers and APIs
109
-
- Implement first version of gang-scheduling in kube-scheduler
109
+
- Implement the first version of gang scheduling in kube-scheduler, supporting (potentially in a non-optimal way)
110
+
all existing scheduling features.
110
111
- Provide full backward compatibility for all existing scheduling features
111
112
112
113
### Non-Goals
@@ -117,6 +118,7 @@ The `Workload` object will allow kube-scheduler to be aware that pods are part o
117
118
118
119
The following are non-goals for this KEP but will probably soon appear to be goals for follow-up KEPs:
119
120
121
+
- Integrate cluster autoscaling with gang scheduling.
120
122
- Introduce a concept of `Reservation` that can be later consumed by pods.
121
123
- Workload-level preemption.
122
124
- Address resource contention between different schedulers (including possible deadlocks).
@@ -177,12 +179,11 @@ metadata:
177
179
namespace: ns-1
178
180
name: job-1
179
181
spec:
180
-
podGroups: # or gangGroups -- TBD
182
+
podGroups:
181
183
- name: "pg1"
182
-
gangMode: Single
183
-
gangSchedulingPolicy:
184
-
minCount: 100
185
-
schedulingTimeoutSeconds: 60
184
+
policy:
185
+
gang:
186
+
minCount: 100
186
187
```
187
188
188
189
@@ -223,12 +224,9 @@ usecases. You can read more about it in the [extended proposal] document.
223
224
* `Workload` is the resource Kind.
224
225
* `scheduling` is the ApiGroup.
225
226
* `spec.workload` is the name of the new field in pod.
226
-
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named
227
-
<<[UNRESOLVED community feedback requested]>> `PodGroup` or `GangGroup` for the top level. <<[/UNRESOLVED]>>.
228
-
* In the future, we expect that this group can optionally specify further subdivision into sub-groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These sub-groups are called
229
-
<<[UNRESOLVED depending on previous unresolved item]>> `PodSubGroup` if `PodGroup` is chosen, or else `RankedGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
230
-
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called
231
-
<<[UNRESOLVED depending on a previous unresolved item]>> `PodSet` if `PodGroup` is chosen, or else `EqGroup` if `GangGroup` is chosen<<[/UNRESOLVED]>>.
227
+
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named `PodGroup`.
228
+
* In the future, we expect that this group can optionally specify further subdivision into sub-groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These sub-groups are called `PodSubGroup`.
229
+
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called `PodSet`.
232
230
233
231
### Associating Pod into PodGroups
234
232
@@ -244,8 +242,8 @@ and is defined as following:
244
242
// that a Pod belongs to. The scheduler uses this information to enforce
245
243
// gang scheduling semantics.
246
244
type WorkloadReference struct {
247
-
// Workload defines the name of the Workload object this pod belongs to.
248
-
Workload string
245
+
// Name defines the name of the Workload object this pod belongs to.
246
+
Name string
249
247
250
248
// PodGroup defines the name of the PodGroup within a Workload this pod belongs to.
251
249
PodGroup string
@@ -272,13 +270,12 @@ kind: Workload
272
270
metadata:
273
271
name: jobset
274
272
spec:
275
-
podGroups: # or gangGroups -- TBD
273
+
podGroups:
276
274
- name: "job-1"
277
-
gangMode: Replicated
278
275
replicas: 4
279
-
gangSchedulingPolicy:
280
-
minCount: 100
281
-
schedulingTimeoutSeconds: 60
276
+
policy:
277
+
gang:
278
+
minCount: 100
282
279
```
283
280
284
281
```yaml
@@ -291,7 +288,7 @@ spec:
291
288
workload:
292
289
name: jobset
293
290
podGroup: job-1
294
-
podGroupReplica: 2
291
+
podGroupReplicaIndex: 2
295
292
...
296
293
297
294
```
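
The pod-side reference above (`name`, `podGroup`, `podGroupReplicaIndex`) can be read as identifying which gang a pod belongs to. The following is a minimal sketch of that reading, not the KEP's implementation: the `PodGroupReplicaIndex` field type (`*int32`) is an assumption inferred from the YAML example, and `GangKey` is a hypothetical helper that does not appear in the KEP.

```go
package main

import "fmt"

// WorkloadReference mirrors the pod-level reference shown in this diff.
// Name and PodGroup come from the Go snippet in the KEP; PodGroupReplicaIndex
// and its *int32 type are assumptions based on the YAML example.
type WorkloadReference struct {
	Name                 string
	PodGroup             string
	PodGroupReplicaIndex *int32
}

// GangKey is a hypothetical helper (not part of the KEP). It builds a string
// that is identical for all pods of the same gang, assuming that each replica
// of a PodGroup forms its own independent gang.
func GangKey(namespace string, ref WorkloadReference) string {
	key := fmt.Sprintf("%s/%s/%s", namespace, ref.Name, ref.PodGroup)
	if ref.PodGroupReplicaIndex != nil {
		// Replicated PodGroup: the replica index distinguishes the gangs.
		key = fmt.Sprintf("%s/%d", key, *ref.PodGroupReplicaIndex)
	}
	return key
}
```

Under these assumptions, two pods that reference `jobset` / `job-1` with different replica indexes map to different keys, while pods without a replica index share a single key per `PodGroup`.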
@@ -335,60 +332,8 @@ type WorkloadSpec struct {
335
332
PodGroups []PodGroup
336
333
}
337
334
338
-
type GangMode string
339
-
const (
340
-
// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
341
-
GangModeOff GangMode = "Off"
342
-
343
-
// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
344
-
GangModeSingle GangMode = "Single"
345
-
346
-
// GangModeReplicatedGang means that there is a variable number of identical copies of this PodGroup,
347
-
// as specified in Replicas, and each copy needs to be independently gang scheduled.
348
-
GangModeReplicated GangMode = "Replicated"
349
-
)
350
-
351
-
// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
352
-
type GangSchedulingPolicy struct {
353
-
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
354
-
// Namely it's timeout from the moment when `minCount` pods show up in
355
-
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
356
-
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
357
-
// If the timeout is hit, we reject all the waiting pods, free the resources
358
-
// they were reserving and put all of them back to scheduling queue.
359
-
SchedulingTimeoutSeconds *int
360
-
MinCount *int
361
-
}
362
-
363
-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
364
-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
365
-
// number of identical copies.
366
-
//
367
-
// TODO: Decide on the naming: PodGroup vs GangGroup.
368
-
type PodGroup struct {
369
-
Name *string
370
-
GangMode *GangMode // default is "Off"
371
-
372
-
// Optional when GangMode = "ReplicatedGang".
373
-
// Forbidden otherwise.
374
-
Replicas int
375
-
376
-
// GangSchedulingPolicy defines the options applying to all pods in this gang.
377
-
// Forbidden if GangMode is set to "Off".
378
-
GangSchedulingPolicy GangSchedulingPolicy
379
-
}
380
-
381
-
382
-
type WorkloadStatus struct {
383
-
// Necessary status fields TBD.
384
-
}
385
-
```
386
-
387
-
We also consider an alternative API design for PodGroup as following:
388
-
389
-
```go
390
-
// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
391
-
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
335
+
// PodGroup is a group of pods that may contain multiple shapes (PodSets) and may contain
336
+
// multiple dense indexes (PodSubGroups) and which can optionally be replicated in a variable
392
337
// number of identical copies.
393
338
type PodGroup struct {
394
339
Name *string
@@ -421,16 +366,12 @@ type DefaultSchedulingPolicy struct {
421
366
// GangSchedulingPolicy represents options for how gang scheduling of one
422
367
// PodGroup should be handled.
423
368
type GangSchedulingPolicy struct {
424
-
// SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
425
-
// Namely it's timeout from the moment when `minCount` pods show up in
426
-
// PreEnqueue, until those pods are observed in WaitOnPermit - for context
427
-
// see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
428
-
// If the timeout is hit, we reject all the waiting pods, free the resources
429
-
// they were reserving and put all of them back to scheduling queue.
430
-
SchedulingTimeoutSeconds *int
431
-
432
369
MinCount *int
433
370
}
371
+
372
+
type WorkloadStatus struct {
373
+
// Necessary status fields TBD.
374
+
}
434
375
```
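
With `SchedulingTimeoutSeconds` removed, `MinCount` is the only field left in `GangSchedulingPolicy` in this hunk. As a rough illustration of how such a threshold might be consumed, here is a sketch under stated assumptions: `GangQuorumReached` is a hypothetical helper, and treating a missing policy or a nil `MinCount` as "no gang constraint" is an assumption, not something the KEP specifies.

```go
package main

// GangSchedulingPolicy mirrors the type as it stands after this change.
type GangSchedulingPolicy struct {
	MinCount *int
}

// GangQuorumReached is a hypothetical helper (not part of the KEP). It reports
// whether enough pods of a gang have been observed to attempt scheduling them
// together. A nil policy or nil MinCount is treated as "no gang constraint",
// which is an assumption made for this sketch.
func GangQuorumReached(policy *GangSchedulingPolicy, observedPods int) bool {
	if policy == nil || policy.MinCount == nil {
		return true
	}
	return observedPods >= *policy.MinCount
}
```

For the `minCount: 100` examples in this diff, the helper would return false for 99 observed pods and true for 100 or more.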
435
376
436
377
The individual `PodGroups` and `PodGroup` replicas are treated as independent gangs. As an example, if one of