Commit 3ec7426

Merge pull request #5735 from macsko/gang_scheduling_api_update
KEP-4671: Update the API spec to reflect the implementation
2 parents 42c07e5 + ed5c92c commit 3ec7426

1 file changed, +141 -76 lines changed

keps/sig-scheduling/4671-gang-scheduling/README.md

Lines changed: 141 additions & 76 deletions
@@ -128,14 +128,15 @@ The following are non-goals for this KEP but will probably soon appear to be goa
128128

129129
## Proposal
130130

131-
The `spec.workload` field will be added to the Pod resource. A sample pod with this new field looks like this:
131+
The `spec.workloadRef` field will be added to the Pod resource. A sample pod with this new field looks like this:
132132
```yaml
133133
apiVersion: v1
134134
kind: Pod
135135
spec:
136136
...
137137
workloadRef:
138138
name: job-1
139+
podGroup: pg1
139140
...
140141
```
141142

@@ -224,35 +225,64 @@ usecases. You can read more about it in the [extended proposal] document.
224225

225226
* `Workload` is the resource Kind.
226227
* `scheduling.k8s.io` is the ApiGroup.
227-
* `spec.workload` is the name of the new field in pod.
228+
* `spec.workloadRef` is the name of the new field in pod.
228229
* Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named `PodGroup`.
229230
* In the future, we expect that this group can optionally specify further subdivision into sub-groups. Each sub-group can have an index. The indexes go from 0 to N, without repeats or gaps. These sub-groups are called `PodSubGroup`.
230231
* In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called `PodSet`.
231232

232233
### Associating Pod into PodGroups
233234

234-
When a `Workload` consists of a single group of pods needing Gang Scheduling, it is clear which pods belong to the group from the `spec.workload.name` field of the pod. However `Workload` supports listing multiple list items, and a list item can represent a single group, or a set of identical replica groups.
235+
When a `Workload` consists of a single group of pods needing Gang Scheduling, it is clear which pods belong to the group from the `spec.workloadRef.name` field of the pod. However `Workload` supports listing multiple list items, and a list item can represent a single group, or a set of identical replica groups.
235236
In these cases, there needs to be additional information to indicate which group a pod belongs to.
236237

237-
We proposed to extend the newly introduced `pod.spec.workload` field with additional information
238-
to include that information. More specifically, the `pod.spec.workload` field is of type `WorkloadReference`
238+
We proposed to extend the newly introduced `pod.spec.workloadRef` field with additional information
239+
to include that information. More specifically, the `pod.spec.workloadRef` field is of type `WorkloadReference`
239240
and is defined as follows:
240241

241242
```go
243+
type PodSpec struct {
244+
...
245+
// WorkloadRef provides a reference to the Workload object that this Pod belongs to.
246+
// This field is used by the scheduler to identify the PodGroup and apply the
247+
// correct group scheduling policies. The Workload object referenced
248+
// by this field may not exist at the time the Pod is created.
249+
// This field is immutable, but a Workload object with the same name
250+
// may be recreated with different policies. Doing this during pod scheduling
251+
// may result in the placement not conforming to the expected policies.
252+
//
253+
// +featureGate=GenericWorkload
254+
// +optional
255+
WorkloadRef *WorkloadReference
256+
}
257+
242258
// WorkloadReference identifies the Workload object and PodGroup membership
243-
// that a Pod belongs to. The scheduler uses this information to enforce
244-
// gang scheduling semantics.
259+
// that a Pod belongs to. The scheduler uses this information to apply
260+
// workload-aware scheduling semantics.
245261
type WorkloadReference struct {
246-
// Name defines the name of the Workload object this pod belongs to.
247-
Name string
248-
249-
// PodGroup defines the name of the PodGroup within a Workload this pod belongs to.
250-
PodGroup string
251-
// PodGroupReplicaIndex is the replica index of the PodGroup that this pod
252-
// belong to when the workload is running ReplicatedGangMode. In this mode,
253-
// a workload may create multiple identical PodGroups.
254-
// For workload in a different mode, this field is unset.
255-
PodGroupReplicaIndex string
262+
// Name defines the name of the Workload object this Pod belongs to.
263+
// Workload must be in the same namespace as the Pod.
264+
// If it doesn't match any existing Workload, the Pod will remain unschedulable
265+
// until a Workload object is created and observed by the kube-scheduler.
266+
// It must be a DNS subdomain.
267+
//
268+
// +required
269+
Name string
270+
271+
// PodGroup is the name of the PodGroup within the Workload that this Pod
272+
// belongs to. If it doesn't match any existing PodGroup within the Workload,
273+
// the Pod will remain unschedulable until the Workload object is recreated
274+
// and observed by the kube-scheduler. It must be a DNS label.
275+
//
276+
// +required
277+
PodGroup string
278+
279+
// PodGroupReplicaKey specifies the replica key of the PodGroup to which this
280+
// Pod belongs. It is used to distinguish pods belonging to different replicas
281+
// of the same pod group. The pod group policy is applied separately to each replica.
282+
// When set, it must be a DNS label.
283+
//
284+
// +optional
285+
PodGroupReplicaKey string
256286
}
257287
```
258288
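
The field comments above state format constraints (`Name` is a DNS subdomain, `PodGroup` and `PodGroupReplicaKey` are DNS labels) and say that the pod group policy is applied separately to each replica. The following Go sketch illustrates both points. It is not the validation code from the implementation: the regular expressions are simplified RFC 1123 patterns, and the helpers `validateWorkloadRef` and `countPodsPerReplica` are hypothetical names introduced here for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// WorkloadReference mirrors the fields of the type proposed in the KEP.
type WorkloadReference struct {
	Name               string
	PodGroup           string
	PodGroupReplicaKey string
}

// Simplified RFC 1123 patterns. These are assumptions made for this sketch,
// not the validators used by the real API server.
var (
	dnsLabelRe     = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)
	dnsSubdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

func isDNSLabel(s string) bool     { return len(s) <= 63 && dnsLabelRe.MatchString(s) }
func isDNSSubdomain(s string) bool { return len(s) <= 253 && dnsSubdomainRe.MatchString(s) }

// validateWorkloadRef is a hypothetical helper that checks the documented
// format constraints: Name and PodGroup are required, PodGroupReplicaKey is
// optional and only checked when set.
func validateWorkloadRef(ref WorkloadReference) error {
	if !isDNSSubdomain(ref.Name) {
		return fmt.Errorf("name %q must be a DNS subdomain", ref.Name)
	}
	if !isDNSLabel(ref.PodGroup) {
		return fmt.Errorf("podGroup %q must be a DNS label", ref.PodGroup)
	}
	if ref.PodGroupReplicaKey != "" && !isDNSLabel(ref.PodGroupReplicaKey) {
		return fmt.Errorf("podGroupReplicaKey %q must be a DNS label", ref.PodGroupReplicaKey)
	}
	return nil
}

// countPodsPerReplica is a hypothetical helper that counts pods per
// (workload, pod group, replica key) tuple. All references are assumed to come
// from pods in one namespace. The struct contains only strings, so it is
// comparable and can serve directly as a map key.
func countPodsPerReplica(refs []WorkloadReference) map[WorkloadReference]int {
	counts := make(map[WorkloadReference]int)
	for _, ref := range refs {
		counts[ref]++
	}
	return counts
}
```

Keying the count on the whole reference means that two pods with the same `podGroup` but different `podGroupReplicaKey` values land in different buckets, which matches the statement that each replica is evaluated against the policy on its own.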

@@ -273,7 +303,6 @@ metadata:
273303
spec:
274304
podGroups:
275305
- name: "job-1"
276-
replicas: 4
277306
policy:
278307
gang:
279308
minCount: 100
@@ -291,7 +320,6 @@ spec:
291320
podGroup: job-1
292321
podGroupReplicaKey: key-2
293322
...
294-
295323
```
296324

297325
We decided for this option because it is more succinct and makes the role of a pod clear just
@@ -312,77 +340,114 @@ to identify pods belonging to it. However, with this pattern:
312340

313341
The `Workload` type will be defined with the following structure:
314342
```go
343+
// Workload allows for expressing scheduling constraints that should be used
344+
// when managing lifecycle of workloads from scheduling perspective,
345+
// including scheduling, preemption, eviction and other phases.
315346
type Workload struct {
316347
metav1.TypeMeta
348+
// Standard object's metadata.
349+
// Name must be a DNS subdomain.
350+
//
351+
// +optional
317352
metav1.ObjectMeta
353+
354+
// Spec defines the desired behavior of a Workload.
355+
//
356+
// +required
318357
Spec WorkloadSpec
319-
Status WorkloadStatus
320358
}
321359
322-
// WorkloadSpec describes a workload in a portable way that scheduler and related
323-
// tools can understand.
360+
// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
361+
const WorkloadMaxPodGroups = 8
362+
363+
// WorkloadSpec defines the desired state of a Workload.
324364
type WorkloadSpec struct {
325-
// ControllerRef points to the true workload, e.g. Deployment.
326-
// It is optional to set and is intended to make this mapping easier for
327-
// things like CLI tools.
328-
// This field is immutable.
329-
ControllerRef *v1.ObjectReference
330-
331-
// PodGroups is a list of groups of pods.
332-
// Each group may request gang scheduling.
333-
PodGroups []PodGroup
365+
// ControllerRef is an optional reference to the controlling object, such as a
366+
// Deployment or Job. This field is intended for use by tools like CLIs
367+
// to provide a link back to the original workload definition.
368+
// When set, it cannot be changed.
369+
//
370+
// +optional
371+
ControllerRef *TypedLocalObjectReference
372+
373+
// PodGroups is the list of pod groups that make up the Workload.
374+
// The maximum number of pod groups is 8. This field is immutable.
375+
//
376+
// +required
377+
// +listType=map
378+
// +listMapKey=name
379+
PodGroups []PodGroup
334380
}
335381
336-
// PodGroup is a group of pods that may contain multiple shapes (PodSets) and may contain
337-
// multiple dense indexes (PodSubGroups) and which can optionally be replicated in a variable
338-
// number of identical copies.
339-
type PodGroup struct {
340-
Name *string
341-
342-
// Number of identical instances of PodGroup that are part of the Workload.
343-
// Defaults to 1.
344-
Replicas int
382+
// TypedLocalObjectReference allows to reference typed object inside the same namespace.
383+
type TypedLocalObjectReference struct {
384+
// APIGroup is the group for the resource being referenced.
385+
// If APIGroup is empty, the specified Kind must be in the core API group.
386+
// For any other third-party types, setting APIGroup is required.
387+
// It must be a DNS subdomain.
388+
//
389+
// +optional
390+
APIGroup string
391+
// Kind is the type of resource being referenced.
392+
// It must be a path segment name.
393+
//
394+
// +required
395+
Kind string
396+
// Name is the name of resource being referenced.
397+
// It must be a path segment name.
398+
//
399+
// +required
400+
Name string
401+
}
345402
346-
// Policy defines the configuration of the PodGroup to enable different
347-
// scheduling policies.
348-
Policy PodGroupPolicy
403+
// PodGroup represents a set of pods with a common scheduling policy.
404+
type PodGroup struct {
405+
// Name is a unique identifier for the PodGroup within the Workload.
406+
// It must be a DNS label. This field is immutable.
407+
//
408+
// +required
409+
Name string
410+
411+
// Policy defines the scheduling policy for this PodGroup.
412+
//
413+
// +required
414+
Policy PodGroupPolicy
349415
}
350416
351-
// PodGroupPolicy defines scheduling configuration of a PodGroup.
417+
// PodGroupPolicy defines the scheduling configuration for a PodGroup.
352418
type PodGroupPolicy struct {
353-
// Kind indicates which of the other fields is non-empty.
354-
// Required.
355-
// +unionDiscriminator
356-
Kind PodGroupPolicyKind
357-
358-
// Default scheduling policy (default Kubernetes behavior).
359-
Default *DefaultSchedulingPolicy
360-
361-
// Gang scheduling policy (all-or-nothing scheduling semantics)
362-
Gang *GangSchedulingPolicy
419+
// Basic specifies that the pods in this group should be scheduled using
420+
// standard Kubernetes scheduling behavior.
421+
//
422+
// +optional
423+
// +oneOf=PolicySelection
424+
Basic *BasicSchedulingPolicy
425+
426+
// Gang specifies that the pods in this group should be scheduled using
427+
// all-or-nothing semantics.
428+
//
429+
// +optional
430+
// +oneOf=PolicySelection
431+
Gang *GangSchedulingPolicy
363432
}
364433
365-
type PodGroupPolicyKind string
366-
367-
// Supported PodGroupPolicy kinds.
368-
const (
369-
PodGroupPolicyKindDefault PodGroupPolicyKind = "Default"
370-
PodGroupPolicyKindGang PodGroupPolicyKind = "Gang"
371-
)
372-
373-
// DefaultSchedulingPolicy represents default scheduling behavior.
374-
type DefaultSchedulingPolicy struct {
375-
// For now this is effectively just a marker type.
434+
// BasicSchedulingPolicy indicates that standard Kubernetes
435+
// scheduling behavior should be used.
436+
type BasicSchedulingPolicy struct {
437+
// This is intentionally empty. Its presence indicates that the basic
438+
// scheduling policy should be applied. In the future, new fields may appear,
439+
// describing such constraints on a pod group level without "all or nothing"
440+
// (gang) scheduling.
376441
}
377442
378-
// GangSchedulingPolicy represents options for how gang scheduling of one
379-
// PodGroup should be handled.
443+
// GangSchedulingPolicy defines the parameters for gang scheduling.
380444
type GangSchedulingPolicy struct {
381-
MinCount *int
382-
}
383-
384-
type WorkloadStatus struct {
385-
// Necessary status fields TBD.
445+
// MinCount is the minimum number of pods that must be schedulable or scheduled
446+
// at the same time for the scheduler to admit the entire group.
447+
// It must be a positive integer.
448+
//
449+
// +required
450+
MinCount int32
386451
}
387452
```
388453
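
The constraints described in the comments of the new `Workload` types (at most 8 pod groups, unique group names via `+listMapKey=name`, a policy chosen through `+oneOf=PolicySelection`, and a positive `MinCount`) can be sketched as plain Go checks. This is a minimal illustration, not the validation or scheduling code from the implementation. It assumes that exactly one of `Basic` and `Gang` must be set and that an empty `PodGroups` list is rejected. `validatePodGroups` and `admitGroup` are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkloadMaxPodGroups is the limit stated in the KEP.
const WorkloadMaxPodGroups = 8

// The types below mirror the structures proposed in the KEP.
type BasicSchedulingPolicy struct{}

type GangSchedulingPolicy struct {
	MinCount int32
}

type PodGroupPolicy struct {
	Basic *BasicSchedulingPolicy
	Gang  *GangSchedulingPolicy
}

type PodGroup struct {
	Name   string
	Policy PodGroupPolicy
}

// validatePodGroups is a hypothetical helper that enforces the documented
// limits. Treating "oneOf" as "exactly one" is an assumption of this sketch.
func validatePodGroups(groups []PodGroup) error {
	if len(groups) == 0 {
		return errors.New("podGroups is required")
	}
	if len(groups) > WorkloadMaxPodGroups {
		return fmt.Errorf("at most %d pod groups are allowed, got %d", WorkloadMaxPodGroups, len(groups))
	}
	seen := make(map[string]bool, len(groups))
	for _, g := range groups {
		if seen[g.Name] {
			return fmt.Errorf("duplicate pod group name %q", g.Name)
		}
		seen[g.Name] = true
		switch {
		case g.Policy.Basic != nil && g.Policy.Gang != nil:
			return fmt.Errorf("pod group %q sets both basic and gang policies", g.Name)
		case g.Policy.Basic == nil && g.Policy.Gang == nil:
			return fmt.Errorf("pod group %q sets no policy", g.Name)
		case g.Policy.Gang != nil && g.Policy.Gang.MinCount <= 0:
			return fmt.Errorf("pod group %q: minCount must be positive", g.Name)
		}
	}
	return nil
}

// admitGroup is a hypothetical helper showing the all-or-nothing rule: a gang
// group is admitted only when at least MinCount pods are schedulable or
// already scheduled. Groups without a gang policy impose no threshold.
func admitGroup(policy PodGroupPolicy, schedulable int32) bool {
	if policy.Gang == nil {
		return true
	}
	return schedulable >= policy.Gang.MinCount
}
```

With `minCount: 100` as in the sample `Workload` above, `admitGroup` returns false for 99 schedulable pods and true for 100, while a group using the basic policy is never held back by a count.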

@@ -591,13 +656,13 @@ This KEP effectively boils down to two separate functionalities:
591656

592657
When a user upgrades the cluster to a version that supports these two features:
593658
- they can start using the new API by creating Workload objects and linking pods to it via
594-
explicitly specifying their new `spec.workload` field
659+
explicitly specifying their new `spec.workloadRef` field
595660
- scheduler automatically uses the new extensions and tries to schedule all pods from a given
596661
gang in a scheduling group based on the defined `Workload` objects
597662

598663
When a user downgrades the cluster to a version that no longer supports these two features:
599664
- the `Workload` objects can no longer be created (the existing ones are not removed though)
600-
- the `spec.workload` field can no longer be set on the Pods (the already set fields continue
665+
- the `spec.workloadRef` field can no longer be set on the Pods (the already set fields continue
601666
to be set though)
602667
- scheduler reverts to the original behavior of scheduling one pod at a time ignoring
603668
existence of `Workload` objects and pods being linked to them
@@ -673,7 +738,7 @@ those are not yet created automatically behind the scenes.
673738
Yes. The GangScheduling feature gate needs to be switched off to disable gang scheduling
674739
functionality.
675740
If, additionally, the API changes need to be disabled, the GenericWorkload feature gate needs to
676-
also be disabled. However, the content of `spec.workload` fields in Pod objects will not be
741+
also be disabled. However, the content of `spec.workloadRef` fields in Pod objects will not be
677742
cleared, and the existing Workload objects will not be deleted.
678743

679744

@@ -842,7 +907,7 @@ No.
842907

843908
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
844909

845-
Yes. New field (spec.workload) is added to the Pod API:
910+
Yes. New field (spec.workloadRef) is added to the Pod API:
846911
- API type: Pod
847912
- Estimated increase in size: XX-XXX bytes per object (depending on the final choice described
848913
in the Associating Pod into PodGroups section above).
