Commit 61fdfb2

Add workload ref to job spec kep
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
1 parent 7371425 commit 61fdfb2

2 files changed, +272 -0 lines changed

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
# KEP-5547: Expose workloadRef in the Job API for scheduler coordination

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

Introduce a new optional field in the Job API spec that explicitly associates a Job with a Workload object. This enables safe coordination between workload-aware (gang) scheduling and the Job controller without introducing race conditions or forcing the scheduler to perform controller duties.

## Motivation

Workload-aware and gang scheduling treat a group of Pods as a single schedulable unit, which requires the scheduler to know how Pods relate to higher-level workloads. Although the Job controller creates Pods directly, the link from a Job to any Workload is implicit and subject to race conditions when the controller and the scheduler interact.

Without an explicit `workloadRef`, schedulers must guess which Job created a given Pod, which leads to unsafe scheduling decisions or requires speculative heuristics. This KEP makes the Workload-to-Pod relationship first-class by allowing Jobs to opt in to a direct association with a Workload object.

### Goals

### Non-Goals

- Replacing `PodSet` or `minAvailable`; this KEP only enables a cleaner linkage.
- Requiring a `workloadRef`; a Job may be used with or without one.

## Proposal

Add a new optional field to `JobSpec`:

```go
type JobSpec struct {
	// ... existing fields elided ...

	// WorkloadRef allows this job to declare an association to a Workload object.
	// The scheduler may use this to coordinate gang placement or workload-level decisions.
	// This field is optional and has no effect on job execution semantics.
	WorkloadRef *corev1.ObjectReference `json:"workloadRef,omitempty"`
}
```
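
To illustrate how a consumer such as a scheduler plugin might read the optional field, here is a minimal sketch under stated assumptions. It uses trimmed local stand-ins for `JobSpec` and `corev1.ObjectReference` so that it runs without Kubernetes dependencies. The helper names `WorkloadKey` and `Encode`, the example name `training-gang`, and the fallback to the Job's namespace when the reference omits one are hypothetical and are not specified by this KEP.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectReference is a trimmed stand-in for corev1.ObjectReference.
// Assumption: only these four fields matter for the illustration.
type ObjectReference struct {
	APIVersion string `json:"apiVersion,omitempty"`
	Kind       string `json:"kind,omitempty"`
	Namespace  string `json:"namespace,omitempty"`
	Name       string `json:"name,omitempty"`
}

// JobSpec is a trimmed stand-in for the real JobSpec plus the proposed field.
type JobSpec struct {
	Parallelism *int32           `json:"parallelism,omitempty"`
	WorkloadRef *ObjectReference `json:"workloadRef,omitempty"`
}

// WorkloadKey (hypothetical helper) returns "namespace/name" for the referenced
// Workload and reports whether the Job opted in by setting workloadRef.
func WorkloadKey(spec JobSpec, jobNamespace string) (string, bool) {
	ref := spec.WorkloadRef
	if ref == nil || ref.Name == "" {
		return "", false
	}
	ns := ref.Namespace
	if ns == "" {
		// Assumption: an empty namespace means "same namespace as the Job".
		ns = jobNamespace
	}
	return fmt.Sprintf("%s/%s", ns, ref.Name), true
}

// Encode (hypothetical helper) marshals the spec to JSON so the effect of
// the pointer type combined with omitempty is visible.
func Encode(spec JobSpec) (string, error) {
	b, err := json.Marshal(spec)
	return string(b), err
}

func main() {
	plain := JobSpec{}
	linked := JobSpec{WorkloadRef: &ObjectReference{Kind: "Workload", Name: "training-gang"}}
	for _, spec := range []JobSpec{plain, linked} {
		out, err := Encode(spec)
		if err != nil {
			panic(err)
		}
		key, ok := WorkloadKey(spec, "default")
		fmt.Printf("%s key=%q optedIn=%t\n", out, key, ok)
	}
}
```

Because the field is a pointer tagged `omitempty`, a Job that does not set it serializes exactly as before, which fits the stated intent that the field is optional and does not change job execution semantics.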

### User Stories (Optional)

#### Story 1

#### Story 2

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

## Design Details

### Test Plan

[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- `<package>`: `<date>` - `<test coverage>`

##### Integration tests

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)

##### e2e tests

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)

### Graduation Criteria

#### Alpha

- The `workloadRef` field is added to `JobSpec`.
- The Job controller populates the field, guarded by the same feature gate as gang scheduling.
- The scheduler validates the reference and uses it safely.

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [ ] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
  - Components depending on the feature gate:
- [ ] Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control
    plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node?

###### Does enabling the feature change any default behavior?

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

###### What happens if we reenable the feature if it was previously rolled back?

###### Are there any tests for feature enablement/disablement?

### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

###### What specific metrics should inform a rollback?

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

###### How can someone using this feature know that it is working for their instance?

- [ ] Events
  - Event Reason:
- [ ] API .status
  - Condition name:
  - Other field:
- [ ] Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [ ] Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- [ ] Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

193+
### Dependencies
194+
195+
###### Does this feature depend on any specific services running in the cluster?
196+
197+
### Scalability
198+
199+
###### Will enabling / using this feature result in any new API calls?
200+
201+
###### Will enabling / using this feature result in introducing new API types?
202+
203+
###### Will enabling / using this feature result in any new calls to the cloud provider?
204+
205+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
206+
207+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
208+
209+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
210+
211+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
212+
213+
### Troubleshooting
214+
215+
###### How does this feature react if the API server and/or etcd is unavailable?
216+
217+
###### What are other known failure modes?
218+
219+
###### What steps should be taken if SLOs are not being met to determine the problem?
220+
221+
## Implementation History
222+
223+
## Drawbacks
224+
225+
## Alternatives
226+
227+
## Infrastructure Needed (Optional)
228+
229+
NA
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
title: Expose workloadRef in the Job API for scheduler coordination
kep-number: 5547
authors:
  - "@helayoty"
owning-sig:
  - sig-scheduling
  - sig-apps
participating-sigs:
  - sig-scheduling
  - sig-apps
status: implementable
creation-date: 2025-09-19
reviewers:
  - "@janetkuo"
  - "@soltysh"
  - "@erictune"
approvers:
  - "@janetkuo"
  - "@soltysh"

see-also:
  - "/keps/sig-scheduling/4671-gang-scheduling"
replaces: NA

stage: alpha

latest-milestone: "v1.35"

milestone:
  alpha: "v1.35"
  beta: TBD
  stable: TBD

feature-gates:
  - name: TBD
    components:
      - kube-apiserver
      - kube-scheduler
      - kube-controller-manager
disable-supported: true

metrics:
  - TBD
