|
| 1 | +# KEP-5547: Expose workloadRef in the Job API for scheduler coordination |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 5 | +- [Summary](#summary) |
| 6 | +- [Motivation](#motivation) |
| 7 | +  - [Goals](#goals) |
| 8 | +  - [Non-Goals](#non-goals) |
| 9 | +- [Proposal](#proposal) |
| 10 | +  - [User Stories (Optional)](#user-stories-optional) |
| 11 | +    - [Story 1: Coordinated Gang Scheduling for ML Training Jobs](#story-1-coordinated-gang-scheduling-for-ml-training-jobs) |
| 12 | +    - [Story 2: Prevent Race Conditions Between Job Controller and Scheduler](#story-2-prevent-race-conditions-between-job-controller-and-scheduler) |
| 13 | +  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) |
| 14 | +  - [Risks and Mitigations](#risks-and-mitigations) |
| 15 | +    - [Misconfiguration or Invalid References](#misconfiguration-or-invalid-references) |
| 16 | +    - [API Coupling and Evolution Risk](#api-coupling-and-evolution-risk) |
| 17 | +- [Design Details](#design-details) |
| 18 | +  - [Test Plan](#test-plan) |
| 19 | +    - [Prerequisite testing updates](#prerequisite-testing-updates) |
| 20 | +    - [Unit tests](#unit-tests) |
| 21 | +    - [Integration tests](#integration-tests) |
| 22 | +    - [e2e tests](#e2e-tests) |
| 23 | +  - [Graduation Criteria](#graduation-criteria) |
| 24 | +    - [Alpha](#alpha) |
| 25 | +  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 26 | +  - [Version Skew Strategy](#version-skew-strategy) |
| 27 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 28 | +  - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 29 | +  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 30 | +  - [Monitoring Requirements](#monitoring-requirements) |
| 31 | +  - [Dependencies](#dependencies) |
| 32 | +  - [Scalability](#scalability) |
| 33 | +  - [Troubleshooting](#troubleshooting) |
| 34 | +- [Implementation History](#implementation-history) |
| 35 | +- [Drawbacks](#drawbacks) |
| 36 | +- [Alternatives](#alternatives) |
| 37 | +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) |
| 38 | +<!-- /toc --> |
| 39 | + |
| 40 | +## Release Signoff Checklist |
| 41 | + |
| 42 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 43 | + |
| 44 | +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 45 | +- [x] (R) KEP approvers have approved the KEP status as `implementable` |
| 46 | +- [ ] (R) Design details are appropriately documented |
| 47 | +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) |
| 48 | +  - [ ] e2e Tests for all Beta API Operations (endpoints) |
| 49 | +  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) |
| 50 | +  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free |
| 51 | +- [ ] (R) Graduation criteria is in place |
| 52 | +  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA |
| 53 | +- [ ] (R) Production readiness review completed |
| 54 | +- [ ] (R) Production readiness review approved |
| 55 | +- [ ] "Implementation History" section is up-to-date for milestone |
| 56 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 57 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 58 | + |
| 59 | +## Summary |
| 60 | + |
| 61 | +Introduce a new optional field in the Job API spec that explicitly associates a Job with a Workload object. This enables safe coordination between workload-aware (gang) scheduling and the Job controller without introducing race conditions or forcing the scheduler to perform controller duties. |
| 62 | + |
| 63 | +## Motivation |
| 64 | + |
| 65 | +Workload-aware and gang scheduling treat a group of Pods as a single schedulable unit, which requires the scheduler to know how Pods relate to higher-level workloads. While the Job controller currently creates Pods directly, the linkage to any Workload concept is implicit and subject to race conditions during controller and scheduler interactions. |
| 66 | + |
| 67 | +Without an explicit `workloadRef`, schedulers must guess which Job created a given Pod, which leads to unsafe scheduling or requires speculative heuristics. This KEP makes the workload-pod relation first-class by allowing Jobs to opt in to a direct association with a Workload object. |
| 68 | + |
| 69 | +### Goals |
| 70 | + |
| 71 | +- Introduce a new optional `workloadRef` field in the `JobSpec`, allowing a Job to declare an explicit association with a higher-level workload object. |
| 72 | +- Keep the Job API backward-compatible and aligned with SIG Apps ownership, without altering existing Job behavior or introducing mandatory new semantics. |
| 73 | + |
| 74 | +### Non-Goals |
| 75 | + |
| 76 | +- Replacing `PodSet` or `minAvailable` directly. This KEP only enables a cleaner linkage to them. |
| 77 | +- Enforcing mutual exclusivity (i.e., a Job may be used with or without a `workloadRef`). |
| 78 | + |
| 79 | +## Proposal |
| 80 | + |
| 81 | +Add a new optional field to JobSpec: |
| 82 | + |
| 83 | +```go |
| 84 | +type JobSpec struct { |
| 85 | +	// ... existing JobSpec fields elided ... |
| 86 | +	// WorkloadRef allows this Job to declare an association with a Workload object. |
| 87 | +	// The scheduler may use this to coordinate gang placement or workload-level decisions. |
| 88 | +	// This field is optional and has no effect on Job execution semantics. |
| 89 | +	WorkloadRef *corev1.ObjectReference `json:"workloadRef,omitempty"` |
| 90 | +} |
| 91 | +``` |
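As a rough illustration of the backward-compatibility goal, the sketch below shows how the `omitempty` tag keeps an unset `workloadRef` out of the serialized spec. The `ObjectReference` and `JobSpec` types here are trimmed-down local stand-ins (assumptions made so the example has no dependencies), not the real `corev1` and `batch/v1` types, and `encode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectReference is a hypothetical, trimmed-down stand-in for
// corev1.ObjectReference, kept local so the sketch is self-contained.
type ObjectReference struct {
	APIVersion string `json:"apiVersion,omitempty"`
	Kind       string `json:"kind,omitempty"`
	Namespace  string `json:"namespace,omitempty"`
	Name       string `json:"name,omitempty"`
}

// JobSpec mirrors only the fields needed for this illustration.
type JobSpec struct {
	Parallelism *int32           `json:"parallelism,omitempty"`
	WorkloadRef *ObjectReference `json:"workloadRef,omitempty"`
}

// encode returns the JSON form of a JobSpec, or an error string.
func encode(spec JobSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// A Job that does not opt in serializes without a workloadRef key.
	fmt.Println(encode(JobSpec{}))

	// A Job that opts in carries the structured reference.
	fmt.Println(encode(JobSpec{WorkloadRef: &ObjectReference{
		APIVersion: "scheduling/v1alpha1",
		Kind:       "Workload",
		Namespace:  "demo-workload",
		Name:       "w-job-1",
	}}))
}
```

Because the field is a pointer with `omitempty`, existing Job manifests and clients that never set it would see no change in the encoded object under these assumptions.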
| 92 | + |
| 93 | +### User Stories (Optional) |
| 94 | + |
| 95 | +#### Story 1: Coordinated Gang Scheduling for ML Training Jobs |
| 96 | + |
| 97 | +**Context**: As a platform operator running ML training pipelines composed of multiple Jobs, I want to associate each Job with a Workload object that specifies gang scheduling constraints (e.g., `minAvailable`), so that the scheduler can treat the set of Pods across Jobs as a single schedulable unit and either co-schedule them or delay them all together, without having to infer the workload topology from labels or timing. |
| 98 | + |
| 99 | +**Example Configuration:** |
| 100 | +```yaml |
| 101 | +apiVersion: batch/v1 |
| 102 | +kind: Job |
| 103 | +metadata: |
| 104 | +  name: job-1 |
| 105 | +spec: |
| 106 | +  # ... |
| 107 | +  workloadRef: |
| 108 | +    apiVersion: scheduling/v1alpha1 |
| 109 | +    name: w-job-1 |
| 110 | +    namespace: demo-workload |
| 111 | +  template: |
| 112 | +    spec: |
| 113 | +      containers: |
| 114 | +      - name: job-container |
| 115 | +        image: job-image |
| 116 | +        command: ["./sample"] |
| 117 | +        # ... |
| 118 | +``` |
| 119 | + |
| 120 | +#### Story 2: Prevent Race Conditions Between Job Controller and Scheduler |
| 121 | + |
| 122 | +**Context**: As a scheduler maintainer, I want the Job object to explicitly declare which workload it belongs to via a structured `workloadRef`, so that I can fetch the workload metadata during scheduling without relying on label selectors or waiting for controller propagation, and avoid risky correlation logic or inconsistent state between Job creation and Pod scheduling. |
| 123 | + |
| 124 | +**Example Configuration:** |
| 125 | + |
| 126 | +```yaml |
| 127 | +apiVersion: scheduling/v1alpha1 |
| 128 | +kind: Workload |
| 129 | +metadata: |
| 130 | +  name: w-job-2 |
| 131 | +  namespace: demo-workload |
| 132 | +spec: |
| 133 | +  controllerRef: |
| 134 | +    name: job-2 |
| 135 | +    kind: Job |
| 136 | +    apiGroup: batch |
| 137 | +  # ... |
| 138 | +--- |
| 139 | +apiVersion: batch/v1 |
| 140 | +kind: Job |
| 141 | +metadata: |
| 142 | +  name: job-2 |
| 143 | +spec: |
| 144 | +  # ... |
| 145 | +  workloadRef: |
| 146 | +    apiVersion: scheduling/v1alpha1 |
| 147 | +    name: w-job-2 |
| 148 | +    namespace: demo-workload |
| 149 | +  template: |
| 150 | +    spec: |
| 151 | +      containers: |
| 152 | +      - name: job-container |
| 153 | +        image: job-image |
| 154 | +        command: ["./sample"] |
| 155 | +        # ... |
| 156 | +``` |
| 157 | +### Notes/Constraints/Caveats (Optional) |
| 158 | + |
| 159 | +### Risks and Mitigations |
| 160 | + |
| 161 | +#### Misconfiguration or Invalid References |
| 162 | + |
| 163 | +**Risk Description**: Users or controllers may set an invalid or non-existent `workloadRef`, pointing to a workload that doesn’t exist, is in the wrong namespace, or isn’t intended to be compatible with the scheduler logic. |
| 164 | + |
| 165 | +**Mitigation Strategies**: |
| 166 | + |
| 167 | +- Controllers and admission webhooks validate the presence and correctness of the referenced object. |
| 168 | +- The scheduler should fail gracefully if the `workloadRef` cannot be resolved or is incompatible. |
| 169 | +- The field is optional, which means the default behavior is preserved when unset. |
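The validation mentioned in the mitigations above could look roughly like the following sketch. It is not the KEP's validation logic: the `ObjectReference` type is a local stand-in, `validateWorkloadRef` and the error values are hypothetical names, and the rule that the referenced Workload must live in the Job's namespace is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ObjectReference is a hypothetical local stand-in for corev1.ObjectReference.
type ObjectReference struct {
	APIVersion, Kind, Namespace, Name string
}

// Hypothetical validation errors used by this sketch.
var (
	errNoName         = errors.New("workloadRef.name must be set")
	errCrossNamespace = errors.New("workloadRef.namespace must match the Job namespace")
	errWrongKind      = errors.New("workloadRef.kind must be Workload")
)

// validateWorkloadRef checks a Job's optional workloadRef.
// A nil reference is valid: the field is optional and default behavior applies.
func validateWorkloadRef(jobNamespace string, ref *ObjectReference) error {
	if ref == nil {
		return nil
	}
	if ref.Name == "" {
		return errNoName
	}
	// Assumption: an empty namespace defaults to the Job's namespace,
	// and cross-namespace references are rejected.
	if ref.Namespace != "" && ref.Namespace != jobNamespace {
		return errCrossNamespace
	}
	// Assumption: an empty kind is tolerated; any other kind is rejected.
	if ref.Kind != "" && ref.Kind != "Workload" {
		return errWrongKind
	}
	return nil
}

func main() {
	ref := &ObjectReference{
		APIVersion: "scheduling/v1alpha1",
		Kind:       "Workload",
		Namespace:  "demo-workload",
		Name:       "w-job-1",
	}
	fmt.Println(validateWorkloadRef("demo-workload", ref))
	fmt.Println(validateWorkloadRef("other-namespace", ref))
}
```

A check like this only covers the shape of the reference. Whether the Workload actually exists still has to be resolved at admission or scheduling time.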
| 170 | + |
| 171 | +#### API Coupling and Evolution Risk |
| 172 | + |
| 173 | +**Risk Description**: If the Workload API evolves (e.g., its API group changes), older Jobs with a `workloadRef` might break or behave unexpectedly. |
| 174 | + |
| 175 | +**Mitigation Strategies**: |
| 176 | + |
| 177 | +- The use of a structured `ObjectReference` (vs. the workload name string) allows future evolution of the workload object’s type/version. |
| 178 | +- The scheduler should resolve and type-check the object at runtime, enforcing known versions/kinds before attempting coordination. |
| 179 | +- API evolution policies apply to the Workload resource itself. |
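One way to picture the runtime type check described above is an allow-list of group/version/kind tuples that the scheduler consults before coordinating. The sketch below is illustrative only: `groupVersionKind`, `supported`, `parseGVK`, and `canCoordinate` are hypothetical names, the `scheduling/v1alpha1` entry is taken from the examples in this KEP, and the reference is assumed to carry an explicit `kind`.

```go
package main

import (
	"fmt"
	"strings"
)

// ObjectReference is a hypothetical local stand-in for corev1.ObjectReference.
type ObjectReference struct {
	APIVersion, Kind, Namespace, Name string
}

// groupVersionKind identifies an API type.
type groupVersionKind struct {
	Group, Version, Kind string
}

// supported lists the types a hypothetical scheduler build understands.
var supported = map[groupVersionKind]bool{
	{Group: "scheduling", Version: "v1alpha1", Kind: "Workload"}: true,
}

// parseGVK splits an apiVersion of the form "group/version".
// A value without a slash is treated as a version in the core (empty) group.
func parseGVK(ref ObjectReference) groupVersionKind {
	i := strings.LastIndex(ref.APIVersion, "/")
	if i < 0 {
		return groupVersionKind{Group: "", Version: ref.APIVersion, Kind: ref.Kind}
	}
	return groupVersionKind{
		Group:   ref.APIVersion[:i],
		Version: ref.APIVersion[i+1:],
		Kind:    ref.Kind,
	}
}

// canCoordinate reports whether workload-level coordination should be attempted.
// Unknown or unset references fall back to default per-Pod scheduling.
func canCoordinate(ref *ObjectReference) bool {
	if ref == nil {
		return false
	}
	return supported[parseGVK(*ref)]
}

func main() {
	known := &ObjectReference{APIVersion: "scheduling/v1alpha1", Kind: "Workload", Name: "w-job-2"}
	unknown := &ObjectReference{APIVersion: "scheduling/v2", Kind: "Workload", Name: "w-job-2"}
	fmt.Println(canCoordinate(known))
	fmt.Println(canCoordinate(unknown))
}
```

Treating an unrecognized version as "do not coordinate" rather than as an error keeps older Jobs schedulable if the Workload API moves, at the cost of silently losing gang semantics for them.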
| 180 | + |
| 181 | +## Design Details |
| 182 | + |
| 183 | +### Test Plan |
| 184 | + |
| 185 | +[x] I/we understand the owners of the involved components may require updates to |
| 186 | +existing tests to make this code solid enough prior to committing the changes necessary |
| 187 | +to implement this enhancement. |
| 188 | + |
| 189 | +##### Prerequisite testing updates |
| 190 | + |
| 191 | +##### Unit tests |
| 192 | + |
| 193 | +- `<package>`: `<date>` - `<test coverage>` |
| 194 | + |
| 195 | +##### Integration tests |
| 196 | + |
| 197 | +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) |
| 198 | + |
| 199 | +##### e2e tests |
| 200 | + |
| 201 | +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) |
| 202 | + |
| 203 | +### Graduation Criteria |
| 204 | + |
| 205 | +#### Alpha |
| 206 | + |
| 207 | +- Field added to `JobSpec`. |
| 208 | +- Job controller populates it, gated by the same feature gate as gang scheduling. |
| 209 | +- Scheduler validates and uses it safely. |
| 210 | + |
| 211 | +### Upgrade / Downgrade Strategy |
| 212 | + |
| 213 | +### Version Skew Strategy |
| 214 | + |
| 215 | +## Production Readiness Review Questionnaire |
| 216 | + |
| 217 | +### Feature Enablement and Rollback |
| 218 | + |
| 219 | +###### How can this feature be enabled / disabled in a live cluster? |
| 220 | + |
| 221 | +- [ ] Feature gate (also fill in values in `kep.yaml`) |
| 222 | +  - Feature gate name: |
| 223 | +  - Components depending on the feature gate: |
| 224 | +- [ ] Other |
| 225 | +  - Describe the mechanism: |
| 226 | +  - Will enabling / disabling the feature require downtime of the control |
| 227 | +    plane? |
| 228 | +  - Will enabling / disabling the feature require downtime or reprovisioning |
| 229 | +    of a node? |
| 230 | + |
| 231 | +###### Does enabling the feature change any default behavior? |
| 232 | + |
| 233 | +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? |
| 234 | + |
| 235 | +###### What happens if we reenable the feature if it was previously rolled back? |
| 236 | + |
| 237 | +###### Are there any tests for feature enablement/disablement? |
| 238 | + |
| 239 | +### Rollout, Upgrade and Rollback Planning |
| 240 | + |
| 241 | +###### How can a rollout or rollback fail? Can it impact already running workloads? |
| 242 | + |
| 243 | +###### What specific metrics should inform a rollback? |
| 244 | + |
| 245 | +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? |
| 246 | + |
| 247 | +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? |
| 248 | + |
| 249 | +### Monitoring Requirements |
| 250 | + |
| 251 | +###### How can an operator determine if the feature is in use by workloads? |
| 252 | + |
| 253 | +###### How can someone using this feature know that it is working for their instance? |
| 254 | + |
| 255 | +- [ ] Events |
| 256 | +  - Event Reason: |
| 257 | +- [ ] API .status |
| 258 | +  - Condition name: |
| 259 | +  - Other field: |
| 260 | +- [ ] Other (treat as last resort) |
| 261 | +  - Details: |
| 262 | + |
| 263 | +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? |
| 264 | + |
| 265 | +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? |
| 266 | + |
| 267 | +- [ ] Metrics |
| 268 | +  - Metric name: |
| 269 | +  - [Optional] Aggregation method: |
| 270 | +  - Components exposing the metric: |
| 271 | +- [ ] Other (treat as last resort) |
| 272 | +  - Details: |
| 273 | + |
| 274 | +###### Are there any missing metrics that would be useful to have to improve observability of this feature? |
| 275 | + |
| 276 | +### Dependencies |
| 277 | + |
| 278 | +###### Does this feature depend on any specific services running in the cluster? |
| 279 | + |
| 280 | +### Scalability |
| 281 | + |
| 282 | +###### Will enabling / using this feature result in any new API calls? |
| 283 | + |
| 284 | +###### Will enabling / using this feature result in introducing new API types? |
| 285 | + |
| 286 | +###### Will enabling / using this feature result in any new calls to the cloud provider? |
| 287 | + |
| 288 | +###### Will enabling / using this feature result in increasing size or count of the existing API objects? |
| 289 | + |
| 290 | +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? |
| 291 | + |
| 292 | +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? |
| 293 | + |
| 294 | +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? |
| 295 | + |
| 296 | +### Troubleshooting |
| 297 | + |
| 298 | +###### How does this feature react if the API server and/or etcd is unavailable? |
| 299 | + |
| 300 | +###### What are other known failure modes? |
| 301 | + |
| 302 | +###### What steps should be taken if SLOs are not being met to determine the problem? |
| 303 | + |
| 304 | +## Implementation History |
| 305 | + |
| 306 | +## Drawbacks |
| 307 | + |
| 308 | +## Alternatives |
| 309 | + |
| 310 | +## Infrastructure Needed (Optional) |
| 311 | + |
| 312 | +NA |