Commit c9e75ba

Address feedback by remove Ge/Le and add DRA example
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
1 parent 35389d0 commit c9e75ba

2 files changed

+95
-60
lines changed

keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md

Lines changed: 88 additions & 53 deletions
Original file line number | Diff line number | Diff line change
@@ -14,10 +14,10 @@
1414
- [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
1515
- [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
1616
- [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
17+
- [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
1718
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
1819
- [Risks and Mitigations](#risks-and-mitigations)
1920
- [Scheduler Performance Regression](#scheduler-performance-regression)
20-
- [User Confusion Between String and Numeric Semantics](#user-confusion-between-string-and-numeric-semantics)
2121
- [API Compatibility and Version Skew](#api-compatibility-and-version-skew)
2222
- [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing)
2323
- [Cross-SIG Impact](#cross-sig-impact)
@@ -73,7 +73,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
7373

7474
Extend **core/v1 Toleration** to support **numeric comparison operators** when matching **Node Taints**:
7575

76-
- New operators: `Lt`, `Le`, `Ge`, `Gt` (in addition to existing `Equal`/`Exists`).
76+
- New operators: `Lt`, `Gt` (in addition to existing `Equal`/`Exists`).
7777
- Primary motivation: allow pods to opt‑in to nodes by `SLA/failure‑probability` values published as taints (e.g., `node.kubernetes.io/sla=950`).
7878
- Scheduler impact is limited to the existing TaintToleration Filter; no new stages or algorithms.
7979

@@ -96,7 +96,7 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
9696

9797
### Goals
9898

99-
- Add comparison operators to Tolerations so pods can match taints like `node.kubernetes.io/sla=<int>` using thresholds.
99+
- Add comparison operators to tolerations so pods can match taints like `node.kubernetes.io/sla=<int>` using thresholds.
100100
- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
101101
- Backward compatible and opt‑in via a feature gate.
102102

@@ -108,20 +108,20 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
108108

109109
### Benefits for implementing this feature for DRA and AI Workloads
110110

111-
In addition to general scheduling improvements, SLA‑aware opt‑in via Tolerations has specific advantages for `Dynamic Resource Allocation` (DRA) and `AI/ML`:
111+
In addition to general scheduling improvements, SLA‑aware opt‑in via tolerations has specific advantages for `Dynamic Resource Allocation (DRA)` and `AI/ML`:
112112

113-
For DRA, resource claims (e.g., GPUs/accelerators) can be steered by node reliability: critical claims stay on high‑SLA capacity; batch/cheap claims can land on lower‑SLA pools. Taints provide a default drive away from risky pools and `NoExecute` eviction if a pool degrades.
113+
- DRA steers GPU/accelerator resource claims by node reliability: critical workloads get high‑SLA capacity while batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades.
114114

115-
For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch preprocessing, fine‑tuning, or embedding generation to spot nodes. When spot nodes are reclaimed, `NoExecute` or `NoSchedule` effects plus tolerations allow graceful drain and controlled failover. In multi‑tenant GPU clusters, taints bound access to the reliable pools (fairness), and during autoscaling bursts, extra replicas can safely land on low‑SLA pools with explicit opt‑in.
115+
- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover.
116116

117-
| Benefit | Impact on DRA | Impact on AI/ML workloads |
118-
| --------------------------------- | --------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
119-
| **Cost–reliability optimization** | Bind/keep claims on reliability tiers via taints (+ tolerations to opt-in). | Keep latency-critical inference on high-SLA; shift batch to spot. |
120-
| **Stage-aware placement** | Steer per-stage claims to tiers consistently with node policy. | Different stages tolerate different risk; make that explicit via tolerations. |
121-
| **Resilience after preemption** | Use `NoExecute`/`tolerationSeconds` for graceful drain; re-admit on stable tiers. | Training/services recover faster with predictable eviction semantics. |
122-
| **Multi-tenant fairness** | Avoid monopolization of high-SLA tiers by requiring explicit tolerations. | Fair access to reliable accelerators across teams. |
123-
| **Smooth burst handling** | Bursts land on low-SLA pools via opt-in; baseline remains on high-SLA. | HPA can scale to spot with clear safety boundaries. |
124-
| **Operational clarity** | Node-side policy is auditable and centralized. | Platform teams can document and enforce reliability classes cleanly. |
117+
| Benefit | Impact on DRA | Impact on AI/ML workloads |
118+
| ------------------------------ | --------------------------------------------------------- | ------------------------------------------------------- |
119+
| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; batch uses spot | Inference on reliable nodes; training on cheaper pools |
120+
| **Workload-aware placement** | Different claim types target appropriate node tiers | Pipeline stages match their reliability requirements |
121+
| **Graceful preemption** | `NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads |
122+
| **Resource fairness** | Prevents monopolization of premium capacity | Teams share reliable accelerators fairly |
123+
| **Elastic scaling** | Bursts overflow to lower-SLA pools safely | HPA scales to spot with clear boundaries |
124+
| **Policy transparency** | Node reliability classes are explicit and auditable | Platform teams enforce clear reliability tiers |
125125

126126
## Proposal
127127

@@ -131,7 +131,7 @@ For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on hi
131131

132132
As a cluster operator, I want a default repel from spot (low-SLA) nodes so that only workloads that explicitly tolerate them can land there.
133133

134-
I also want to set numeric SLA thresholds in tolerations (e.g., `Ge 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.
134+
I also want to set numeric SLA thresholds in tolerations (e.g., `Gt 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.
135135

136136
**Example Configuration:**
137137

@@ -153,7 +153,7 @@ kind: Pod
153153
spec:
154154
tolerations:
155155
- key: node.kubernetes.io/sla
156-
operator: Ge
156+
operator: Gt
157157
value: "750"
158158
effect: NoSchedule
159159
```
@@ -188,7 +188,7 @@ spec:
188188
spec:
189189
tolerations:
190190
- key: node.kubernetes.io/sla
191-
operator: Ge
191+
operator: Gt
192192
value: "950"
193193
effect: NoExecute
194194
tolerationSeconds: 30
@@ -211,7 +211,7 @@ metadata:
211211
spec:
212212
tolerations:
213213
- key: node.kubernetes.io/sla
214-
operator: Ge
214+
operator: Gt
215215
value: "999" # 99.9% SLA
216216
effect: NoSchedule
217217
containers:
@@ -228,7 +228,7 @@ metadata:
228228
spec:
229229
tolerations:
230230
- key: node.kubernetes.io/sla
231-
operator: Ge
231+
operator: Gt
232232
value: "800" # 80% SLA acceptable
233233
effect: NoSchedule
234234
containers:
@@ -269,7 +269,7 @@ spec:
269269
resourceClaimName: gpu-claim-high-sla
270270
tolerations:
271271
- key: node.kubernetes.io/sla
272-
operator: Ge
272+
operator: Gt
273273
value: "950" # Ensure GPU nodes meet SLA requirements
274274
effect: NoSchedule
275275
containers:
@@ -279,11 +279,68 @@ spec:
279279
- name: gpu-claim
280280
```
281281
282+
#### Story 5 — DRA device-level error budget management
283+
284+
As a platform engineer managing GPU clusters with varying reliability states, I want to allocate devices based on their remaining error budget using numeric tolerations, so that critical workloads only get devices with sufficient reliability headroom while degraded devices can still serve less sensitive workloads.
285+
286+
This gives critical inference fresh devices (>24h error budget), lets batch training use aging devices (1-24h), and excludes severely degraded devices (<1h) from allocation entirely, enabling graceful device lifecycle management.
287+
288+
**Example Configuration:**
289+
290+
```yaml
291+
# Driver taints devices with low error budget
292+
kind: ResourceSlice
293+
spec:
294+
driver: device.example.com
295+
devices:
296+
- name: gpu-node-01-device-0
297+
attributes:
298+
memory: "32Gi"
299+
compute-capability: "8.6"
300+
# Driver applies taint when error budget drops below 10 hours
301+
taints:
302+
- key: device.example.com/error-budget-in-hours
303+
value: "8" # 8 hours remaining
304+
effect: NoSchedule
305+
---
306+
# Critical inference workload requires high-reliability devices
307+
kind: ResourceClaim
308+
metadata:
309+
name: inference-gpu-claim
310+
spec:
311+
requests:
312+
- name: high-reliability-gpu
313+
deviceClassName: device.example.com
314+
tolerations:
315+
# Only accept devices with >24 hours error budget
316+
- key: device.example.com/error-budget-in-hours
317+
operator: Gt
318+
value: "24"
319+
effect: NoSchedule
320+
---
321+
# Batch training workload tolerates degraded devices
322+
kind: ResourceClaim
323+
metadata:
324+
name: training-gpu-claim
325+
spec:
326+
requests:
327+
- name: batch-gpu
328+
deviceClassName: device.example.com
329+
tolerations:
330+
# Accept devices with >1 hour error budget
331+
- key: device.example.com/error-budget-in-hours
332+
operator: Gt
333+
value: "1"
334+
effect: NoSchedule
335+
```
336+
282337
### Notes/Constraints/Caveats (Optional)
283338
284-
- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Decimal values (e.g., `"95.5"`) will be rejected by API validation when using numeric operators.
339+
- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Pod specs containing toleration values with decimal numbers (e.g., `"95.5"`) will be rejected by the API server during validation when using numeric comparison operators.
285340
286-
- **Parsing Requirements**: Both taint value and toleration value must be parseable as integers for numeric operators (`Lt`, `Le`, `Ge`, `Gt`). If either fails parsing, the toleration does not match.
341+
- **Parsing Requirements**: The toleration value must be parseable as an integer for numeric operators (`Lt`, `Gt`). If parsing fails, the toleration does not match.
342+
343+
> Note: A taint like `foo=95.5:NoSchedule` is valid since taint values follow label value syntax, which allows the `.` character. Numeric parsing/validation is enforced on the toleration **only**.
287344

288345
- **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators.
289346

@@ -302,21 +359,9 @@ spec:
302359
**Mitigation**:
303360

304361
- Parse integers only when new operators are used (no impact on existing workloads)
305-
- Implement microbenchmarks during development to measure parsing overhead
306362
- Consider caching parsed values in scheduler data structures if performance issues arise
307363
- Feature gate allows disabling if performance problems occur
308364

309-
#### User Confusion Between String and Numeric Semantics
310-
311-
**Risk**: Users might expect numeric comparison with `Equal` operator or string comparison with `Ge` operator, leading to mismatched tolerations.
312-
313-
**Mitigation**:
314-
315-
- Clear documentation distinguishing string vs. numeric operators
316-
- API validation provides specific error messages for malformed numeric values
317-
- Examples in documentation show proper usage patterns
318-
- Consider adding warnings/events when numeric values are used with string operators
319-
320365
#### API Compatibility and Version Skew
321366

322367
**Risk**: Pods using new operators cannot be scheduled if some schedulers don't support the feature, creating deployment failures during upgrades.
@@ -356,8 +401,6 @@ spec:
356401
Extend `core/v1.Toleration.Operator` to accept, in addition to `Equal` and `Exists`:
357402

358403
- `Lt`: match if toleration.value < taint.value
359-
- `Le`: match if toleration.value <= taint.value
360-
- `Ge`: match if toleration.value >= taint.value
361404
- `Gt`: match if toleration.value > taint.value
362405
- `Equal`/`Exists`: Remain unchanged
363406

@@ -371,8 +414,6 @@ const (
371414
372415
// New numeric comparison operators (feature-gated)
373416
TolerationOpLt TolerationOperator = "Lt" // Less than
374-
TolerationOpLe TolerationOperator = "Le" // Less than or equal
375-
TolerationOpGe TolerationOperator = "Ge" // Greater than or equal
376417
TolerationOpGt TolerationOperator = "Gt" // Greater than
377418
)
378419
```
@@ -389,7 +430,7 @@ To honor Kubernetes APIs that avoids floating-point numbers where possible due t
389430

390431
```go
391432
const (
392-
// TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Le, Ge, Gt) for tolerations
433+
// TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Gt) for tolerations
393434
TaintTolerationComparisonOperators featuregate.Feature = "TaintTolerationComparisonOperators"
394435
)
395436
@@ -411,7 +452,7 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie
411452
412453
// New: Validate numeric operators (feature-gated)
413454
switch toleration.Operator {
414-
case core.TolerationOpLt, core.TolerationOpLe, core.TolerationOpGe, core.TolerationOpGt:
455+
case core.TolerationOpLt, core.TolerationOpGt:
415456
if !utilfeature.DefaultFeatureGate.Enabled(features.TaintTolerationComparisonOperators) {
416457
allErrors = append(allErrors, field.Invalid(idxPath.Child("operator"),
417458
toleration.Operator, "numeric operators require TaintTolerationComparisonOperators feature gate"))
@@ -438,7 +479,7 @@ func (t *Toleration) ToleratesTaint(taint *Taint) bool {
438479
439480
switch t.Operator {
440481
// ...
441-
case TolerationOpLt, TolerationOpLe, TolerationOpGe, TolerationOpGt:
482+
case TolerationOpLt, TolerationOpGt:
442483
// Feature gate check is not needed here as validation already handles it
443484
return compareNumericValues(t.Value, taint.Value, t.Operator)
444485
default:
@@ -460,10 +501,6 @@ func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator)
460501
switch op {
461502
case TolerationOpLt:
462503
return tVal < nVal
463-
case TolerationOpLe:
464-
return tVal <= nVal
465-
case TolerationOpGe:
466-
return tVal >= nVal
467504
case TolerationOpGt:
468505
return tVal > nVal
469506
default:
@@ -646,13 +683,14 @@ in back-to-back releases.
646683

647684
- Feature implemented behind `TaintTolerationComparisonOperators` feature gate (disabled by default)
648685
- API validation for numeric operators in place
649-
- Taint/toleration matching logic supports `Lt`, `Le`, `Ge`, `Gt` operators
686+
- Taint/toleration matching logic supports `Lt`, `Gt` operators
650687

651688
#### Beta
652689

653690
- Feature enabled by default
654691
- Feedback collected from early adopters in SIG-Scheduling
655692
- Performance testing shows that there is no significant scheduler latency increase nor memory usage increase.
693+
- Implement feature for DRA APIs
656694
- Stress testing with:
657695
- 1000+ nodes with numeric taints
658696
- 10,000+ pods with numeric tolerations
@@ -853,7 +891,7 @@ logs or events for this purpose.
853891

854892
```bash
855893
# Check for pods with numeric toleration operators
856-
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Ge")]}{"\n"}{end}' | grep -v "^[^:]*: *$"
894+
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Gt")]}{"\n"}{end}' | grep -v "^[^:]*: *$"
857895
858896
# Count nodes with numeric taints (SLA example)
859897
kubectl get nodes -o jsonpath='{range .items[*]}{.spec.taints[?(@.key=="node.kubernetes.io/sla")]}{"\n"}{end}' | wc -l
@@ -1089,18 +1127,15 @@ Why should this KEP _not_ be implemented?
10891127

10901128
Several alternatives were considered:
10911129

1092-
1. **Extend NodeAffinity with Numeric Operators:** Add Lt, Le, Ge, Gt to `NodeSelectorOperator` instead.
1093-
- **Pros:** `NodeAffinity` already supports `Gt`/`Lt` operators
1094-
- **Cons:** No eviction semantics, per-pod configuration (no cluster defaults), doesn't solve the operational model problem.
1095-
2. **New Dedicated SLA API Resource:** Create `SLAPolicy` CRD
1130+
1. **New Dedicated SLA API Resource:** Create `SLAPolicy` CRD
10961131
- **Pros:** Clean separation, rich policy definitions.
10971132
- **Cons:** New API surface, additional complexity, breaks unified taint/toleration model.
1098-
3. **Custom Scheduler Plugin:** Use scheduling plugin with SLA-aware logic, [placement-policy-scheduler-plugins](https://github.com/Azure/placement-policy-scheduler-plugins)
1133+
2. **Custom Scheduler Plugin:** Use scheduling plugin with SLA-aware logic, [placement-policy-scheduler-plugins](https://github.com/Azure/placement-policy-scheduler-plugins)
10991134
- **Pros:** Full scheduling control, rich logic possible
11001135
- **Cons:**
11011136
- Out-of-tree scheduler plugin to maintain and manage
11021137
- Doesn't leverage existing taint/toleration infrastructure.
1103-
4. **Node Labels + Enhanced NodeAffinity:** Use labels instead of taints, extend NodeAffinity matching.
1138+
3. **Node Labels + Enhanced NodeAffinity:** Use labels instead of taints, extend NodeAffinity matching.
11041139
- **Pros:** Leverages existing label system.
11051140
- **Cons:**
11061141
- No default push-back behavior
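
With `Le` and `Ge` removed, the two remaining operators are strict, so a toleration whose value equals the taint value matches under neither `Lt` nor `Gt`. The standalone Go sketch below illustrates this using the semantics listed in the README hunk (`Lt`: toleration.value < taint.value, `Gt`: toleration.value > taint.value). The parsing step is an assumption, since the diff does not show it: the sketch uses `strconv.ParseInt` with base 10 and 64 bits and treats any parse failure as "no match". The type and constants are redeclared locally so the program compiles without Kubernetes packages.

```go
package main

import (
	"fmt"
	"strconv"
)

// TolerationOperator is redeclared here so the sketch is self-contained;
// in the KEP it lives in core/v1.
type TolerationOperator string

const (
	TolerationOpLt TolerationOperator = "Lt" // Less than
	TolerationOpGt TolerationOperator = "Gt" // Greater than
)

// compareNumericValues follows the comparison shown in the diff:
// Lt matches when toleration < taint, Gt when toleration > taint.
// Assumption: values that are not signed 64-bit integers never match.
func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator) bool {
	tVal, err := strconv.ParseInt(tolerationVal, 10, 64)
	if err != nil {
		return false
	}
	nVal, err := strconv.ParseInt(taintVal, 10, 64)
	if err != nil {
		return false
	}
	switch op {
	case TolerationOpLt:
		return tVal < nVal
	case TolerationOpGt:
		return tVal > nVal
	default:
		return false
	}
}

func main() {
	// Strict comparison: toleration 950 against taint 900.
	fmt.Println(compareNumericValues("950", "900", TolerationOpGt))
	// Equal values match under neither operator now that Le/Ge are gone.
	fmt.Println(compareNumericValues("950", "950", TolerationOpGt))
	fmt.Println(compareNumericValues("950", "950", TolerationOpLt))
	// A decimal taint value is a valid taint but cannot be compared numerically.
	fmt.Println(compareNumericValues("24", "95.5", TolerationOpLt))
}
```

Because boundary values no longer match, a threshold that was previously written with `Ge`/`Le` needs its value shifted by one to keep the same inclusive behavior.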

keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml

Lines changed: 7 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -6,12 +6,13 @@ owning-sig: sig-scheduling
66
participating-sigs:
77
- sig-node
88
status: provisional
9-
creation-date: 2025-08-11
9+
creation-date: 2025-08-08
1010
reviewers:
11-
- TBD
11+
- "@SergeyKanzhelev"
1212
approvers:
13-
- TBD
14-
- "@oscar.doe"
13+
- "@macsko"
14+
- "@dom4ha"
15+
- "@sanposhiho"
1516

1617
# The target maturity stage in the current dev cycle for this KEP.
1718
# If the purpose of this KEP is to deprecate a user-visible feature
@@ -38,6 +39,5 @@ disable-supported: true
3839

3940
# The following PRR answers are required at beta release
4041
metrics:
41-
- kube_pod_numeric_tolerations_total{operator="Ge|Le|Gt|Lt"}
42-
- scheduler_failed_scheduling_attempts_total{reason="numeric_taint_mismatch"}
43-
- scheduler_framework_extension_point_duration_seconds{plugin="TaintToleration"}
42+
- scheduler_numeric_tolerations_total{operator="Gt|Lt"}
43+
- scheduler_numeric_taint_mismatches_total
