-[User Confusion Between String and Numeric Semantics](#user-confusion-between-string-and-numeric-semantics)
-[API Compatibility and Version Skew](#api-compatibility-and-version-skew)
-[Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing)
-[Cross-SIG Impact](#cross-sig-impact)
@@ -73,7 +73,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
Extend **core/v1 Toleration** to support **numeric comparison operators** when matching **Node Taints**:
-
- New operators: `Lt`, `Le`, `Ge`, `Gt` (in addition to existing `Equal`/`Exists`).
+
- New operators: `Lt`, `Gt` (in addition to existing `Equal`/`Exists`).
- Primary motivation: allow pods to opt‑in to nodes by `SLA/failure‑probability` values published as taints (e.g., `node.kubernetes.io/sla=950`).
- Scheduler impact is limited to the existing TaintToleration Filter; no new stages or algorithms.
@@ -96,7 +96,7 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
### Goals
-
- Add comparison operators to Tolerations so pods can match taints like `node.kubernetes.io/sla=<int>` using thresholds.
+
- Add comparison operators to tolerations so pods can match taints like `node.kubernetes.io/sla=<int>` using thresholds.
- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`).
- Backward compatible and opt‑in via a feature gate.
@@ -108,20 +108,20 @@ From a scheduling perspective, adding numeric operators to tolerations only adju
### Benefits for implementing this feature for DRA and AI Workloads
-
In addition to general scheduling improvements, SLA‑aware opt‑in via Tolerations has specific advantages for `Dynamic Resource Allocation` (DRA) and `AI/ML`:
+
In addition to general scheduling improvements, SLA‑aware opt‑in via tolerations has specific advantages for `Dynamic Resource Allocation (DRA)` and `AI/ML`:
-
For DRA, resource claims (e.g., GPUs/accelerators) can be steered by node reliability: critical claims stay on high‑SLA capacity; batch/cheap claims can land on lower‑SLA pools. Taints provide a default drive away from risky pools and `NoExecute` eviction if a pool degrades.
+
- DRA can steer GPU/accelerator resource claims by node reliability: critical workloads get high‑SLA capacity while batch workloads use cheaper pools. Taints keep workloads off risky pools by default, and `NoExecute` evicts them when a pool degrades.
-
For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch preprocessing, fine‑tuning, or embedding generation to spot nodes. When spot nodes are reclaimed, `NoExecute` or `NoSchedule` effects plus tolerations allow graceful drain and controlled failover. In multi‑tenant GPU clusters, taints bound access to the reliable pools (fairness), and during autoscaling bursts, extra replicas can safely land on low‑SLA pools with explicit opt‑in.
+
- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch work to spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover.
-
| Benefit | Impact on DRA | Impact on AI/ML workloads|
|**Cost–reliability optimization**|Bind/keep claims on reliability tiers via taints (+ tolerations to opt-in). | Keep latency-critical inference on high-SLA; shift batch to spot. |
-
|**Stage-aware placement**| Steer per-stage claims to tiers consistently with node policy.| Different stages tolerate different risk; make that explicit via tolerations.|
-
|**Resilience after preemption**| Use `NoExecute`/`tolerationSeconds` for graceful drain; re-admit on stable tiers. | Training/services recover faster with predictable eviction semantics. |
-
|**Multi-tenant fairness**| Avoid monopolization of high-SLA tiers by requiring explicit tolerations.| Fair access to reliable accelerators across teams. |
-
|**Smooth burst handling**| Bursts land on low-SLA pools via opt-in; baseline remains on high-SLA.| HPA can scale to spot with clear safety boundaries. |
-
|**Operational clarity**| Node-side policy is auditable and centralized.| Platform teams can document and enforce reliability classes cleanly.|
+
| Benefit | Impact on DRA | Impact on AI/ML workloads |
|**Cost–reliability trade-off**|Critical workloads stay on premium nodes; batch uses spot| Inference on reliable nodes; training on cheaper pools|
+
|**Workload-aware placement**| Different claim types target appropriate node tiers| Pipeline stages match their reliability requirements |
+
|**Graceful preemption**|`NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads|
+
|**Resource fairness**| Prevents monopolization of premium capacity | Teams share reliable accelerators fairly|
+
|**Elastic scaling**| Bursts overflow to lower-SLA pools safely | HPA scales to spot with clear boundaries |
+
|**Policy transparency**| Node reliability classes are explicit and auditable| Platform teams enforce clear reliability tiers|
## Proposal
@@ -131,7 +131,7 @@ For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on hi
As a cluster operator, I want spot (low-SLA) nodes to repel pods by default so that only workloads that explicitly tolerate them can land there.
-
I also want to set numeric SLA thresholds in tolerations (e.g., `Ge 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.
+
I also want to set numeric SLA thresholds in tolerations (e.g., `Gt 950`) so pods can opt in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules.
#### Story 5 — DRA device-level error budget management
+
+
As a platform engineer managing GPU clusters with varying reliability states, I want to allocate devices based on their remaining error budget using numeric tolerations, so that critical workloads only get devices with sufficient reliability headroom while degraded devices still serve less sensitive workloads.
+
+
This gives critical inference fresh devices (>24h error budget), lets batch training use aging devices (1-24h), and excludes severely degraded devices (<1h) from allocation entirely, enabling graceful device lifecycle management.
# Batch training workload tolerates degraded devices
kind: ResourceClaim
metadata:
  name: training-gpu-claim
spec:
  requests:
  - name: batch-gpu
    deviceClassName: device.example.com
    tolerations:
    # Accept devices with >1 hour error budget
    - key: device.example.com/error-budget-in-hours
      operator: Gt
      value: "1"
      effect: NoSchedule
```
+
### Notes/Constraints/Caveats (Optional)
-
- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Decimal values (e.g., `"95.5"`) will be rejected by API validation when using numeric operators.
+
- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Pod specs containing toleration values with decimal numbers (e.g., `"95.5"`) will be rejected by the API server during validation when using numeric comparison operators.
-
- **Parsing Requirements**: Both taint value and toleration value must be parseable as integers for numeric operators (`Lt`, `Le`, `Ge`, `Gt`). If either fails parsing, the toleration does not match.
+
- **Parsing Requirements**: The toleration value must be parseable as an integer for numeric operators (`Lt`, `Gt`). If parsing fails, the toleration does not match.
+
+
> Note: A taint like `foo=95.5:NoSchedule` is valid since taint values follow label value syntax, which allows decimal points. Numeric parsing/validation is enforced on the toleration **only**.
- **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators.
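
The constraints above could be sketched in Go roughly as follows. This is an illustrative sketch only, not the actual API server validation: `validateToleration` and its `gateEnabled` parameter are hypothetical names, with `gateEnabled` standing in for the `TaintTolerationComparisonOperators` feature gate.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateToleration is a hypothetical helper illustrating the constraints
// described above; it is not the real Kubernetes validation code.
// gateEnabled stands in for the TaintTolerationComparisonOperators gate.
func validateToleration(operator, value string, gateEnabled bool) error {
	switch operator {
	case "Lt", "Gt":
		// New operators are rejected while the gate is off.
		if !gateEnabled {
			return fmt.Errorf("operator %q requires the TaintTolerationComparisonOperators feature gate", operator)
		}
		// Base-10 signed 64-bit integers only: values such as "95.5" fail here.
		if _, err := strconv.ParseInt(value, 10, 64); err != nil {
			return fmt.Errorf("value %q must be a signed 64-bit integer for operator %q: %w", value, operator, err)
		}
		return nil
	case "", "Equal", "Exists":
		// Existing operators keep string semantics, so nothing is parsed.
		return nil
	default:
		return fmt.Errorf("unsupported operator %q", operator)
	}
}
```

Under these assumptions, `validateToleration("Gt", "950", true)` returns `nil`, while `validateToleration("Gt", "95.5", true)` and `validateToleration("Lt", "950", false)` both return an error. Parsing happens only for `Lt` and `Gt`, so existing `Equal`/`Exists` tolerations are unaffected.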
@@ -302,21 +359,9 @@ spec:
**Mitigation**:
- Parse integers only when new operators are used (no impact on existing workloads)
-
- Implement microbenchmarks during development to measure parsing overhead
- Consider caching parsed values in scheduler data structures if performance issues arise
- Feature gate allows disabling if performance problems occur
-
#### User Confusion Between String and Numeric Semantics
-
-
**Risk**: Users might expect numeric comparison with `Equal` operator or string comparison with `Ge` operator, leading to mismatched tolerations.
-
-
**Mitigation**:
-
-
- Clear documentation distinguishing string vs. numeric operators
-
- API validation provides specific error messages for malformed numeric values
-
- Examples in documentation show proper usage patterns
-
- Consider adding warnings/events when numeric values are used with string operators
-
#### API Compatibility and Version Skew
**Risk**: Pods using new operators cannot be scheduled if some schedulers don't support the feature, creating deployment failures during upgrades.
@@ -356,8 +401,6 @@ spec:
Extend `core/v1.Toleration.Operator` to accept, in addition to `Equal` and `Exists`:
- `Lt`: match if toleration.value < taint.value
-
- `Le`: match if toleration.value <= taint.value
-
- `Ge`: match if toleration.value >= taint.value
- `Gt`: match if toleration.value > taint.value
- `Equal`/`Exists`: Remain unchanged
@@ -371,8 +414,6 @@ const (
// New numeric comparison operators (feature-gated)
TolerationOpLt TolerationOperator = "Lt" // Less than
-
TolerationOpLe TolerationOperator = "Le" // Less than or equal
-
TolerationOpGe TolerationOperator = "Ge" // Greater than or equal
TolerationOpGt TolerationOperator = "Gt" // Greater than
)
```
@@ -389,7 +430,7 @@ To honor Kubernetes APIs that avoids floating-point numbers where possible due t
```go
const (
-
// TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Le, Ge, Gt) for tolerations
+
// TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Gt) for tolerations