Commit ba8dbf3 (1 parent: 68c73c0)

Initial proposal for InferenceSchedulingObjective

1 file changed: docs/proposals/1001-inference-scheduling-objective (+257, -0 lines)

# Inference Scheduling Objective

Author(s): @ahg-g, @kfswain

## Proposal Status

***Draft***

[model]: https://platform.openai.com/docs/api-reference/chat/create#chat-create-model
## Summary

The [InferenceModel](../002-api-proposal/README.md#inferencemodel) API has been found to have a few key issues:

- The name is misleading and not indicative of the API's purpose.
- The [model] parameter is the only matching rule.
- An InferenceModel is a mandatory inclusion to serve requests to the pool.

To address these pain points, we propose restructuring `InferenceModel` into `InferenceSchedulingObjective` (ISO).

Original discussion doc: https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.geklwdrtzbph
## Goals

- Update the name to something more descriptive and agreed upon.
- Broaden matching from the [model] parameter alone to a more general system.

## Non-Goals

- Create a generic, multi-applicable 'policy' field.
  - Future iterations may make this field more policy-like, but currently the ISO is still the identifier for fairness calculation, and so acts as the primary key for fairness budget allocation.
## Proposal

`InferenceSchedulingObjective` is focused on defining scheduling objectives for a matching request flow. Concretely, these changes will:

- Drop the `InferenceModel` API; the InferencePool API is unchanged.
- Replace it with `InferenceSchedulingObjective`. The API will define `scheduling objectives` for matching requests. The inference-scheduler (run by the EPP) will be the primary actuator of this API.
  - We will keep the Criticality field, with the intent to add more serving objectives (e.g., latency SLOs).
- Extend request matching beyond model name to include headers. This allows defining different scheduling policies for different request flows (apps or users) while targeting the same model. The semantics should be adopted from HTTPRouteMatch as defined in HTTPRouteRules.
- Include a default InferenceSchedulingObjective per InferencePool as a fallback policy when no objective matches the request. These defaults can be adjusted.
- Remove traffic splitting from the InferenceSchedulingObjective API. Traffic splitting is not an endpoint scheduling objective; it is a request routing objective. As we describe below, with some creativity, we can offload traffic splitting to HTTPRoute.
  - An intended side effect of this is that users will more easily be able to define different scheduling policies for the same target model, something that required shenanigans with the current API (two InferenceModels with distinct `modelNames`, both pointing to the same target model).
```golang
type InferenceSchedulingObjectiveSpec struct {
	// Match defines which requests this objective applies to.
	Match

	// Criticality defines how important it is to serve the requests that match this objective
	// compared to requests that match other objectives.
	// Criticality impacts how traffic is handled in resource-constrained situations, by
	// queuing or rejecting requests of lower criticality. Objectives of an equivalent Criticality
	// will fairly share resources over throughput of tokens. In the future, the metric used to
	// calculate fairness, and the proportionality of fairness, will be configurable.
	//
	// No default value is set for this field, to allow for the future addition of a new field
	// that may be 'one of' with this field.
	// Any implementation that consumes this field may treat an unset value as the
	// 'Standard' range.
	Criticality *Criticality

	// Future scheduling objectives, like SLOs.

	// PoolRef is a reference to the inference pool; the pool must exist in the same namespace.
	PoolRef PoolObjectReference
}

type Match struct {
	// Only one of the following can be set.

	// HTTPMatches is a list of HTTP request matchers; the matchers are ORed.
	HTTPMatches []HTTPMatch
	// GRPCMatches is a list of gRPC request matchers; the matchers are ORed.
	GRPCMatches []GRPCMatch
}

// HTTPMatch is an HTTP matching rule. The rules are ANDed.
type HTTPMatch struct {
	// ModelName matches against the model name in the body as per the OpenAI protocol.
	ModelName *string
	// Headers specifies HTTP request header matchers.
	Headers []HTTPHeaderMatch // mostly as defined in the Gateway API, likely a more limited
	// version of it that only supports exact header matching.
}

// GRPCMatch is a gRPC matching rule. The rules are ANDed.
type GRPCMatch struct {
	// ModelName matches against the model name in the body as per the OpenAI protocol.
	ModelName *string
	// Headers specifies gRPC request header matchers.
	Headers []GRPCHeaderMatch // mostly as defined in the Gateway API, likely a more limited
	// version of it that only supports exact header matching.
}
```
## API: Before and After

### Default Policy

#### Before

Not possible today, but could be done if we define a catch-all modelName expression:
```yaml
kind: InferenceModel
metadata:
  name: default
spec:
  modelName: "*"
  criticality: Standard
  poolRef:
    name: gemma-pool
```
#### After

```yaml
kind: InferenceSchedulingObjective
metadata:
  name: default
spec:
  criticality: Standard
  poolRef:
    name: gemma-pool
```
### Separate scheduling objectives for the same target model

#### Before

Possible, requires multiple entries:
```yaml
kind: InferenceModel
metadata:
  name: llama4-prod
spec:
  modelName: llama4-prod
  targetModels:
  - name: llama4
  criticality: Critical
  poolRef:
    name: gemma-pool
---
kind: InferenceModel
metadata:
  name: llama4-dev
spec:
  modelName: llama4-dev
  targetModels:
  - name: llama4
  criticality: Sheddable
  poolRef:
    name: gemma-pool
```
#### After

Possible, requires multiple entries:
```yaml
kind: InferenceSchedulingObjective
metadata:
  name: critical-llama4
spec:
  httpMatches:
  - modelName: llama4
    headers:
    - name: "app"
      value: "prod"
  criticality: Critical
  poolRef:
    name: llama4-pool
---
kind: InferenceSchedulingObjective
metadata:
  name: sheddable-llama4
spec:
  httpMatches:
  - modelName: llama4
    headers:
    - name: "app"
      value: "dev"
  criticality: Sheddable
  poolRef:
    name: llama4-pool
```
### Traffic Splitting

#### Before

The EPP handles model rewrite and splitting/weighting:
```yaml
kind: InferenceModel
metadata:
  name: llama4
spec:
  modelName: llama4-prod
  targetModels:
  - name: llama4
    weight: 10
  - name: llama42
    weight: 50
  criticality: Critical
  poolRef:
    name: gemma-pool
```
#### After

Traffic splitting is offloaded to HTTPRoute; the EPP is extended to honor a model-name override on the `X-Gateway-Model-Name` header. An added benefit is that splitting across pools happens in the same place:
```yaml
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: my-route
spec:
  parentRefs:
  - name: my-inference-gateway
  rules:
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: food-review
    backendRefs:
    - name: vllm-llama3-8b-instruct
      kind: InferencePool
      group: inference.networking.x-k8s.io
      weight: 90
      filters:
      - type: RequestHeaderModifier
        requestHeaderModifier:
          set:
          - name: X-Gateway-Model-Name
            value: food-review-v1
    - name: vllm-llama3-8b-instruct
      kind: InferencePool
      group: inference.networking.x-k8s.io
      weight: 10
      filters:
      - type: RequestHeaderModifier
        requestHeaderModifier:
          set:
          - name: X-Gateway-Model-Name
            value: food-review-v2
```
### Open Questions

- How might `Match` conflict/converge with HTTPRoute?
- Is it easier to make changes piecewise? (We are currently renaming, adjusting how matching works, and offloading traffic splitting to HTTPRoute.)
- Should we split `Match` into its own CRD (named something like `InferenceWorkload`) that can be used for fairness budget tracking/workload affiliation, and then translate the ISO into a more objective, policy-like object that the `Match` CRD subscribes to, reducing duplicate config?
