Commit d236230

Initial commit
1 parent 5dba9e0 commit d236230

1 file changed

+170
-0
lines changed
  • docs/proposals/1199-inferencemodel-api-evolution

@@ -0,0 +1,170 @@
# Scheduling Subsystem Architecture

Author(s): @kfswain, @ahg-g, @lukeavandrie
## Proposal Status
***Draft***

## Summary
Multiple docs have discussed restructuring the InferenceModel API. This [doc](https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo) proposes an InferenceSchedulingObjective CRD, and this [doc](https://docs.google.com/document/d/1G-CQ17CM4j1vNE3T6u9uP2q-m6jK14ANPCwTfJ2qLS4/edit?tab=t.0) builds on it to solidify the requirement that the next iteration of the InferenceModel API continue to solve the identity problem. Both documents were useful for gathering feedback and iterating toward a proper solution.

This proposal is intended to act as the plan of record for the solution that will be implemented.

## Implementation Phases

### Phase 1 - Rename, Split, & Modify InferenceModel
Several points informed the justification for and structure of this change:
- The Criticality field of InferenceModel is in use and provides functionality
- InferenceModel is an Alpha API
- InferenceModel is not depended upon by upstream or downstream components

Phase 1 will retain the Criticality functionality, but will rename the InferenceModel API and slim down the spec. Additionally, the slimmed-down spec can be applied at a _per request_ level. Justification is in [Phase 1](#phase-1).

### Phase 2 - Introduce new Usage Tracking, Fairness, & SLO CRDs
Phase 2 will happen over a longer period of time and will gradually introduce new CRDs to Inference Gateway. Much of this proposal is written with Phase 2 in mind, but Phase 2 should be considered experimental and subject to change.

Phase 2 will primarily introduce these CRDs:
- Usage tracking (used in fairness)
- Fairness configuration
- SLO configuration

## Design Principles

### Goals
- Reliable and predictable fairness allocation
- Disconnect identity from policy-like objects where possible
- Anonymous identity/defaults are graceful (fault-tolerant) & unsurprising
- Scalable, simple, and reusable config
- Retain the functionality of InferenceModel
    - Traffic splitting models & modelName rewrite
    - Criticality

### Non-Goals
- Addressing security concerns with the API. Security is currently expected to be either:
    - Entirely contained within a trusted system
    - Or handled by upstream auth
- IGW implementing a custom auth mechanism


## Definitions

- **Tenant** Kubernetes uses the term ***tenant*** as described [here](https://kubernetes.io/docs/concepts/security/multi-tenancy/#tenants). Fairness APIs _may_ be used in multi-tenant scenarios, so multi-tenancy is used as an example in this proposal.

# Proposal

Discussion of the problem(s) can be seen in the linked documents. Here we will describe the new API surface.

## Phase 1

### Structure change
This API addresses 3 general pillars of the problem, which can be grouped into 2 areas:

Higher-order Request Grouping (Usage tracking):
- This API describes Resource Sharing (Criticality/Fairness)
- This API describes Identification (used in Fairness)

Request specific objectives:
- This API describes Specific Request Policy (SLO/Criticality)


As such, the InferenceModel API will be split into separate CRDs to reflect the difference in these scopes. Phase 1 will focus on the **Request specific objectives**. Specifically, it will continue to include criticality. Other Phase 1 changes:

- The EPP will expose a flag to define the header key used to assign InferenceObjectives to a request
- The EPP will expose a flag to define the header key used for tracking Request Usage (which will act as the identifier for a simple fairness implementation)
- The modelName rewrite functionality will be included in the EPP as a core feature (also driven by a header). **NOTE**: _Relying on this feature to write a proper model name disables the ability to use the fail-open feature_
- Continue to support traffic splitting across models, although not necessarily via GIE CRDs directly (e.g., delegated to GW API/HTTPRoute) - example [here](https://docs.google.com/document/d/1s4U4T_cjQkk4UeIDyAJl2Ox6FZoBigXBXn9Ai0qV7As/edit?tab=t.0#heading=h.bkttj79mzxlz)
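
The two header-key flags described in the list above could be consumed roughly as follows. This is a minimal sketch, not the EPP's implementation: the flag names (`objective-header-key`, `usage-header-key`), the default header names, and the helpers `parseHeaderConfig` and `requestKeys` are all hypothetical. Only the idea of configurable header keys comes from the proposal.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
)

// headerConfig holds the header keys the EPP reads from each request.
// The field names, flag names, and defaults are hypothetical.
type headerConfig struct {
	ObjectiveKey string // header that names the InferenceObjectives to apply
	UsageKey     string // header that carries the usage-tracking identifier
}

// parseHeaderConfig reads the two header-key flags from args.
func parseHeaderConfig(args []string) (headerConfig, error) {
	var cfg headerConfig
	fs := flag.NewFlagSet("epp", flag.ContinueOnError)
	fs.StringVar(&cfg.ObjectiveKey, "objective-header-key", "x-inference-objective",
		"header key used to assign InferenceObjectives (hypothetical flag)")
	fs.StringVar(&cfg.UsageKey, "usage-header-key", "x-usage-id",
		"header key used for usage tracking (hypothetical flag)")
	err := fs.Parse(args)
	return cfg, err
}

// requestKeys returns the objective name and usage ID carried by the request.
// A missing header yields an empty string so the caller can apply defaults.
func requestKeys(cfg headerConfig, h http.Header) (objective, usageID string) {
	return h.Get(cfg.ObjectiveKey), h.Get(cfg.UsageKey)
}

func main() {
	cfg, err := parseHeaderConfig([]string{"--usage-header-key=x-tenant"})
	if err != nil {
		panic(err)
	}
	h := http.Header{}
	h.Set("x-inference-objective", "chat-critical")
	h.Set("x-tenant", "team-a")
	objective, usageID := requestKeys(cfg, h)
	fmt.Printf("objective=%q usageID=%q\n", objective, usageID)
}
```

Returning empty strings for missing headers keeps the anonymous/default case in the caller's hands, in line with the goal that defaults be graceful and unsurprising.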

### Naming
The current name planned for the CRD that will house **Request specific objectives** is `InferenceObjectives`.


### CRD spec

This CRD definition is a slimmed-down version of InferenceModel with a name change. Example:

```golang
// InferenceObjectives is the slimmed-down, renamed successor to InferenceModel.
type InferenceObjectives struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec InferenceObjectivesSpec
}

type InferenceObjectivesSpec struct {
	PoolRef InferenceObjectReference

	// This is a departure from InferenceModel, which used a string for criticality.
	// We got quite a bit of feedback around allowing for custom criticality bands,
	// so an int/enum is more flexible & carries inherent stack rank value.
	Criticality *int
}
```
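
To illustrate the "inherent stack rank" that an integer criticality provides, here is a minimal sketch that orders pending requests by criticality. It rests on two assumptions the proposal does not state: a larger integer means more critical, and an unset `Criticality` is treated as `0`. The `pendingRequest` type and helper names are hypothetical, and the spec is trimmed to the one field the sketch needs.

```go
package main

import (
	"fmt"
	"sort"
)

// InferenceObjectivesSpec is trimmed to the one field this sketch needs.
type InferenceObjectivesSpec struct {
	Criticality *int
}

// defaultCriticality is an assumed value for specs with no criticality set.
const defaultCriticality = 0

// effectiveCriticality resolves the optional field to a concrete int.
func effectiveCriticality(spec InferenceObjectivesSpec) int {
	if spec.Criticality == nil {
		return defaultCriticality
	}
	return *spec.Criticality
}

// pendingRequest is a hypothetical request paired with its objectives.
type pendingRequest struct {
	Name string
	Spec InferenceObjectivesSpec
}

// rankByCriticality sorts requests from most to least critical, assuming a
// larger value means more critical. The sort is stable, so requests with
// equal criticality keep their arrival order.
func rankByCriticality(reqs []pendingRequest) {
	sort.SliceStable(reqs, func(i, j int) bool {
		return effectiveCriticality(reqs[i].Spec) > effectiveCriticality(reqs[j].Spec)
	})
}

func intPtr(v int) *int { return &v }

func main() {
	reqs := []pendingRequest{
		{Name: "batch", Spec: InferenceObjectivesSpec{Criticality: intPtr(-5)}},
		{Name: "unset", Spec: InferenceObjectivesSpec{}},
		{Name: "chat", Spec: InferenceObjectivesSpec{Criticality: intPtr(10)}},
	}
	rankByCriticality(reqs)
	for _, r := range reqs {
		fmt.Println(r.Name, effectiveCriticality(r.Spec))
	}
}
```

Because the values are plain integers, operators can define as many bands as they need without a schema change, which a fixed string enum would not allow.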

## Phase 2 - SUBJECT TO CHANGE

***NOTE: The name `InferenceUsageMeter` is a placeholder***

### CRD spec
```golang
type InferenceUsageMeter struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec InferenceUsageMeterSpec
}

type InferenceUsageMeterSpec struct {
	// Optional field that defaults to the kube object name if not included.
	ID      *string
	PoolRef InferenceObjectReference

	// One of; this allows for embedded configuration or a reference to a commonly used config.
	UsageLimits    *NotYetDefinedFairnessCRD
	UsageLimitsRef *InferenceObjectReference
}

type InferenceObjectives struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec InferenceObjectivesSpec
}

type InferenceObjectivesSpec struct {
	PoolRef InferenceObjectReference

	// This is a departure from InferenceModel, which used a string for criticality.
	// We got quite a bit of feedback around allowing for custom criticality bands,
	// so an int/enum is more flexible & carries inherent stack rank value.
	Criticality *int

	// Doc on SLO CRD here: https://docs.google.com/document/d/1j2KRAT68_FYxq1iVzG0xVL-DHQhGVUZBqiM22Hd_0hc/edit?resourcekey=0-5cSovS8QcRQNYXj0_kRMiw&tab=t.0#heading=h.emkaixupvf39
	PerformanceObjectives    NotYetDefinedSLOCRD
	PerformanceObjectivesRef *InferenceObjectReference
}
```
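
The "one of" comment on `UsageLimits` / `UsageLimitsRef` implies a validation step. A minimal sketch of that check follows. The types `fairnessConfig` and `objectReference` are stand-ins for the not-yet-defined CRDs, and the choice to accept a spec with neither field set is an assumption; requiring exactly one would be an equally plausible reading.

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder types standing in for the not-yet-defined CRDs.
type fairnessConfig struct{ Weight int }
type objectReference struct{ Name string }

// usageMeterSpec mirrors only the two "one of" fields of the spec.
type usageMeterSpec struct {
	UsageLimits    *fairnessConfig
	UsageLimitsRef *objectReference
}

var errBothSet = errors.New("usageLimits and usageLimitsRef are mutually exclusive")

// validateUsageLimits rejects a spec that sets both the embedded config and
// the reference. Leaving both unset is allowed here, which is an assumption.
func validateUsageLimits(s usageMeterSpec) error {
	if s.UsageLimits != nil && s.UsageLimitsRef != nil {
		return errBothSet
	}
	return nil
}

func main() {
	embedded := usageMeterSpec{UsageLimits: &fairnessConfig{Weight: 2}}
	both := usageMeterSpec{
		UsageLimits:    &fairnessConfig{Weight: 2},
		UsageLimitsRef: &objectReference{Name: "shared-limits"},
	}
	fmt.Println(validateUsageLimits(embedded))
	fmt.Println(validateUsageLimits(both))
}
```

The same pattern would apply to `PerformanceObjectives` / `PerformanceObjectivesRef` once the SLO CRD is defined.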

### Intent

The purposes of the `InferenceUsageMeter` are to:
- Create a strong concept of usage tracking within the inference pool, used to associate groups of requests for the purpose of Flow Control (Fairness), which can enforce:
    - Fair resource sharing
    - Inter-tenant prioritization
    - SLO attainment
- Detach identification from the modelName field
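
As a rough illustration of how usage tracking can feed fair resource sharing, the sketch below accumulates usage per ID and serves the waiting ID with the least recorded usage first. This is an illustrative policy only, not the Flow Control design: the `usageTracker` type, the abstract integer cost, and the least-usage-first rule are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// usageTracker accumulates a usage total per ID. What counts as usage
// (requests, tokens, time) is left open; this sketch uses an abstract cost.
type usageTracker struct {
	served map[string]int
}

func newUsageTracker() *usageTracker {
	return &usageTracker{served: map[string]int{}}
}

// record adds cost to the running total for id.
func (u *usageTracker) record(id string, cost int) {
	u.served[id] += cost
}

// nextID picks, among IDs that have requests waiting, the one with the least
// recorded usage. Ties are broken lexicographically so the result is
// deterministic. It reports false when nothing is waiting.
func (u *usageTracker) nextID(waiting []string) (string, bool) {
	if len(waiting) == 0 {
		return "", false
	}
	ids := append([]string(nil), waiting...) // copy so the caller's slice is untouched
	sort.Strings(ids)
	best := ids[0]
	for _, id := range ids[1:] {
		if u.served[id] < u.served[best] {
			best = id
		}
	}
	return best, true
}

func main() {
	u := newUsageTracker()
	u.record("team-a", 5)
	u.record("team-b", 2)
	id, _ := u.nextID([]string{"team-a", "team-b"})
	fmt.Println("serve next:", id)
}
```

Keying the tracker on an ID rather than on modelName is what lets the same model be shared by several tenants while each is accounted for separately.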

## Design points
This section discusses specific choices made in the API design.

### Identification
**Note**: The ID field would default to the kube name.

The only field associated with identification is the `ID` field. An optional ID field was chosen (rather than strictly using the metadata name) because:
- A user may not want to put the same restrictions on the ID that are enforced on a kube resource name
- The ID name may be duplicated across different pools
    - This could also be solved by allowing the UsageMeter & Objectives to reference multiple pools
- Use of a kube-generated name would force an upstream Auth mechanism to be aware of the `InferenceObjectives` API

***Discussion point***: In order to support a high volume of tenants, we could allow IGW to accept unique IDs that do not have an explicit InferenceUsageMeter object defined, and instead use a default fairness configuration. **Feedback here requested.**
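
The identification rules and the discussion point can be sketched together: the ID defaults to the object name, lookups are scoped per pool so the same ID may appear in different pools, and an ID with no explicit meter falls back to a default fairness configuration. All type and function names here (`usageMeter`, `meterIndex`, `fairnessConfig`) are hypothetical, and the fallback behavior is the option under discussion, not a settled design.

```go
package main

import "fmt"

// fairnessConfig is a placeholder for the not-yet-defined fairness CRD.
type fairnessConfig struct{ Weight int }

// usageMeter is a flattened, hypothetical view of an InferenceUsageMeter.
type usageMeter struct {
	Name     string  // metadata.name of the kube object
	Pool     string  // name of the referenced pool
	ID       *string // optional spec ID
	Fairness fairnessConfig
}

// effectiveID returns the explicit ID if set, otherwise the object name.
func effectiveID(m usageMeter) string {
	if m.ID != nil {
		return *m.ID
	}
	return m.Name
}

// meterKey scopes an ID to a pool so IDs may repeat across pools.
type meterKey struct{ Pool, ID string }

type meterIndex struct {
	byKey    map[meterKey]fairnessConfig
	fallback fairnessConfig
}

func newMeterIndex(fallback fairnessConfig, meters ...usageMeter) meterIndex {
	ix := meterIndex{byKey: map[meterKey]fairnessConfig{}, fallback: fallback}
	for _, m := range meters {
		ix.byKey[meterKey{Pool: m.Pool, ID: effectiveID(m)}] = m.Fairness
	}
	return ix
}

// lookup returns the config for (pool, id). When no meter is defined it
// returns the default config and explicit=false.
func (ix meterIndex) lookup(pool, id string) (cfg fairnessConfig, explicit bool) {
	if c, ok := ix.byKey[meterKey{Pool: pool, ID: id}]; ok {
		return c, true
	}
	return ix.fallback, false
}

func main() {
	id := "tenant-1"
	ix := newMeterIndex(fairnessConfig{Weight: 1},
		usageMeter{Name: "meter-a", Pool: "pool-1", ID: &id, Fairness: fairnessConfig{Weight: 4}},
		usageMeter{Name: "meter-b", Pool: "pool-2", Fairness: fairnessConfig{Weight: 2}},
	)
	fmt.Println(ix.lookup("pool-1", "tenant-1"))
	fmt.Println(ix.lookup("pool-2", "tenant-1"))
}
```

Returning the `explicit` flag alongside the config lets callers distinguish configured tenants from anonymous ones, for example when emitting metrics.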

#### Alternative consideration(s)
- Making the PoolRef field plural was considered, but was not selected in order to maintain simplicity. This decision can be revisited in the future.
