# Scheduling Subsystem Architecture

Author(s): @kfswain, @ahg-g, @lukeavandrie

## Proposal Status
***Draft***

## Summary
Multiple docs have discussed restructuring the InferenceModel API. This [doc](https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo) proposes an InferenceSchedulingObjective CRD, and this [doc](https://docs.google.com/document/d/1G-CQ17CM4j1vNE3T6u9uP2q-m6jK14ANPCwTfJ2qLS4/edit?tab=t.0) builds on the previous document to solidify the requirement that the new iteration of the InferenceModel API continue to solve the identity problem. Both of these documents were useful in gathering feedback & iterating toward a proper solution.

This proposal is intended to act as the plan of record for the solution that will be implemented.

## Implementation Phases

### Phase 1 - Rename & Modify InferenceModel
Due to these facts:
- the Criticality field of InferenceModel is in use & provides functionality
- InferenceModel is an Alpha API
- InferenceModel is not depended upon by upstream or downstream components

Phase 1 will retain the Criticality functionality, but will rename the InferenceModel API and slim down its spec.

### Phase 2 - Introduce new Policy fields
Phase 2 will happen over a longer period of time & gradually introduce new policies to the API; much of what is discussed in this proposal is written with Phase 2 in mind.
Primarily, Phase 2 will introduce these policies:
- Fairness
- SLO

Because of the behavior these policies add & require in order to function correctly, the naming of the API must account for this second phase.

## Design Principles

### Goals
- Reliable and predictable fairness allocation
- Disconnect identity from policy-like objects where possible
- Anonymous identity/defaults are graceful (fault-tolerant) & unsurprising
- Scalable, simple, and reusable config
- Retain the functionality of InferenceModel
  - Traffic splitting models & modelName rewrite
  - Criticality

### Non-Goals
- Addressing security concerns with the API; security is currently expected to be either:
  - Entirely contained within a trusted system
  - Or handled by auth upstream
- IGW implementing a custom auth mechanism

## Definitions

- **Tenant** (synonymous with: *Flow* or *Identity*) - In the context of Inference Gateway these names are synonymous. Kubernetes uses the term ***tenant***, as described [here](https://kubernetes.io/docs/concepts/security/multi-tenancy/#tenants).

# Proposal

Discussion of the problem(s) can be found in the linked documents. Here we describe the new API surface.

## Phase 1

### Naming
This API addresses 3 general pillars of problems, which can also be categorized into 2 areas:

- This API describes Resource Sharing (Criticality/Fairness)
- This API describes Identification (used in Fairness)
- This API describes Specific Request Policy (SLO)

As such, the name of the API should convey these concepts well. The `Inference-` prefix will remain, as it is a succinct & accepted term related to generative AI serving.

The accompanying name should also convey some or all of the other pillars. Some of the names that have been considered:

- `Tenant`
  - Pros:
    - Resource Sharing is implicit due to evocation of the 'multitenancy' concept
    - Identification is well understood from a term like Tenant
  - Cons:
    - Due to the prevalence of the 'multitenancy' concept, the term 'tenant' may clash with a user-defined notion of 'tenant', causing confusing interactions
    - Does not convey Specific Request Policy well

- `Flow`
  - Pros:
    - A common networking term, helpful in describing request traffic rate control
  - Cons:
    - Does not describe Identification or Resource Sharing well

- `Objectives` (preferred)
  - Pros:
    - Specific Request Policy is easily understood
    - Resource Sharing is a natural fit
  - Cons:
    - Identity is not well conveyed

### CRD spec

This CRD definition is a slimmed-down version of InferenceModel with a new name. For example:

```golang
// InferenceObjectives is the renamed, slimmed-down InferenceModel.
type InferenceObjectives struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec InferenceObjectivesSpec
}

type InferenceObjectivesSpec struct {
	// PoolRef binds this object to an InferencePool.
	// InferenceObjectReference is defined in the Phase 2 spec below.
	PoolRef InferenceObjectReference

	// Criticality is an integer rather than InferenceModel's string form.
	// It is a pointer, so it can be left unset.
	Criticality *int
}
```
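
Because Phase 1 keeps Criticality but changes it from InferenceModel's string form to an integer, existing objects need a mapping during migration. The sketch below is hypothetical: it assumes InferenceModel's string values were `Critical`, `Standard`, and `Sheddable`, and both the integer bands and the convention that a larger value means more critical are illustrative choices, not values defined by this proposal. `convertCriticality` is a made-up helper name.

```golang
package main

import "fmt"

// Hypothetical integer bands; a larger value means more critical.
// These numbers are illustrative only and are not defined by the proposal.
var legacyCriticality = map[string]int{
	"Critical":  2,
	"Standard":  1,
	"Sheddable": 0,
}

// convertCriticality maps an InferenceModel-style string criticality to the
// integer form used by InferenceObjectivesSpec. An empty string yields nil
// (field left unset); an unrecognized value is an error.
func convertCriticality(legacy string) (*int, error) {
	if legacy == "" {
		return nil, nil
	}
	v, ok := legacyCriticality[legacy]
	if !ok {
		return nil, fmt.Errorf("unknown criticality %q", legacy)
	}
	return &v, nil
}

func main() {
	for _, s := range []string{"Critical", "Sheddable", ""} {
		c, err := convertCriticality(s)
		if err != nil {
			fmt.Println(err)
			continue
		}
		if c == nil {
			fmt.Printf("%q -> unset\n", s)
		} else {
			fmt.Printf("%q -> %d\n", s, *c)
		}
	}
}
```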

### Other changes
- The EPP will expose a flag that defines the header key used to assign an InferenceObjectives to a request
- The modelName rewrite functionality will be built into the EPP as a core feature (also driven by a header)
- Traffic splitting on model name will move to HTTPRoute, using the BBR tool to extract the modelName into a header
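
A minimal sketch of how the EPP might resolve an objective from the configured header, assuming header names arrive lowercased (as in HTTP/2) and that, in Phase 1, the header value is the object's kube name. The header key `x-inference-objective`, the `Objective` type, and `resolveObjective` are hypothetical; the real key is whatever the EPP flag is set to, and falling back to a default objective mirrors the anonymous-ID behavior described under Identification.

```golang
package main

import (
	"fmt"
	"strings"
)

// Objective is a hypothetical, simplified stand-in for a resolved
// InferenceObjectives object.
type Objective struct {
	Name        string
	Criticality int
}

// resolveObjective reads the configured header key (matched against
// lowercased header names) and returns the matching Objective, or def when
// the header is missing, empty, or names no known object.
func resolveObjective(headers map[string]string, headerKey string, byName map[string]Objective, def Objective) Objective {
	name, present := headers[strings.ToLower(headerKey)]
	if !present || name == "" {
		return def
	}
	if obj, ok := byName[name]; ok {
		return obj
	}
	return def
}

func main() {
	// Hypothetical header key; in practice it comes from the EPP flag.
	const headerKey = "X-Inference-Objective"
	objectives := map[string]Objective{
		"team-a-chat": {Name: "team-a-chat", Criticality: 2},
	}
	def := Objective{Name: "default", Criticality: 1}

	withHeader := map[string]string{"x-inference-objective": "team-a-chat"}
	fmt.Println(resolveObjective(withHeader, headerKey, objectives, def).Name)
	fmt.Println(resolveObjective(map[string]string{}, headerKey, objectives, def).Name)
}
```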

## Phase 2

### CRD spec
```golang
type InferenceObjectives struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec InferenceObjectivesSpec
}

type InferenceObjectivesSpec struct {
	// Scope: Identifier
	ID string

	// Scope: Pool binding
	PoolRef InferenceObjectReference

	// Scope: Policy Handles (the following are different options for how policy binding may look)

	// Include core policies as fields of the object
	Criticality    *int
	FairnessPolicy NotYetDefinedFairnessPolicy
	SLOPolicy      InferenceSLOPolicy

	// or

	// Create an array field that can bind arbitrary policies (including the 'core' policies)
	Policies []InferenceObjectReference

	// or

	// one-of: (where PolicyProfile would contain all 'core' policies that can be configured)
	PolicyProfileRef InferenceObjectReference
	PolicyProfile    PolicyProfile
}

// InferenceObjectReference identifies an API object within the namespace of the
// referrer.
type InferenceObjectReference struct {
	// Group is the group of the referent.
	//
	// +optional
	// +kubebuilder:default="inference.networking.x-k8s.io"
	Group Group `json:"group,omitempty"`

	// Kind is the kind of the referent. For example "InferencePool".
	//
	// +optional
	// +kubebuilder:default="InferencePool"
	Kind Kind `json:"kind,omitempty"`

	// Name is the name of the referent.
	//
	// +kubebuilder:validation:Required
	Name ObjectName `json:"name"`
}

type PolicyProfile struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec PolicyProfileSpec
}

type PolicyProfileSpec struct {
	// This is a departure from InferenceModel, which used a string for criticality.
	// We received quite a bit of feedback about allowing custom criticality bands, so an int/enum is more flexible & carries inherent stack-rank value.
	Criticality    *int
	FairnessPolicy NotYetDefinedFairnessPolicy
	SLOPolicy      InferenceSLOPolicy
}

type InferenceCriticalityPolicy struct {
	Criticality int32
}
```
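
Representing Criticality as an integer means bands compare directly, so users can define as many custom bands as they need without an API change. The sketch below orders a queue by criticality, assuming larger values are more critical and that an unset (`nil`) value falls back to a default of 0; `request`, `orderByCriticality`, and that default are hypothetical and not part of the proposed API.

```golang
package main

import (
	"fmt"
	"sort"
)

// request is a hypothetical queued request tagged with its tenant's criticality.
type request struct {
	ID          string
	Criticality *int // nil: unset, falls back to defaultCriticality
}

// defaultCriticality is a hypothetical band for requests with no explicit value.
const defaultCriticality = 0

func effectiveCriticality(r request) int {
	if r.Criticality == nil {
		return defaultCriticality
	}
	return *r.Criticality
}

// orderByCriticality sorts requests so that higher criticality is served
// first; SliceStable keeps arrival order within the same band.
func orderByCriticality(reqs []request) {
	sort.SliceStable(reqs, func(i, j int) bool {
		return effectiveCriticality(reqs[i]) > effectiveCriticality(reqs[j])
	})
}

func intPtr(v int) *int { return &v }

func main() {
	reqs := []request{
		{ID: "batch", Criticality: intPtr(-5)}, // a custom low band
		{ID: "anon"},                           // unset criticality
		{ID: "chat", Criticality: intPtr(10)},
		{ID: "search", Criticality: intPtr(10)},
	}
	orderByCriticality(reqs)
	for _, r := range reqs {
		fmt.Println(r.ID, effectiveCriticality(r))
	}
}
```

Because the sort is stable, two requests in the same band are still served in arrival order.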

### Intent

The purposes of `InferenceObjectives` are to:
- Create a strong concept of a tenant within the inference pool, used to associate groups of requests together for the purpose of Flow Control, which can enforce:
  - Fair resource sharing
  - Inter-tenant prioritization
  - SLO attainment
- Create a handle with which to attach scheduling policies, allowing for heterogeneous scheduling behavior across tenants
  - For example, the [InferenceSLOPolicy](https://docs.google.com/document/d/1j2KRAT68_FYxq1iVzG0xVL-DHQhGVUZBqiM22Hd_0hc/edit?resourcekey=0-5cSovS8QcRQNYXj0_kRMiw&tab=t.0#heading=h.emkaixupvf39).
- Detach identification from the modelName field

## Usage

The InferenceObjectives API surface has 3 general scopes:
- Identification
- Pool Binding
- Policy Handle

### Pool Binding

Tackling the simplest first: the single Pool Binding field, `PoolRef`, associates a given InferenceObjectives object with a pool. This means that InferenceObjectives with duplicate IDs across different pools are considered valid.
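
Since uniqueness is scoped to a pool, an index of objectives can be keyed by the (pool, ID) pair. This sketch is hypothetical (`objectiveKey`, `registry`, and `add` are illustrative names) and assumes that two objects with the same ID in the *same* pool should be rejected; the proposal only states that duplicates across pools are valid.

```golang
package main

import "fmt"

// objectiveKey scopes an ID to a pool, so the same ID may appear in
// different pools without conflict.
type objectiveKey struct {
	Pool string // name taken from PoolRef
	ID   string
}

// registry is a hypothetical in-memory index of InferenceObjectives.
type registry struct {
	byKey map[objectiveKey]string // value: kube name of the owning object
}

func newRegistry() *registry {
	return &registry{byKey: map[objectiveKey]string{}}
}

// add indexes an object. It rejects a second object with the same ID in the
// same pool (an assumption; see the lead-in).
func (r *registry) add(pool, id, kubeName string) error {
	k := objectiveKey{Pool: pool, ID: id}
	if owner, exists := r.byKey[k]; exists {
		return fmt.Errorf("ID %q already used in pool %q by %q", id, pool, owner)
	}
	r.byKey[k] = kubeName
	return nil
}

func main() {
	r := newRegistry()
	fmt.Println(r.add("pool-a", "team-x", "obj-1")) // succeeds
	fmt.Println(r.add("pool-b", "team-x", "obj-2")) // succeeds: different pool
	fmt.Println(r.add("pool-a", "team-x", "obj-3")) // fails: same pool
}
```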

### Identification
**Note**: This ID field is proposed for Phase 2; Phase 1 will use the kube name as the identifier in the short term. To make the transition smooth, the ID field would default to the kube name.

The only field associated with identification is the `ID` field. A dedicated ID field was chosen (rather than using the metadata name) because:
- We do not want to impose the same restrictions on the ID string that are enforced on a kube resource name
- The ID may be duplicated across different pools
- Using the kube name would force an upstream auth mechanism to be aware of the `InferenceObjectives` API

This ID can be any string; in the case of a MaaS platform, it may identify a customer via a one-way hashed JWT. When sharing an InferencePool between teams, it may be a team ID.
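
For the MaaS case, a minimal sketch of how an upstream auth layer might derive an opaque ID with a one-way hash, using Go's standard `crypto/sha256`. `tenantIDFromToken` and the `t-` prefix are hypothetical; IGW only ever sees the resulting string.

```golang
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tenantIDFromToken derives an opaque tenant ID by one-way hashing a
// credential string. This is a hypothetical helper for an upstream auth
// layer, not an IGW API; the "t-" prefix is an illustrative choice.
func tenantIDFromToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return "t-" + hex.EncodeToString(sum[:])
}

func main() {
	id := tenantIDFromToken("customer-123")
	fmt.Println(id, len(id))
}
```

Hashing an entire JWT would yield a new ID whenever the token is reissued (for example, after expiry), so hashing a stable claim such as `sub` is likely the better input.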

***Important***: To support a high volume of tenants, IGW will by default _allow_ unique IDs that do not have an explicit InferenceObjectives object defined. Requests carrying such IDs will instead use the default for every policy.
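
The two defaulting rules above (the ID falls back to the kube name; IDs without an object get default policies) can be sketched as follows. `objectiveMeta`, `policy`, `effectiveID`, and `policyFor` are hypothetical simplified types and helpers, not the proposed API. The boolean returned by `policyFor` only distinguishes explicit from default configuration; unknown IDs are never rejected.

```golang
package main

import "fmt"

// objectiveMeta is a hypothetical, trimmed view of an InferenceObjectives
// object: only the fields needed to compute its identity.
type objectiveMeta struct {
	KubeName string // metadata.name
	ID       string // spec.ID (Phase 2); may be empty
}

// effectiveID applies the transition rule: ID defaults to the kube name.
func effectiveID(o objectiveMeta) string {
	if o.ID != "" {
		return o.ID
	}
	return o.KubeName
}

// policy is a hypothetical stand-in for the resolved set of policies.
type policy struct {
	Criticality int
}

// policyFor returns the explicit policy for id, or the defaults when no
// InferenceObjectives exists for it. Anonymous IDs are allowed.
func policyFor(id string, explicit map[string]policy, defaults policy) (policy, bool) {
	if p, ok := explicit[id]; ok {
		return p, true
	}
	return defaults, false
}

func main() {
	explicit := map[string]policy{
		effectiveID(objectiveMeta{KubeName: "premium", ID: "cust-42"}): {Criticality: 10},
		effectiveID(objectiveMeta{KubeName: "team-a"}):                 {Criticality: 5}, // no ID: keyed by kube name
	}
	defaults := policy{Criticality: 0}

	for _, id := range []string{"cust-42", "team-a", "never-seen"} {
		p, found := policyFor(id, explicit, defaults)
		fmt.Println(id, p.Criticality, found)
	}
}
```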

#### Alternative consideration(s)
- Making the PoolRef field plural was considered, but it was not selected in order to maintain simplicity. This decision can be revisited in the future.

### Policy Specification
There are a few options for how we handle policy binding. Options that have been considered:

- An `array` field on the `InferenceObjectives` containing `ObjectReferences`, allowing arbitrary policies to be bound
  - Pros: highly flexible, allows user-defined policy
  - Cons: highly abstract & difficult to enforce meaningful defaults/limits on the policies allowed
- A one-of per policy field, allowing either embedded configuration of a policy or a reference to a policy
  - Pros: flexibility of selection, reuse of policies
  - Cons: reusability is only per policy; the composite policy structure is still defined on the `InferenceObjectives`
- A `PolicyProfile` type object that can configure multiple policies together, which multiple `InferenceObjectives` can reference
  - Pros: allows complex configuration to be reused
  - Cons: creates another CRD layer, potentially increasing the complexity of the API surface

The preferred path forward is a `PolicyProfile` that can either be referenced or configured directly in the embedded field on the `InferenceObjectives`.

This lets users trade off between clear, inline configuration and reuse of complex configuration.
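
A minimal sketch of the preferred option's one-of rule: at most one of `PolicyProfileRef` and `PolicyProfile` may be set, and a reference is resolved by name. The trimmed types are hypothetical; pointers are used so that "unset" is distinguishable from a zero value, a common convention for optional fields in Kubernetes Go types. Treating "neither set" as "use default policies" is an assumption, chosen to match the anonymous-ID behavior described under Identification.

```golang
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, trimmed stand-ins for the types in the Phase 2 spec.
type objectRef struct{ Name string }

type profileSpec struct{ Criticality *int }

type objectivesSpec struct {
	PolicyProfileRef *objectRef   // reference to a shared PolicyProfile
	PolicyProfile    *profileSpec // or an inline (embedded) profile
}

// resolveProfile enforces the one-of: at most one of the reference or the
// embedded profile may be set. References are looked up by name; when
// neither is set, the defaults are used (an assumption).
func resolveProfile(spec objectivesSpec, profiles map[string]profileSpec, defaults profileSpec) (profileSpec, error) {
	switch {
	case spec.PolicyProfileRef != nil && spec.PolicyProfile != nil:
		return profileSpec{}, errors.New("policyProfileRef and policyProfile are mutually exclusive")
	case spec.PolicyProfile != nil:
		return *spec.PolicyProfile, nil
	case spec.PolicyProfileRef != nil:
		p, ok := profiles[spec.PolicyProfileRef.Name]
		if !ok {
			return profileSpec{}, fmt.Errorf("PolicyProfile %q not found", spec.PolicyProfileRef.Name)
		}
		return p, nil
	default:
		return defaults, nil
	}
}

func intPtr(v int) *int { return &v }

func main() {
	shared := map[string]profileSpec{"gold": {Criticality: intPtr(10)}}
	defaults := profileSpec{Criticality: intPtr(0)}

	byRef, _ := resolveProfile(objectivesSpec{PolicyProfileRef: &objectRef{Name: "gold"}}, shared, defaults)
	inline, _ := resolveProfile(objectivesSpec{PolicyProfile: &profileSpec{Criticality: intPtr(3)}}, shared, defaults)
	_, err := resolveProfile(objectivesSpec{
		PolicyProfileRef: &objectRef{Name: "gold"},
		PolicyProfile:    &profileSpec{},
	}, shared, defaults)

	fmt.Println(*byRef.Criticality, *inline.Criticality, err)
}
```

A shared profile such as `gold` can be referenced by many objectives, while one-off configuration stays inline on the object that uses it.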