Implement new scheduler #823
Conversation
Extract the request-scheduling logic out of baseRouter into a new internal/router/scheduler package so the scheduling strategy can be swapped independently of the process machinery and the eviction policy. FIFO is the first and only implementation and reproduces the previous behaviour exactly. baseRouter keeps the channels, run loop, ServeHTTP, process lifecycle (doSwap), and shutdown teardown, and now implements scheduler.Effects so the scheduler can call back for every side-effect. newBaseRouter takes a scheduler.Factory (defaulting to FIFO, wired up by Group and Matrix) instead of a SwapPlanner. - move SwapPlanner into the scheduler package; groupPlanner and matrixPlanner satisfy it structurally and are captured in the factory closure each concrete router passes to newBaseRouter - move the run-loop state (active/inFlight/queued) and the decision tree (OnRequest/OnSwapDone/OnServeDone/OnUnload/OnShutdown) into scheduler.FIFO - alias router.ErrNoLocalModelFound to scheduler.ErrModelNotFound so SendError keeps matching the error the scheduler grants - migrate the FIFO scheduling tests to the scheduler package as direct, synchronous unit tests against a fake Effects; trim base_test.go down to baseRouter mechanism only (run loop, grant/ServeHTTP, Unload, Shutdown)
- Add the first draft of developer docs for creating new routers and schedulers. Refactor the scheduler interfaces to eliminate the extra map[string]*process.Process in the matrixPlanner and groupPlanner. This makes the scheduler the single owner of "what's running".
The new routing.router and routing.schedule is the new home for router and scheduler settings. Backwards compatibility of old configuration files are maintained with a new config error if both legacy matrix/group are defined along routing.router settings
Wire config.FifoConfig into the FIFO scheduler so queued requests are serviced by descending model priority instead of strict arrival order. - NewFIFO accepts a config.FifoConfig and stores it on the scheduler - enqueue does a stable priority insert: higher-priority models float ahead, equal-priority requests keep FIFO arrival order - group and matrix routers pass conf.Routing.Scheduler.Settings.Fifo - add TestFIFO_PriorityQueueOrder
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@config-schema.json`:
- Around line 560-590: The JSON schema for the "router" object allows
combinations that the runtime loader LoadConfigFromReader rejects (e.g., both
settings.groups and settings.matrix present, or use="matrix" without
settings.matrix), so update the "router" schema to express the same constraints
as the runtime: make settings mutually exclusive (only one of groupsConfig or
matrixConfig allowed) and add conditional requirements based on "use" (if "use"
is "group" then require settings.groups; if "use" is "matrix" then require
settings.matrix). Implement this using JSON Schema constructs (oneOf or
dependent schemas / if-then-else) referencing the existing router.use,
router.settings, groupsConfig and matrixConfig so schema validation matches
LoadConfigFromReader behavior.
In `@config.example.yaml`:
- Around line 380-515: The example incorrectly defines both router.groups and
router.matrix together (routing.router.settings.groups and
routing.router.settings.matrix), which the config loader rejects; fix by
providing a single routing strategy per example—either remove the entire matrix
block when demonstrating groups (keep groups and related entries like
"group1","group2","forever") or remove the groups section and keep only the
matrix vars/evict_costs/sets (vars g,q,m..., evict_costs, sets such as
standard/with_rerank/creative/full) so the sample uses one valid configuration
approach.
- Around line 370-378: The FIFO priority example uses undefined model IDs
(A/B/C/D) which will fail validation for
routing.scheduler.settings.fifo.priority; replace those entries with actual
model IDs present in your configuration (or use the documented example IDs from
your models section, e.g., the real model keys such as "gpt-4", "gpt-4o",
"gpt-3.5-turbo" or other defined names), or leave the priority map empty to show
the default; update the priority dictionary key names to match the exact model
IDs used elsewhere so the loader validation succeeds.
In `@internal/router/design.md`:
- Around line 83-101: The fenced code block starting at the diagram (containing
"ServeHTTP", "HandlerReq", "baseRouter.run()", "Scheduler.On*(...)", etc.) lacks
a language tag causing MD040; add an explicit language identifier (e.g.,
```text) to the opening fence so the block is fenced as ```text ... ``` to
satisfy markdownlint and keep the diagram formatting unchanged.
In `@internal/router/scheduler/fifo.go`:
- Around line 343-356: The collidesWith function (parameters target, evict,
active) currently only checks target↔evict collisions but misses overlapping
evict sets between different active swaps (evict ∩ sw.evict), allowing two swaps
to evict the same model concurrently; update collidesWith to also detect any
intersection between the caller's evict slice and each active swap's sw.evict
(e.g., by checking if any element of evict exists in sw.evict or adding a helper
like slicesOverlap(evict, sw.evict)) and return true when such overlap is found,
keeping the existing checks for id==target, containsString(evict, id) and
containsString(sw.evict, target).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d4c8053d-576b-43a8-ad12-853c55fcf6b4
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (27)
.github/workflows/config-schema.ymlMakefileconfig-schema.jsonconfig.example.yamlgo.modinternal/config/config.gointernal/config/config_posix_test.gointernal/config/config_schema_test.gointernal/config/config_test.gointernal/config/config_windows_test.gointernal/config/matrix.gointernal/config/matrix_test.gointernal/perf/monitor.gointernal/router/base.gointernal/router/base_test.gointernal/router/design.mdinternal/router/group.gointernal/router/group_test.gointernal/router/helpers_test.gointernal/router/matrix.gointernal/router/matrix_test.gointernal/router/router.gointernal/router/scheduler/fifo.gointernal/router/scheduler/fifo_test.gointernal/router/scheduler/scheduler.gointernal/server/server.gointernal/server/server_test.go
| router: | ||
| use: group # default, valid: group, matrix | ||
| settings: | ||
| # groups: a dictionary of group settings | ||
| # - optional, default: empty dictionary | ||
| # - provides advanced controls over model swapping behaviour | ||
| # - using groups some models can be kept loaded indefinitely, while others are swapped out | ||
| # - model IDs must be defined in the Models section | ||
| # - a model can only be a member of one group | ||
| # - group behaviour is controlled via the `swap`, `exclusive` and `persistent` fields | ||
| # - see issue #109 for details | ||
| # | ||
| # NOTE: the example below uses model names that are not defined above for demonstration purposes | ||
| groups: | ||
| # group1 works the same as the default behaviour of llama-swap where only one model is allowed | ||
| # to run a time across the whole llama-swap instance | ||
| "group1": | ||
| # swap: controls the model swapping behaviour in within the group | ||
| # - optional, default: true | ||
| # - true : only one model is allowed to run at a time | ||
| # - false: all models can run together, no swapping | ||
| swap: true | ||
|
|
||
| # exclusive: controls how the group affects other groups | ||
| # - optional, default: true | ||
| # - true: causes all other groups to unload when this group runs a model | ||
| # - false: does not affect other groups | ||
| exclusive: true | ||
|
|
||
| # members references the models defined above | ||
| # required | ||
| members: | ||
| - "llama" | ||
| - "qwen-unlisted" | ||
|
|
||
| # Example: | ||
| # - in group2 all models can run at the same time | ||
| # - when a different group is loaded it causes all running models in this group to unload | ||
| "group2": | ||
| swap: false | ||
|
|
||
| # exclusive: false does not unload other groups when a model in group2 is requested | ||
| # - the models in group2 will be loaded but will not unload any other groups | ||
| exclusive: false | ||
| members: | ||
| - "docker-llama" | ||
| - "modelA" | ||
| - "modelB" | ||
|
|
||
| # Example: | ||
| # - a persistent group, prevents other groups from unloading it | ||
| "forever": | ||
| # persistent: prevents over groups from unloading the models in this group | ||
| # - optional, default: false | ||
| # - does not affect individual model behaviour | ||
| persistent: true | ||
|
|
||
| # set swap/exclusive to false to prevent swapping inside the group | ||
| # and the unloading of other groups | ||
| swap: false | ||
| exclusive: false | ||
| members: | ||
| - "forever-modelA" | ||
| - "forever-modelB" | ||
| - "forever-modelc" | ||
|
|
||
| # The matrix declares valid combinations of models that can run concurrently. | ||
| # When a model is requested, the solver finds the cheapest way to make it | ||
| # available by evicting as few (and least costly) running models as possible. | ||
| # | ||
| # Solver behavior: | ||
| # 1. Request arrives for model X | ||
| # 2. If X is already running, forward immediately. Done. | ||
| # 3. Find all sets containing X | ||
| # 4. For each candidate set, compute cost: sum of evict_costs for | ||
| # every running model NOT in that set | ||
| # 5. Pick lowest cost candidate. Ties broken by definition order. | ||
| # 6. Evict what needs to stop. Start X. Forward request. | ||
| # | ||
| # Subset semantics: a set [a, b, c] means any subset is valid. | ||
| # Only the requested model is started — others are not preloaded. | ||
| # | ||
| # A model not appearing in any set can only run alone. | ||
| # | ||
| matrix: | ||
| # vars: short names for models (alphanumeric, 1-8 chars) | ||
| # - required for sets and evict_costs settings | ||
| # - each entry is a short name to a real model ID. Do not use an alias | ||
| # - used to keep set DSL logic short and easier to read | ||
| # - sets and evict_costs only use identifiers defined in vars | ||
| vars: | ||
| g: gemma-model | ||
| q: qwen-model | ||
| m: mistral-model | ||
| v: voxtral-model | ||
| e: reranker-model | ||
| L: llama-70B | ||
| sd: stable-diffusion | ||
|
|
||
| # evict_costs: relative cost of losing a running model (default: 1) | ||
| evict_costs: | ||
| v: 50 # vllm backend, slow cold start | ||
| L: 30 # 70B weights, slow to load | ||
|
|
||
| # sets: named sets of concurrent model combinations | ||
| # Values are DSL strings with operators: | ||
| # & AND (models run together) | ||
| # | OR (alternatives) | ||
| # () grouping | ||
| # +ref inline another set's expression | ||
| # | ||
| # Expansion examples: | ||
| # "L" → [L] | ||
| # "a & b" → [a, b] | ||
| # "a | b" → [a], [b] | ||
| # "(a | b) & c" → [a, c], [b, c] | ||
| # "(a | b) & (c | d)" → [a,c], [a,d], [b,c], [b,d] | ||
| # "+llms & v" → expands llms inline, then applies & v | ||
| sets: | ||
| # LLM + TTS: switching between g/q/m won't evict v | ||
| # expands to: [g,v], [q,v], [m,v] | ||
| standard: "(g | q | m) & v" | ||
|
|
||
| # LLM + TTS + reranker | ||
| # expands to: [g,v,e], [q,v,e] | ||
| with_rerank: "(g | q) & v & e" | ||
|
|
||
| # LLM + image generation, no TTS | ||
| # expands to: [g,sd], [q,sd] | ||
| creative: "(g | q) & sd" | ||
|
|
||
| # 70B model uses all GPUs, can only run alone | ||
| # expands to: [L] | ||
| full: "L" | ||
|
|
||
| # peers: a dictionary of remote peers and models they provide |
There was a problem hiding this comment.
Example mixes groups and matrix under one router config, which loader rejects.
Line 382-Line 515 defines both routing.router.settings.groups and routing.router.settings.matrix. The config loader returns an error for this combination, so this sample is not runnable as written.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@config.example.yaml` around lines 380 - 515, The example incorrectly defines
both router.groups and router.matrix together (routing.router.settings.groups
and routing.router.settings.matrix), which the config loader rejects; fix by
providing a single routing strategy per example—either remove the entire matrix
block when demonstrating groups (keep groups and related entries like
"group1","group2","forever") or remove the groups section and keep only the
matrix vars/evict_costs/sets (vars g,q,m..., evict_costs, sets such as
standard/with_rerank/creative/full) so the sample uses one valid configuration
approach.
| func collidesWith(target string, evict []string, active map[string]*activeSwap) bool { | ||
| for id, sw := range active { | ||
| if id == target { | ||
| continue | ||
| } | ||
| if containsString(evict, id) { | ||
| return true | ||
| } | ||
| if containsString(sw.evict, target) { | ||
| return true | ||
| } | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Prevent parallel swaps that evict the same model.
Line 348 and Line 351 only guard target↔evict collisions. They miss evict-set overlap (evict ∩ sw.evict), so two active swaps can concurrently stop the same process.
Suggested patch
func collidesWith(target string, evict []string, active map[string]*activeSwap) bool {
for id, sw := range active {
if id == target {
continue
}
if containsString(evict, id) {
return true
}
if containsString(sw.evict, target) {
return true
}
+ if intersects(evict, sw.evict) {
+ return true
+ }
}
return false
}
+func intersects(a, b []string) bool {
+ for _, x := range a {
+ if containsString(b, x) {
+ return true
+ }
+ }
+ return false
+}
+
func conflictsWithInFlight(evict []string, inFlight map[string]int) bool {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func collidesWith(target string, evict []string, active map[string]*activeSwap) bool { | |
| for id, sw := range active { | |
| if id == target { | |
| continue | |
| } | |
| if containsString(evict, id) { | |
| return true | |
| } | |
| if containsString(sw.evict, target) { | |
| return true | |
| } | |
| } | |
| return false | |
| } | |
| func collidesWith(target string, evict []string, active map[string]*activeSwap) bool { | |
| for id, sw := range active { | |
| if id == target { | |
| continue | |
| } | |
| if containsString(evict, id) { | |
| return true | |
| } | |
| if containsString(sw.evict, target) { | |
| return true | |
| } | |
| if intersects(evict, sw.evict) { | |
| return true | |
| } | |
| } | |
| return false | |
| } | |
| func intersects(a, b []string) bool { | |
| for _, x := range a { | |
| if containsString(b, x) { | |
| return true | |
| } | |
| } | |
| return false | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/router/scheduler/fifo.go` around lines 343 - 356, The collidesWith
function (parameters target, evict, active) currently only checks target↔evict
collisions but misses overlapping evict sets between different active swaps
(evict ∩ sw.evict), allowing two swaps to evict the same model concurrently;
update collidesWith to also detect any intersection between the caller's evict
slice and each active swap's sw.evict (e.g., by checking if any element of evict
exists in sw.evict or adding a helper like slicesOverlap(evict, sw.evict)) and
return true when such overlap is found, keeping the existing checks for
id==target, containsString(evict, id) and containsString(sw.evict, target).
routingconfiguration section that supersedesmatrixandgroupwhile maintaining backwards compatibilityFixes #797