feat: Implement configurable model priority and unload delay#611
feat: Implement configurable model priority and unload delay#611MannerMan wants to merge 2 commits into
Conversation
This change introduces configurable priority for models and an unload delay to improve performance when used with agentic frameworks. Previously, the FIFO queueing could lead to frequent model swapping, hindering progress when multiple agents (e.g., explorer, coder, orchestrator) were running concurrently. This implementation allows assigning priority to each model. Higher priority models are favored, remaining loaded and processing requests until their tasks are complete. An `unloadDelay` is also introduced. This keeps high-priority models loaded for a configurable duration even if no new requests arrive, preventing premature swapping due to minor delays in agentic workflows. Higher priority requests will always interrupt the delay. This addresses performance issues observed with frameworks like OpenCode where background workers frequently request different models.
WalkthroughAdds two per-model config fields, Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
proxy/processgroup.go (1)
294-322: Lock held during parallel process stops may cause contention.
StopProcessesholdspg.mufor the entire duration of the parallel stop operation viadefer. While the processes are being stopped in parallel (which could take time), other goroutines callingProxyRequestwill block onpg.mu.Lock().This is likely acceptable since
StopProcessesis typically called during shutdown or group unloading, but worth noting for awareness.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/processgroup.go` around lines 294 - 322, StopProcesses currently holds pg.mu for the entire parallel shutdown (defer unlocking after wg.Wait), causing contention; fix by copying pg.processes to a local slice while holding pg.mu, cancelUnloadDelayLocked()/clear activeModel()/cond.Broadcast() and unlock pg.mu before starting the goroutines, then run the parallel Stop/StopImmediately on the local slice, wait for wg, and finally re-acquire pg.mu if any post-stop state mutation is required; references: StopProcesses, pg.mu, pg.processes, pg.cancelUnloadDelayLocked, pg.activeModel, pg.cond.Broadcast, Process.Stop, Process.StopImmediately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@proxy/processgroup.go`:
- Line 33: The struct's swap coordination state block has misaligned field
spacing (gofmt fails); adjust the spacing so all field names and their types
line up consistently — e.g., in the block containing unloadDelayCh (chan
struct{}) and the other three fields, align the type columns so each field
follows the same spacing convention used elsewhere in the file (ensure names
like unloadDelayCh align with the other three field names and their types), then
run gofmt to confirm the layout is correct.
---
Nitpick comments:
In `@proxy/processgroup.go`:
- Around line 294-322: StopProcesses currently holds pg.mu for the entire
parallel shutdown (defer unlocking after wg.Wait), causing contention; fix by
copying pg.processes to a local slice while holding pg.mu,
cancelUnloadDelayLocked()/clear activeModel()/cond.Broadcast() and unlock pg.mu
before starting the goroutines, then run the parallel Stop/StopImmediately on
the local slice, wait for wg, and finally re-acquire pg.mu if any post-stop
state mutation is required; references: StopProcesses, pg.mu, pg.processes,
pg.cancelUnloadDelayLocked, pg.activeModel, pg.cond.Broadcast, Process.Stop,
Process.StopImmediately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3aee4fa6-fcc0-4a45-8e39-d31cbd6a2565
📒 Files selected for processing (6)
config-schema.jsonconfig.example.yamldocs/configuration.mdproxy/config/model_config.goproxy/processgroup.goproxy/processgroup_test.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
proxy/processgroup.go (1)
294-322:⚠️ Potential issue | 🔴 CriticalCritical: Deadlock when stopping processes with in-flight requests.
StopProcessesholdspg.muwhile callingwg.Wait()(line 321). If an in-flight request is being served (at line 162), it will try to acquirepg.muat line 164 after completing. Sinceprocess.Stop()waits for in-flight requests via its internal WaitGroup, and the in-flight request can't acquire the lock to decrementpg.inFlight, the result is a deadlock.This affects
swapProcessGroup()inproxymanager.gowhich callsStopProcesseswhen handling exclusive groups.🔒 Proposed fix: Release lock during process stop operations
func (pg *ProcessGroup) StopProcesses(strategy StopStrategy) { pg.mu.Lock() - defer func() { - pg.cancelUnloadDelayLocked() - pg.activeModel = "" - pg.cond.Broadcast() - pg.mu.Unlock() - }() if len(pg.processes) == 0 { + pg.mu.Unlock() return } + pg.cancelUnloadDelayLocked() + pg.activeModel = "" + pg.mu.Unlock() + // stop Processes in parallel var wg sync.WaitGroup for _, process := range pg.processes { wg.Add(1) go func(process *Process) { defer wg.Done() switch strategy { case StopImmediately: process.StopImmediately() default: process.Stop() } }(process) } wg.Wait() + + pg.mu.Lock() + pg.cond.Broadcast() + pg.mu.Unlock() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@proxy/processgroup.go` around lines 294 - 322, StopProcesses currently holds pg.mu while waiting for process.Stop()/StopImmediately() which can deadlock; change it to copy pg.processes into a local slice while holding pg.mu, then release pg.mu before launching goroutines and waiting on the WaitGroup, and after wg.Wait() re-acquire pg.mu to call cancelUnloadDelayLocked(), clear pg.activeModel, and pg.cond.Broadcast(). Use the same symbols to locate code: ProcessGroup.StopProcesses, pg.mu, pg.processes, process.Stop/StopImmediately, cancelUnloadDelayLocked(), pg.activeModel, and pg.cond.Broadcast().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@proxy/processgroup.go`:
- Around line 294-322: StopProcesses currently holds pg.mu while waiting for
process.Stop()/StopImmediately() which can deadlock; change it to copy
pg.processes into a local slice while holding pg.mu, then release pg.mu before
launching goroutines and waiting on the WaitGroup, and after wg.Wait()
re-acquire pg.mu to call cancelUnloadDelayLocked(), clear pg.activeModel, and
pg.cond.Broadcast(). Use the same symbols to locate code:
ProcessGroup.StopProcesses, pg.mu, pg.processes, process.Stop/StopImmediately,
cancelUnloadDelayLocked(), pg.activeModel, and pg.cond.Broadcast().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 91c84061-57d5-4e40-9f0f-35a9f06614a4
📒 Files selected for processing (1)
proxy/processgroup.go
|
Hi, You're right that llama-swap has a basic FIFO queue and with agents that make use of different models for sub agents it can cause a lot of thrashing. I use Claude Code and it uses haiku for smaller tasks like summaries. What coding agent and models are you running? I think the right approach is to introduce some choice into the queuing strategy used by llama-swap. The basic FIFO can remain the default and a new FIFO+ with collation of requests can be used. With the new strategy this request queue: A A B C B A B would turn into: A A A B B B C that way the time swapping is minimized. To prevent starvation for C a max wait time can help push it higher into the queue. |
I use mainly use two coding tools, Roo code extension in vscode and opencode with oh-my-opencode-slim plugin; Roo is working very well with the current llama-swap behavior because its subtasks are performed serially, i.e. the orchestrator starts a subtask which is delegated to a specialized agent, and Roo does not 'yield' back to the orchestrator until the agent is finished. Opencode with the plugin is a bit more powerful in that it supports launching background agents in parallel (maybe Roo can do this too, not sure). I tried the regular oh-my-opencode plugin but the prompts are too complex for the local models I use, however the -slim variant of the plugin works well. My setup is 4x RTX 3060 12gb and I mainly use a combination of these models;
-- I think the suggested FIFO+ is elegant from the user perspective (a single toggle on/off to get the behavior). Max wait could also potentially help making sure agent models finish before the orchestrator continues, if being able to set this on a per-model basis. I do however also see some cases where the proposed FIFO+ could be sub-optimal, for instance the case of a orchestrator launching only one background agent; This would likely cause excessive swapping behavior without some sort of debounce feature since the single background agent will not queue up several requests at once for its model. Or it could potentially keep the orchestrator loaded for too long causing it to give up on the agent. I do think being able to configure the priority of individual models in the swapping queue could be very useful for advanced users; This enables control over when swapping happens or not, which can be very useful in different pipelines. I can see the issue of some models 'never' being swapped in though, but maybe that's a problem for said advanced users to deal with :) Maybe the FIFO+ option can also support setting manual priority for models to get the best of both? Or having a global |
|
hey @MannerMan @mostlygeek, are you planning to merge this, refactor or create other complementary solution for queues and priorities? :) This is the only thing lacking in llama-swap from my perspective. I would love to see it officially 👍 |
Hi!
First of all I want to give thanks for llama-swap, it's truly a fantastic piece of software for us gpu starved people experimenting at home!
I have used it extensively with my local llm setup, always been working great.
These days however many agentic frameworks have a concept of background workers, which causes some sub-optimal behavior with llama-swap.
Specifically, the situation I'm running into is something like this with for instance opencode or similar tooling;
Orchestrator Agent - Model1
Explorer Agent - Model2
Coder - Model3
Orchestrator starts with some task, is loaded in by llama-swap and starts reasoning to build a plan. Generally it might end up launching something like 2 explorers and one coder agent in parallel, commonly as 'background workers'. This sends several incoming requests to llama-swap for different model, including the orchestrator model. The current behavior of the queue is something like FIFO, and the first request arriving for Model2 is just a system prompt, as soon as that is done processing it's swap out for model3 which was some other background task, which is then quickly swapped out back to the orchestrator model1 which tends to just give up on the agents after a while since no progress is made. I guess the TLDR is that the majority of the time is spent just swapping model, with little actual processing happening at all.
Therefore, here is a feature implementation that supports configurable priority per model (higher priority wins). This way, it's possible to configure for instance the explorer model with priority 3, coder model with priority 2 and orchestrator model with priority 1. What happens then in the scenario above is instead;
I had to add one more thing to make this work well though;
unloadDelay. This works as a 'debounce' feature that keeps models with higher priority loaded for a configurable time if all other requests in the queue has lower priority. If a new request arrive for the currently loaded model before unload delay has expired, the timer resets and the model stays loaded and process the request. If a request for a higher priority model arrives though, the unloadDelay is ignored. This was required as there is usually a small delay in the agentic frameworks before the next request is triggered, and that was causing similar problems as the original situation. I haveunloadDelayset to 5 for my agent models which works great.I'm not sure
unloadDelayis a great term, it might be better withdebounce,debounceTimer,debounceDelayor something to keep it clearly separate fromttlwhich works differently. Open to suggestions.I'm not used to Golang and had Claude code helped me implement this, I have reviewed the code to the best of my ability and also tested the feature manually. But in case there is any slop in the code I apologize in advance.