feat: Implement configurable model priority and unload delay by MannerMan · Pull Request #611 · mostlygeek/llama-swap

MannerMan · 2026-03-29T19:33:37Z

Hi!

First of all I want to give thanks for llama-swap, it's truly a fantastic piece of software for us gpu starved people experimenting at home!

I have used it extensively with my local llm setup, always been working great.

These days however many agentic frameworks have a concept of background workers, which causes some sub-optimal behavior with llama-swap.

Specifically, the situation I'm running into is something like this with for instance opencode or similar tooling;

Orchestrator Agent - Model1
Explorer Agent - Model2
Coder - Model3

Orchestrator starts with some task, is loaded in by llama-swap and starts reasoning to build a plan. Generally it might end up launching something like 2 explorers and one coder agent in parallel, commonly as 'background workers'. This sends several incoming requests to llama-swap for different model, including the orchestrator model. The current behavior of the queue is something like FIFO, and the first request arriving for Model2 is just a system prompt, as soon as that is done processing it's swap out for model3 which was some other background task, which is then quickly swapped out back to the orchestrator model1 which tends to just give up on the agents after a while since no progress is made. I guess the TLDR is that the majority of the time is spent just swapping model, with little actual processing happening at all.

Therefore, here is a feature implementation that supports configurable priority per model (higher priority wins). This way, it's possible to configure for instance the explorer model with priority 3, coder model with priority 2 and orchestrator model with priority 1. What happens then in the scenario above is instead;

Orchestrator starts with some task, is loaded in by llama-swap and starts reasoning to build a plan.
Orchestrator launches two explorers and one coder
Since the explorer model has the highest priority, llama-swap will keep processing requests for this model until it has finished its tasks
The coder model has second highest priority, and will be loaded in next and be able to fully finish its task
Finally the orchestrator is loaded in again and can gather the result of all agents.

I had to add one more thing to make this work well though; unloadDelay. This works as a 'debounce' feature that keeps models with higher priority loaded for a configurable time if all other requests in the queue has lower priority. If a new request arrive for the currently loaded model before unload delay has expired, the timer resets and the model stays loaded and process the request. If a request for a higher priority model arrives though, the unloadDelay is ignored. This was required as there is usually a small delay in the agentic frameworks before the next request is triggered, and that was causing similar problems as the original situation. I have unloadDelay set to 5 for my agent models which works great.

I'm not sure unloadDelay is a great term, it might be better with debounce, debounceTimer, debounceDelay or something to keep it clearly separate from ttl which works differently. Open to suggestions.

I'm not used to Golang and had Claude code helped me implement this, I have reviewed the code to the best of my ability and also tested the feature manually. But in case there is any slop in the code I apologize in advance.

This change introduces configurable priority for models and an unload delay to improve performance when used with agentic frameworks. Previously, the FIFO queueing could lead to frequent model swapping, hindering progress when multiple agents (e.g., explorer, coder, orchestrator) were running concurrently. This implementation allows assigning priority to each model. Higher priority models are favored, remaining loaded and processing requests until their tasks are complete. An `unloadDelay` is also introduced. This keeps high-priority models loaded for a configurable duration even if no new requests arrive, preventing premature swapping due to minor delays in agentic workflows. Higher priority requests will always interrupt the delay. This addresses performance issues observed with frameworks like OpenCode where background workers frequently request different models.

coderabbitai · 2026-03-29T19:33:49Z

Walkthrough

Adds two per-model config fields, priority and unloadDelay, updates schema, examples and docs, extends ModelConfig with new exported fields, and refactors process-group swap coordination to a priority- and unload-delay–aware synchronization scheme with accompanying tests.

Changes

Cohort / File(s)	Summary
Configuration Schema & Examples `config-schema.json`, `config.example.yaml`, `docs/configuration.md`	Introduce `models.<id>.priority` (integer, >=0, default 0) and `models.<id>.unloadDelay` (integer, >=0, default 0); update schema, example config, and documentation describing priority ordering and unload-delay semantics.
Model Config Types `proxy/config/model_config.go`	Add exported `Priority int \`yaml:"priority"``and`UnloadDelay int `yaml:"unloadDelay"``fields to`ModelConfig` for YAML unmarshalling (zero-value defaults).
Process Group Scheduling `proxy/processgroup.go`	Refactor swap handling to priority-aware coordination: replace `lastUsedProcess` with `activeModel`, `inFlight`, `pendingCount`, `unloadDelayCh`; add mutex+cond synchronization, logic for priority preemption/fairness, unload-delay timers, and cancellation on shutdown.
Tests `proxy/processgroup_test.go`	Add unit and concurrency tests exercising priority selection, preemption behavior, unload-delay hold and bypass cases, timer reset semantics, equal-priority fairness, and zero-delay behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

proxy/config: create config package and migrate configuration #329: Modifies ModelConfig and proxy/processgroup.go, likely to overlap with the same config and process-group scheduling changes in this PR.

Suggested labels

enhancement, configuration

Suggested reviewers

mostlygeek

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main changes: adding configurable priority and unload delay features for model scheduling.
Description check	✅ Passed	The description is directly related to the changeset, providing detailed context about the motivation, implementation approach, and feature behavior across all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

proxy/processgroup.go (1)
294-322: Lock held during parallel process stops may cause contention.

StopProcesses holds pg.mu for the entire duration of the parallel stop operation via defer. While the processes are being stopped in parallel (which could take time), other goroutines calling ProxyRequest will block on pg.mu.Lock().

This is likely acceptable since StopProcesses is typically called during shutdown or group unloading, but worth noting for awareness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/processgroup.go` around lines 294 - 322, StopProcesses currently holds
pg.mu for the entire parallel shutdown (defer unlocking after wg.Wait), causing
contention; fix by copying pg.processes to a local slice while holding pg.mu,
cancelUnloadDelayLocked()/clear activeModel()/cond.Broadcast() and unlock pg.mu
before starting the goroutines, then run the parallel Stop/StopImmediately on
the local slice, wait for wg, and finally re-acquire pg.mu if any post-stop
state mutation is required; references: StopProcesses, pg.mu, pg.processes,
pg.cancelUnloadDelayLocked, pg.activeModel, pg.cond.Broadcast, Process.Stop,
Process.StopImmediately.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/processgroup.go`:
- Line 33: The struct's swap coordination state block has misaligned field
spacing (gofmt fails); adjust the spacing so all field names and their types
line up consistently — e.g., in the block containing unloadDelayCh (chan
struct{}) and the other three fields, align the type columns so each field
follows the same spacing convention used elsewhere in the file (ensure names
like unloadDelayCh align with the other three field names and their types), then
run gofmt to confirm the layout is correct.

---

Nitpick comments:
In `@proxy/processgroup.go`:
- Around line 294-322: StopProcesses currently holds pg.mu for the entire
parallel shutdown (defer unlocking after wg.Wait), causing contention; fix by
copying pg.processes to a local slice while holding pg.mu,
cancelUnloadDelayLocked()/clear activeModel()/cond.Broadcast() and unlock pg.mu
before starting the goroutines, then run the parallel Stop/StopImmediately on
the local slice, wait for wg, and finally re-acquire pg.mu if any post-stop
state mutation is required; references: StopProcesses, pg.mu, pg.processes,
pg.cancelUnloadDelayLocked, pg.activeModel, pg.cond.Broadcast, Process.Stop,
Process.StopImmediately.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3aee4fa6-fcc0-4a45-8e39-d31cbd6a2565

📥 Commits

Reviewing files that changed from the base of the PR and between 1e44077 and 98117e4.

📒 Files selected for processing (6)

config-schema.json
config.example.yaml
docs/configuration.md
proxy/config/model_config.go
proxy/processgroup.go
proxy/processgroup_test.go

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

proxy/processgroup.go (1)
294-322: ⚠️ Potential issue | 🔴 Critical

Critical: Deadlock when stopping processes with in-flight requests.

StopProcesses holds pg.mu while calling wg.Wait() (line 321). If an in-flight request is being served (at line 162), it will try to acquire pg.mu at line 164 after completing. Since process.Stop() waits for in-flight requests via its internal WaitGroup, and the in-flight request can't acquire the lock to decrement pg.inFlight, the result is a deadlock.

This affects swapProcessGroup() in proxymanager.go which calls StopProcesses when handling exclusive groups.
🔒 Proposed fix: Release lock during process stop operations
 func (pg *ProcessGroup) StopProcesses(strategy StopStrategy) {
 	pg.mu.Lock()
-	defer func() {
-		pg.cancelUnloadDelayLocked()
-		pg.activeModel = ""
-		pg.cond.Broadcast()
-		pg.mu.Unlock()
-	}()
 
 	if len(pg.processes) == 0 {
+		pg.mu.Unlock()
 		return
 	}
 
+	pg.cancelUnloadDelayLocked()
+	pg.activeModel = ""
+	pg.mu.Unlock()
+
 	// stop Processes in parallel
 	var wg sync.WaitGroup
 	for _, process := range pg.processes {
 		wg.Add(1)
 		go func(process *Process) {
 			defer wg.Done()
 			switch strategy {
 			case StopImmediately:
 				process.StopImmediately()
 			default:
 				process.Stop()
 			}
 		}(process)
 	}
 	wg.Wait()
+
+	pg.mu.Lock()
+	pg.cond.Broadcast()
+	pg.mu.Unlock()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/processgroup.go` around lines 294 - 322, StopProcesses currently holds
pg.mu while waiting for process.Stop()/StopImmediately() which can deadlock;
change it to copy pg.processes into a local slice while holding pg.mu, then
release pg.mu before launching goroutines and waiting on the WaitGroup, and
after wg.Wait() re-acquire pg.mu to call cancelUnloadDelayLocked(), clear
pg.activeModel, and pg.cond.Broadcast(). Use the same symbols to locate code:
ProcessGroup.StopProcesses, pg.mu, pg.processes, process.Stop/StopImmediately,
cancelUnloadDelayLocked(), pg.activeModel, and pg.cond.Broadcast().

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@proxy/processgroup.go`:
- Around line 294-322: StopProcesses currently holds pg.mu while waiting for
process.Stop()/StopImmediately() which can deadlock; change it to copy
pg.processes into a local slice while holding pg.mu, then release pg.mu before
launching goroutines and waiting on the WaitGroup, and after wg.Wait()
re-acquire pg.mu to call cancelUnloadDelayLocked(), clear pg.activeModel, and
pg.cond.Broadcast(). Use the same symbols to locate code:
ProcessGroup.StopProcesses, pg.mu, pg.processes, process.Stop/StopImmediately,
cancelUnloadDelayLocked(), pg.activeModel, and pg.cond.Broadcast().

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 91c84061-57d5-4e40-9f0f-35a9f06614a4

📥 Commits

Reviewing files that changed from the base of the PR and between 98117e4 and b44c8c9.

📒 Files selected for processing (1)

proxy/processgroup.go

mostlygeek · 2026-03-30T02:19:10Z

Hi,

You're right that llama-swap has a basic FIFO queue and with agents that make use of different models for sub agents it can cause a lot of thrashing. I use Claude Code and it uses haiku for smaller tasks like summaries.

What coding agent and models are you running?

I think the right approach is to introduce some choice into the queuing strategy used by llama-swap. The basic FIFO can remain the default and a new FIFO+ with collation of requests can be used.

With the new strategy this request queue:

A A B C B A B

would turn into:

A A A B B B C

that way the time swapping is minimized.

To prevent starvation for C a max wait time can help push it higher into the queue.

MannerMan · 2026-03-30T20:23:23Z

Hi,

You're right that llama-swap has a basic FIFO queue and with agents that make use of different models for sub agents it can cause a lot of thrashing. I use Claude Code and it uses haiku for smaller tasks like summaries.

What coding agent and models are you running?

I think the right approach is to introduce some choice into the queuing strategy used by llama-swap. The basic FIFO can remain the default and a new FIFO+ with collation of requests can be used.

With the new strategy this request queue:

A A B C B A B

would turn into:

A A A B B B C

that way the time swapping is minimized.

To prevent starvation for C a max wait time can help push it higher into the queue.

I use mainly use two coding tools, Roo code extension in vscode and opencode with oh-my-opencode-slim plugin; Roo is working very well with the current llama-swap behavior because its subtasks are performed serially, i.e. the orchestrator starts a subtask which is delegated to a specialized agent, and Roo does not 'yield' back to the orchestrator until the agent is finished.

Opencode with the plugin is a bit more powerful in that it supports launching background agents in parallel (maybe Roo can do this too, not sure). I tried the regular oh-my-opencode plugin but the prompts are too complex for the local models I use, however the -slim variant of the plugin works well.

My setup is 4x RTX 3060 12gb and I mainly use a combination of these models;

devstral-small-2-24b (solid coder)
qwen-3.5-27b (typical main orchestrator, smartest model of the size)
qwen-3.5-35b-a3b
qwen-coder-next (in Q4 quant)
Ministral-3-14b (gives room for large parallel contexts, good for parallel subagents exploring code etc)

--

I think the suggested FIFO+ is elegant from the user perspective (a single toggle on/off to get the behavior). Max wait could also potentially help making sure agent models finish before the orchestrator continues, if being able to set this on a per-model basis.

I do however also see some cases where the proposed FIFO+ could be sub-optimal, for instance the case of a orchestrator launching only one background agent; This would likely cause excessive swapping behavior without some sort of debounce feature since the single background agent will not queue up several requests at once for its model. Or it could potentially keep the orchestrator loaded for too long causing it to give up on the agent.

I do think being able to configure the priority of individual models in the swapping queue could be very useful for advanced users; This enables control over when swapping happens or not, which can be very useful in different pipelines. I can see the issue of some models 'never' being swapped in though, but maybe that's a problem for said advanced users to deal with :)

Maybe the FIFO+ option can also support setting manual priority for models to get the best of both?

Or having a global queueBehavior option that can be simple/smart/priority for instance.

Luiszzzor · 2026-05-05T14:38:26Z

hey @MannerMan @mostlygeek,

are you planning to merge this, refactor or create other complementary solution for queues and priorities? :) This is the only thing lacking in llama-swap from my perspective. I would love to see it officially 👍

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

Comment thread proxy/processgroup.go Outdated

Fix gofmt fault on processgroup.go

b44c8c9

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

mostlygeek mentioned this pull request Mar 30, 2026

[Feature Request] Queuing strategies to reduce swapping #612

Open

Luiszzzor mentioned this pull request May 24, 2026

Introduce a new routing backend #784

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement configurable model priority and unload delay#611

feat: Implement configurable model priority and unload delay#611
MannerMan wants to merge 2 commits into
mostlygeek:mainfrom
MannerMan:feature/priority-unload-delay

MannerMan commented Mar 29, 2026

Uh oh!

coderabbitai Bot commented Mar 29, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

mostlygeek commented Mar 30, 2026

Uh oh!

MannerMan commented Mar 30, 2026

Uh oh!

Luiszzzor commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

MannerMan commented Mar 29, 2026

Uh oh!

coderabbitai Bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mostlygeek commented Mar 30, 2026

Uh oh!

MannerMan commented Mar 30, 2026

Uh oh!

Luiszzzor commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai Bot commented Mar 29, 2026 •

edited

Loading