Skip to content

feat: Implement configurable model priority and unload delay#611

Open
MannerMan wants to merge 2 commits into
mostlygeek:mainfrom
MannerMan:feature/priority-unload-delay
Open

feat: Implement configurable model priority and unload delay#611
MannerMan wants to merge 2 commits into
mostlygeek:mainfrom
MannerMan:feature/priority-unload-delay

Conversation

@MannerMan

Copy link
Copy Markdown

Hi!

First of all I want to give thanks for llama-swap, it's truly a fantastic piece of software for us gpu starved people experimenting at home!

I have used it extensively with my local llm setup, always been working great.

These days however many agentic frameworks have a concept of background workers, which causes some sub-optimal behavior with llama-swap.

Specifically, the situation I'm running into is something like this with for instance opencode or similar tooling;

Orchestrator Agent - Model1
Explorer Agent - Model2
Coder - Model3

Orchestrator starts with some task, is loaded in by llama-swap and starts reasoning to build a plan. Generally it might end up launching something like 2 explorers and one coder agent in parallel, commonly as 'background workers'. This sends several incoming requests to llama-swap for different model, including the orchestrator model. The current behavior of the queue is something like FIFO, and the first request arriving for Model2 is just a system prompt, as soon as that is done processing it's swap out for model3 which was some other background task, which is then quickly swapped out back to the orchestrator model1 which tends to just give up on the agents after a while since no progress is made. I guess the TLDR is that the majority of the time is spent just swapping model, with little actual processing happening at all.

Therefore, here is a feature implementation that supports configurable priority per model (higher priority wins). This way, it's possible to configure for instance the explorer model with priority 3, coder model with priority 2 and orchestrator model with priority 1. What happens then in the scenario above is instead;

  • Orchestrator starts with some task, is loaded in by llama-swap and starts reasoning to build a plan.
  • Orchestrator launches two explorers and one coder
  • Since the explorer model has the highest priority, llama-swap will keep processing requests for this model until it has finished its tasks
  • The coder model has second highest priority, and will be loaded in next and be able to fully finish its task
  • Finally the orchestrator is loaded in again and can gather the result of all agents.

I had to add one more thing to make this work well though; unloadDelay. This works as a 'debounce' feature that keeps models with higher priority loaded for a configurable time if all other requests in the queue has lower priority. If a new request arrive for the currently loaded model before unload delay has expired, the timer resets and the model stays loaded and process the request. If a request for a higher priority model arrives though, the unloadDelay is ignored. This was required as there is usually a small delay in the agentic frameworks before the next request is triggered, and that was causing similar problems as the original situation. I have unloadDelay set to 5 for my agent models which works great.

I'm not sure unloadDelay is a great term, it might be better with debounce, debounceTimer, debounceDelay or something to keep it clearly separate from ttl which works differently. Open to suggestions.

I'm not used to Golang and had Claude code helped me implement this, I have reviewed the code to the best of my ability and also tested the feature manually. But in case there is any slop in the code I apologize in advance.

This change introduces configurable priority for models and an unload
delay to improve performance when used with agentic frameworks.

Previously, the FIFO queueing could lead to frequent model swapping,
hindering progress when multiple agents (e.g., explorer, coder,
orchestrator) were running concurrently.

This implementation allows assigning priority to each model. Higher
priority models are favored, remaining loaded and processing requests
until their tasks are complete.

An `unloadDelay` is also introduced. This keeps high-priority models
loaded for a configurable duration even if no new requests arrive,
preventing premature swapping due to minor delays in agentic workflows.
Higher priority requests will always interrupt the delay.

This addresses performance issues observed with frameworks like OpenCode
where background workers frequently request different models.
@coderabbitai

coderabbitai Bot commented Mar 29, 2026

Copy link
Copy Markdown

Walkthrough

Adds two per-model config fields, priority and unloadDelay, updates schema, examples and docs, extends ModelConfig with new exported fields, and refactors process-group swap coordination to a priority- and unload-delay–aware synchronization scheme with accompanying tests.

Changes

Cohort / File(s) Summary
Configuration Schema & Examples
config-schema.json, config.example.yaml, docs/configuration.md
Introduce models.<id>.priority (integer, >=0, default 0) and models.<id>.unloadDelay (integer, >=0, default 0); update schema, example config, and documentation describing priority ordering and unload-delay semantics.
Model Config Types
proxy/config/model_config.go
Add exported Priority int \yaml:"priority"`andUnloadDelay int `yaml:"unloadDelay"`fields toModelConfig` for YAML unmarshalling (zero-value defaults).
Process Group Scheduling
proxy/processgroup.go
Refactor swap handling to priority-aware coordination: replace lastUsedProcess with activeModel, inFlight, pendingCount, unloadDelayCh; add mutex+cond synchronization, logic for priority preemption/fairness, unload-delay timers, and cancellation on shutdown.
Tests
proxy/processgroup_test.go
Add unit and concurrency tests exercising priority selection, preemption behavior, unload-delay hold and bypass cases, timer reset semantics, equal-priority fairness, and zero-delay behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

enhancement, configuration

Suggested reviewers

  • mostlygeek
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main changes: adding configurable priority and unload delay features for model scheduling.
Description check ✅ Passed The description is directly related to the changeset, providing detailed context about the motivation, implementation approach, and feature behavior across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
proxy/processgroup.go (1)

294-322: Lock held during parallel process stops may cause contention.

StopProcesses holds pg.mu for the entire duration of the parallel stop operation via defer. While the processes are being stopped in parallel (which could take time), other goroutines calling ProxyRequest will block on pg.mu.Lock().

This is likely acceptable since StopProcesses is typically called during shutdown or group unloading, but worth noting for awareness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/processgroup.go` around lines 294 - 322, StopProcesses currently holds
pg.mu for the entire parallel shutdown (defer unlocking after wg.Wait), causing
contention; fix by copying pg.processes to a local slice while holding pg.mu,
cancelUnloadDelayLocked()/clear activeModel()/cond.Broadcast() and unlock pg.mu
before starting the goroutines, then run the parallel Stop/StopImmediately on
the local slice, wait for wg, and finally re-acquire pg.mu if any post-stop
state mutation is required; references: StopProcesses, pg.mu, pg.processes,
pg.cancelUnloadDelayLocked, pg.activeModel, pg.cond.Broadcast, Process.Stop,
Process.StopImmediately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/processgroup.go`:
- Line 33: The struct's swap coordination state block has misaligned field
spacing (gofmt fails); adjust the spacing so all field names and their types
line up consistently — e.g., in the block containing unloadDelayCh (chan
struct{}) and the other three fields, align the type columns so each field
follows the same spacing convention used elsewhere in the file (ensure names
like unloadDelayCh align with the other three field names and their types), then
run gofmt to confirm the layout is correct.

---

Nitpick comments:
In `@proxy/processgroup.go`:
- Around line 294-322: StopProcesses currently holds pg.mu for the entire
parallel shutdown (defer unlocking after wg.Wait), causing contention; fix by
copying pg.processes to a local slice while holding pg.mu,
cancelUnloadDelayLocked()/clear activeModel()/cond.Broadcast() and unlock pg.mu
before starting the goroutines, then run the parallel Stop/StopImmediately on
the local slice, wait for wg, and finally re-acquire pg.mu if any post-stop
state mutation is required; references: StopProcesses, pg.mu, pg.processes,
pg.cancelUnloadDelayLocked, pg.activeModel, pg.cond.Broadcast, Process.Stop,
Process.StopImmediately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3aee4fa6-fcc0-4a45-8e39-d31cbd6a2565

📥 Commits

Reviewing files that changed from the base of the PR and between 1e44077 and 98117e4.

📒 Files selected for processing (6)
  • config-schema.json
  • config.example.yaml
  • docs/configuration.md
  • proxy/config/model_config.go
  • proxy/processgroup.go
  • proxy/processgroup_test.go

Comment thread proxy/processgroup.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/processgroup.go (1)

294-322: ⚠️ Potential issue | 🔴 Critical

Critical: Deadlock when stopping processes with in-flight requests.

StopProcesses holds pg.mu while calling wg.Wait() (line 321). If an in-flight request is being served (at line 162), it will try to acquire pg.mu at line 164 after completing. Since process.Stop() waits for in-flight requests via its internal WaitGroup, and the in-flight request can't acquire the lock to decrement pg.inFlight, the result is a deadlock.

This affects swapProcessGroup() in proxymanager.go which calls StopProcesses when handling exclusive groups.

🔒 Proposed fix: Release lock during process stop operations
 func (pg *ProcessGroup) StopProcesses(strategy StopStrategy) {
 	pg.mu.Lock()
-	defer func() {
-		pg.cancelUnloadDelayLocked()
-		pg.activeModel = ""
-		pg.cond.Broadcast()
-		pg.mu.Unlock()
-	}()
 
 	if len(pg.processes) == 0 {
+		pg.mu.Unlock()
 		return
 	}
 
+	pg.cancelUnloadDelayLocked()
+	pg.activeModel = ""
+	pg.mu.Unlock()
+
 	// stop Processes in parallel
 	var wg sync.WaitGroup
 	for _, process := range pg.processes {
 		wg.Add(1)
 		go func(process *Process) {
 			defer wg.Done()
 			switch strategy {
 			case StopImmediately:
 				process.StopImmediately()
 			default:
 				process.Stop()
 			}
 		}(process)
 	}
 	wg.Wait()
+
+	pg.mu.Lock()
+	pg.cond.Broadcast()
+	pg.mu.Unlock()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/processgroup.go` around lines 294 - 322, StopProcesses currently holds
pg.mu while waiting for process.Stop()/StopImmediately() which can deadlock;
change it to copy pg.processes into a local slice while holding pg.mu, then
release pg.mu before launching goroutines and waiting on the WaitGroup, and
after wg.Wait() re-acquire pg.mu to call cancelUnloadDelayLocked(), clear
pg.activeModel, and pg.cond.Broadcast(). Use the same symbols to locate code:
ProcessGroup.StopProcesses, pg.mu, pg.processes, process.Stop/StopImmediately,
cancelUnloadDelayLocked(), pg.activeModel, and pg.cond.Broadcast().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@proxy/processgroup.go`:
- Around line 294-322: StopProcesses currently holds pg.mu while waiting for
process.Stop()/StopImmediately() which can deadlock; change it to copy
pg.processes into a local slice while holding pg.mu, then release pg.mu before
launching goroutines and waiting on the WaitGroup, and after wg.Wait()
re-acquire pg.mu to call cancelUnloadDelayLocked(), clear pg.activeModel, and
pg.cond.Broadcast(). Use the same symbols to locate code:
ProcessGroup.StopProcesses, pg.mu, pg.processes, process.Stop/StopImmediately,
cancelUnloadDelayLocked(), pg.activeModel, and pg.cond.Broadcast().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 91c84061-57d5-4e40-9f0f-35a9f06614a4

📥 Commits

Reviewing files that changed from the base of the PR and between 98117e4 and b44c8c9.

📒 Files selected for processing (1)
  • proxy/processgroup.go

@mostlygeek

Copy link
Copy Markdown
Owner

Hi,

You're right that llama-swap has a basic FIFO queue and with agents that make use of different models for sub agents it can cause a lot of thrashing. I use Claude Code and it uses haiku for smaller tasks like summaries.

What coding agent and models are you running?

I think the right approach is to introduce some choice into the queuing strategy used by llama-swap. The basic FIFO can remain the default and a new FIFO+ with collation of requests can be used.

With the new strategy this request queue:

A A B C B A B

would turn into:

A A A B B B C

that way the time swapping is minimized.

To prevent starvation for C a max wait time can help push it higher into the queue.

@MannerMan

Copy link
Copy Markdown
Author

Hi,

You're right that llama-swap has a basic FIFO queue and with agents that make use of different models for sub agents it can cause a lot of thrashing. I use Claude Code and it uses haiku for smaller tasks like summaries.

What coding agent and models are you running?

I think the right approach is to introduce some choice into the queuing strategy used by llama-swap. The basic FIFO can remain the default and a new FIFO+ with collation of requests can be used.

With the new strategy this request queue:

A A B C B A B

would turn into:

A A A B B B C

that way the time swapping is minimized.

To prevent starvation for C a max wait time can help push it higher into the queue.

I use mainly use two coding tools, Roo code extension in vscode and opencode with oh-my-opencode-slim plugin; Roo is working very well with the current llama-swap behavior because its subtasks are performed serially, i.e. the orchestrator starts a subtask which is delegated to a specialized agent, and Roo does not 'yield' back to the orchestrator until the agent is finished.

Opencode with the plugin is a bit more powerful in that it supports launching background agents in parallel (maybe Roo can do this too, not sure). I tried the regular oh-my-opencode plugin but the prompts are too complex for the local models I use, however the -slim variant of the plugin works well.

My setup is 4x RTX 3060 12gb and I mainly use a combination of these models;

  • devstral-small-2-24b (solid coder)
  • qwen-3.5-27b (typical main orchestrator, smartest model of the size)
  • qwen-3.5-35b-a3b
  • qwen-coder-next (in Q4 quant)
  • Ministral-3-14b (gives room for large parallel contexts, good for parallel subagents exploring code etc)

--

I think the suggested FIFO+ is elegant from the user perspective (a single toggle on/off to get the behavior). Max wait could also potentially help making sure agent models finish before the orchestrator continues, if being able to set this on a per-model basis.

I do however also see some cases where the proposed FIFO+ could be sub-optimal, for instance the case of a orchestrator launching only one background agent; This would likely cause excessive swapping behavior without some sort of debounce feature since the single background agent will not queue up several requests at once for its model. Or it could potentially keep the orchestrator loaded for too long causing it to give up on the agent.

I do think being able to configure the priority of individual models in the swapping queue could be very useful for advanced users; This enables control over when swapping happens or not, which can be very useful in different pipelines. I can see the issue of some models 'never' being swapped in though, but maybe that's a problem for said advanced users to deal with :)

Maybe the FIFO+ option can also support setting manual priority for models to get the best of both?

Or having a global queueBehavior option that can be simple/smart/priority for instance.

@Luiszzzor

Copy link
Copy Markdown
Contributor

hey @MannerMan @mostlygeek,

are you planning to merge this, refactor or create other complementary solution for queues and priorities? :) This is the only thing lacking in llama-swap from my perspective. I would love to see it officially 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants