llama-router, the C++ "llama-swap" for llama.cpp #17629
base: master
Conversation
Auto-spawns llama-server instances on /v1/chat/completions requests. Discovers GGUF models in the cache, allocates ports dynamically, and manages the process lifecycle. Integrated HF download for zero-config deployment. Makes the WebUI model selector and distributed serving plug-and-play.
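For illustration, the body of a /v1/chat/completions request that would trigger an auto-spawn might look like the sketch below; the model name is a hypothetical example, and routing by the request's model field is inferred from the description above.
  {
    "model": "Qwen3-VL-32B-Instruct-Q4_K_M",
    "messages": [{ "role": "user", "content": "Hello" }]
  }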
Configures common logger before main() to prevent crashes from early LOG_*() calls. Console-only, matches llama-server behavior
Define spawn args and router options in C++ instead of JSON config. Scan HF manifest files to auto-link mmproj with vision models.
- get_default_spawn() and get_default_router_options() as single source
- Remove orphaned log_level field (logging uses static init)
- Use fs_get_cache_directory() for portability
- Port 8082, base_port 50000 (dynamic range), host 127.0.0.1 (secure default)
Implement optional 'group' field for models to control cohabitation. Models without a group default to their own name (unique), giving swap exclusivity; models sharing an explicit group can run concurrently.
- get_model_group() returns the group, falling back to the model name
- ensure_running() kills only models outside the target group
- Log group transitions for visibility
- Example: 3 light models with group='concurrent' cohabit, while heavy models without a group swap exclusively
Enables flexible VRAM sharing: exclusive heavy models, concurrent light models, or mixed configurations per user needs; see the config sketch below.
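A hedged sketch of how grouping might appear in the JSON config; the group field and its semantics come from the commit above, while the models-map layout and model names are illustrative assumptions.
  {
    "models": {
      "qwen2.5-1.5b-instruct": { "group": "concurrent" },
      "phi-3-mini": { "group": "concurrent" },
      "gemma-2-2b": { "group": "concurrent" },
      "llama-3.1-70b-q4": { }
    }
  }
Here the three light models share the 'concurrent' group and may cohabit, while the heavy model has no group, defaults to its own name, and swaps exclusively.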
Add root-level /props, /slots, and /health endpoints that proxy to the last spawned model, enabling backward compatibility for existing llama-server clients.
- Track last_spawned_model in RouterApp with thread-safe getter
- Shared lambda proxy_last_spawned for /props, /slots, /health
- Return 503 'no models running' if empty
- Modern per-model endpoints (/{model}/props) remain available
WebUI and other llama-server clients work without modification when running a single model or after the initial spawn.
Add server-sent events streaming for real-time chat completions. Implement graceful shutdown, admin authentication, and hardened config.
Streaming infrastructure:
- Detect stream requests via Accept header or JSON stream:true
- StreamState with mutex/cv for thread-safe chunk queueing
- set_chunked_content_provider for incremental SSE relay
- Dedicated upstream thread prevents blocking on slow clients
Production hardening:
- Signal handlers (SIGINT/SIGTERM) for clean shutdown
- Separate server thread + atomic shutdown flag
- wait_for_process_exit() prevents port conflicts on model swap
- Admin endpoints require Bearer token or X-Admin-Token header
- Subprocess stdout/stderr captured to log_dir per model
Config validation:
- Port range checks (1-65535) at load time
- Model path existence verification
- Spawn command executability validation (absolute paths)
- Auto-create log_dir with proper error handling
Configurable timeouts:
- connection_timeout_s and read_timeout_s in the router section
- Defaults: 5s connect, 600s read (10 min for large generations)
Enables zero-downtime deploys via /admin/reload, debugging of model crashes via per-model logs, and seamless WebUI streaming UX. Eliminates race conditions from concurrent model lifecycle ops; a timeout config sketch is shown below.
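A minimal sketch of the configurable timeouts; the field names connection_timeout_s and read_timeout_s and their defaults come from the commit above, while the surrounding layout is assumed.
  {
    "router": {
      "connection_timeout_s": 5,
      "read_timeout_s": 600
    }
  }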
A static constructor called router_log_init() before main(), invoking common_log_main() while its singleton may not exist yet (undefined cross-TU initialization order); the explicit call in main() is already present.
…end ready
Resolve llama-server from the router binary location instead of PATH. Capture subprocess stdout/stderr on both platforms. Block spawn until the backend reports /health 200 to prevent client 502 errors.
Binary detection:
- Linux: readlink(/proc/self/exe) + parent + llama-server
- Windows: GetModuleFileNameA() + parent + llama-server.exe
- Fallback to PATH if detection fails
Log capture:
- Windows: STARTUPINFO hStdOutput/hStdError with inherited handles
- Linux: dup2 on child process stdout/stderr
Health check:
- Poll /health every 500ms after spawn with 60s timeout
- Clean up process/ports on readiness failure
- Prevents proxying to backends still loading the model into VRAM
…lifecycle monitoring
- Replace log file redirection with native pipe-based stdout/stderr capture
- Launch dedicated threads (Windows: ReadFile/WriteFile, Linux: read/write) for real-time output forwarding to the parent process
- Add process health checks during backend readiness wait (detect early exits)
- Fix fork() safety: remove LOG_* calls in the child, use raw write() for diagnostics
- Implement proper cleanup: join I/O threads, close pipe handles/fds
- Add verbose progress logging during backend startup (1s intervals)
- Reduce timeouts: 10s readiness, 1s graceful shutdown, 200ms health polls
- Add move semantics to ProcessHandle for proper thread-ownership transfer
This achieves plug-and-play sibling binary execution (llama-router spawns llama-server from the same directory) with full output visibility, matching the behavior of Node.js child_process stdout.pipe(process.stdout).
…dpoints
- Introduce SpawnConfig struct: command, proxy_endpoints, health_endpoint
- Replace vector<string> default_spawn with the full SpawnConfig
- Support per-model spawn override (vLLM, TGI, etc. alongside llama.cpp)
- Implement prefix-based endpoint filtering (simple startswith, no wildcards)
- Health endpoint now configurable per spawn config
- Validate spawn commands and proxy endpoints before execution
The default config enables the /v1/, /health, /slots, and /props endpoints. A single router can now manage heterogeneous inference backends; a config sketch follows.
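A hedged sketch of a per-model spawn override built from the SpawnConfig fields named above (command, proxy_endpoints, health_endpoint); the model name, command line, and exact nesting are illustrative assumptions rather than the PR's actual schema.
  {
    "models": {
      "my-vllm-model": {
        "spawn": {
          "command": ["vllm", "serve", "my-org/my-model"],
          "proxy_endpoints": ["/v1/", "/health"],
          "health_endpoint": "/health"
        }
      }
    }
  }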
- Fix use-after-free: capture request data by value (path, method, body) instead of by reference, to avoid stack-variable invalidation when proxy_request() returns while the upstream thread is still running
- Use shared_ptr<httplib::Client> to ensure lifetime during async ops
- Fix streaming termination: explicitly call sink.done() when upstream completes to signal httplib connection closure (fixes infinite hang)
- Add unlock() before all provider returns to prevent mutex deadlock
- Handle spurious wakeups: pause and retry when the queue is temporarily empty
Auto-rescan models on startup:
- Scan the cache directory and add new .gguf files as 'auto' models
- Remove 'auto' models no longer present in cache
- Never touch 'manual' models (user-managed configuration)
- Preserve custom spawn/group settings for existing models
- New /admin/rescan endpoint for on-demand rescanning
Separate admin endpoints:
- Extract /admin routes to router-admin.cpp/h
- Clean separation: router-endpoints.cpp = public API only
- Add RouterApp::update_config() for live config updates
- Support both Bearer token and X-Admin-Token header auth
Fixes:
- Fix /model/(health|props|slots) path rewriting for backends
- Thread-safe streaming: eliminate parent-scope captures
- Robust JSON parsing for 'stream' field detection
- Simplified signal handlers (remove redundant stop_all)
- Initialize logger before any LOG_* calls
New CLI flag --import-dir <path> recursively scans local directories
and imports GGUF models as manual state (spawn on-demand only)
Features:
- Smart mmproj detection: skips mmproj files as standalone models
- Auto-associates mmproj to models in same directory
- Priority: BF16 > F16 > F32 when multiple mmproj variants exist
- All quants of same model share the same prioritized mmproj
- Idempotent: won't duplicate existing models on re-import
- Manifest-optional: works without HF manifests for local collections
Fixes:
- Robust manifest handling: no crash if manifest JSON missing
- PATH binary check: only validates paths with separators
Example directory structure:
/mnt/models/
├─ unsloth/
│ ├─ Qwen3-VL-32B-Instruct-GGUF/
│ │ ├─ Qwen3-VL-32B-Instruct-Q4_K_M.gguf ─┐
│ │ ├─ Qwen3-VL-32B-Instruct-Q5_K_M.gguf ─┼─> all use mmproj-BF16.gguf
│ │ ├─ Qwen3-VL-32B-Instruct-Q6_K.gguf ─┘
│ │ ├─ mmproj-BF16.gguf <- priority 1 (selected)
│ │ ├─ mmproj-F16.gguf <- priority 2
│ │ └─ mmproj-F32.gguf <- priority 3
│ └── DeepSeek-R1-Distill-Qwen-32B-GGUF/
│ ├─ DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf
│ └─ DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
└── bartowski/
└─ Valkyrie-49B-v2-GGUF/
├─ Valkyrie-49B-v2-Q4_K_M.gguf
└─ Valkyrie-49B-v2-IQ4_NL.gguf
Usage:
llama-router --import-dir /mnt/models/
llama-router --import-dir ~/my-gguf-collection
All imported models are set to manual state (never auto-removed by rescan)
…stem architecture
Poll process_running() at ROUTER_POLL_INTERVAL_MS until exit confirmed. Ensures VRAM freed before hot-swap. Update docs to use constant names
Rename ROUTER_POLL_INTERVAL_MS -> ROUTER_PROCESS_POLL_INTERVAL_MS.
- PROCESS = OS operations (PID, fork, kill)
- BACKEND = HTTP operations (health, readiness)
- Adjust timeouts: 2s shutdown, 60s ready, 100ms polls
There seem to be quite a few TODOs; would you mind switching the PR to "draft"? Besides, I think we should also publicly mention our discussion so far around this: in terms of functionality, this is good and is more likely an extension of #17470; we will ultimately implement the ideas into
Also, I just want to mention one concern: a big PR usually comes with security risks, so we actually want to deliver a limited set of features at a time (this is part of the reason why #17470 is missing these functionalities, even though I technically could add them all right now).
Yes, as we discussed, it works standalone and decoupled (basically a llama-swap in C++ with some llama.cpp and HF integration so it works out of the box), but we can adapt it however you want. For the /admin endpoint, security barriers are needed:
Implement notification sink to stream lifecycle events during model
swaps. Notifications sent via delta.reasoning_content (OpenAI-compatible)
Progress events emitted during ensure_running()
- Unloading previous model(s)
- Loading new model
- Backend ready confirmation
Refactor proxy_request() to handle ensure_running() with optional
sink attachment for streaming feedback
{
  "router": {
    "notify_model_swap": true
  }
}
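As an illustration, a streamed chunk carrying such a notification could look like the line below; only the use of delta.reasoning_content is taken from the commit above, while the event wording and the other chunk fields are assumptions.
  data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"reasoning_content": "[router] unloading previous model, loading new model ..."}, "finish_reason": null}]}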
Implement optional model preloading at router boot via a startup_model field in the configuration. The model is spawned synchronously before the HTTP server starts, ensuring /props and other endpoints work immediately.
Configuration changes:
- Add startup_model field to the RouterConfig struct
- Validate that startup_model exists among the configured models during load_config()
- Fail fast on startup if the configured model cannot be spawned
Runtime behavior:
- If startup_model is empty, retain pure on-demand behavior
- If set, call ensure_running() before server.listen()
- The HTTP server starts only after the model reports ready
A config sketch is shown below.
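A minimal sketch of the startup_model setting; the field name comes from the commit above, while its placement in the router section and the model name are illustrative assumptions.
  {
    "router": {
      "startup_model": "qwen2.5-7b-instruct"
    }
  }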
When downloading a model via -hf and no startup_model is configured, automatically set the downloaded model as startup_model. This provides a plug-and-play experience: the first download preloads automatically, making /props and other endpoints work immediately without manual configuration. Subsequent downloads leave startup_model unchanged to preserve the user's choice.
Remove automatic --model/--port/--host appending in favor of $path, $port, and $host placeholders in spawn commands. All parameters are now visible in the configuration for full transparency and flexibility; a sketch follows.
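A hedged sketch of a spawn command using the placeholders; the $path, $port, and $host names and the default_spawn field come from the commits above, while the exact flags and nesting are illustrative assumptions.
  {
    "default_spawn": {
      "command": ["llama-server", "--model", "$path", "--host", "$host", "--port", "$port"]
    }
  }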
Designed to:
TODO:
Closes #16487