llama-router, the C++ "llama-swap" for llama.cpp #17629
base: master
Conversation
Auto-spawns llama-server instances on /v1/chat/completions requests. Discovers GGUF models in the cache, allocates ports dynamically, and manages the process lifecycle. Integrated HF download for zero-config deployment. Makes the WebUI model selector and distributed serving plug-and-play.
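For illustration, the body of a /v1/chat/completions request that would trigger an auto-spawn might look like the sketch below; the model name is a hypothetical example, and routing by the request's model field is inferred from the description above.
  {
    "model": "Qwen3-VL-32B-Instruct-Q4_K_M",
    "messages": [{ "role": "user", "content": "Hello" }]
  }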
Configures common logger before main() to prevent crashes from early LOG_*() calls. Console-only, matches llama-server behavior
Define spawn args and router options in C++ instead of JSON config. Scan HF manifest files to auto-link mmproj with vision models.
- get_default_spawn() and get_default_router_options() as single source
- Remove orphaned log_level field (logging uses static init)
- Use fs_get_cache_directory() for portability
- Port 8082, base_port 50000 (dynamic range), host 127.0.0.1 (secure default)
Implement optional 'group' field for models to control cohabitation. Models without a group default to their own name (unique), giving swap exclusivity; models sharing an explicit group can run concurrently.
- get_model_group() returns the group, falling back to the model name
- ensure_running() kills only models outside the target group
- Log group transitions for visibility
- Example: 3 light models with group='concurrent' cohabit, while heavy models without a group swap exclusively
Enables flexible VRAM sharing: exclusive heavy models, concurrent light models, or mixed configurations per user needs; see the config sketch below.
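A hedged sketch of how grouping might appear in the JSON config; the group field and its semantics come from the commit above, while the models-map layout and model names are illustrative assumptions.
  {
    "models": {
      "qwen2.5-1.5b-instruct": { "group": "concurrent" },
      "phi-3-mini": { "group": "concurrent" },
      "gemma-2-2b": { "group": "concurrent" },
      "llama-3.1-70b-q4": { }
    }
  }
Here the three light models share the 'concurrent' group and may cohabit, while the heavy model has no group, defaults to its own name, and swaps exclusively.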
Add root-level /props, /slots, and /health endpoints that proxy to the last spawned model, enabling backward compatibility for existing llama-server clients.
- Track last_spawned_model in RouterApp with thread-safe getter
- Shared lambda proxy_last_spawned for /props, /slots, /health
- Return 503 'no models running' if empty
- Modern per-model endpoints (/{model}/props) remain available
WebUI and other llama-server clients work without modification when running a single model or after the initial spawn.
Add server-sent events streaming for real-time chat completions. Implement graceful shutdown, admin authentication, and hardened config.
Streaming infrastructure:
- Detect stream requests via Accept header or JSON stream:true
- StreamState with mutex/cv for thread-safe chunk queueing
- set_chunked_content_provider for incremental SSE relay
- Dedicated upstream thread prevents blocking on slow clients
Production hardening:
- Signal handlers (SIGINT/SIGTERM) for clean shutdown
- Separate server thread + atomic shutdown flag
- wait_for_process_exit() prevents port conflicts on model swap
- Admin endpoints require Bearer token or X-Admin-Token header
- Subprocess stdout/stderr captured to log_dir per model
Config validation:
- Port range checks (1-65535) at load time
- Model path existence verification
- Spawn command executability validation (absolute paths)
- Auto-create log_dir with proper error handling
Configurable timeouts:
- connection_timeout_s and read_timeout_s in the router section
- Defaults: 5s connect, 600s read (10 min for large generations)
Enables zero-downtime deploys via /admin/reload, debugging of model crashes via per-model logs, and seamless WebUI streaming UX. Eliminates race conditions from concurrent model lifecycle ops; a timeout config sketch is shown below.
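A minimal sketch of the configurable timeouts; the field names connection_timeout_s and read_timeout_s and their defaults come from the commit above, while the surrounding layout is assumed.
  {
    "router": {
      "connection_timeout_s": 5,
      "read_timeout_s": 600
    }
  }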
A static constructor called router_log_init() before main(), invoking common_log_main() while its singleton may not exist yet (undefined cross-TU initialization order); the explicit call in main() is already present.
…end ready
Resolve llama-server from the router binary location instead of PATH. Capture subprocess stdout/stderr on both platforms. Block spawn until the backend reports /health 200 to prevent client 502 errors.
Binary detection:
- Linux: readlink(/proc/self/exe) + parent + llama-server
- Windows: GetModuleFileNameA() + parent + llama-server.exe
- Fallback to PATH if detection fails
Log capture:
- Windows: STARTUPINFO hStdOutput/hStdError with inherited handles
- Linux: dup2 on child process stdout/stderr
Health check:
- Poll /health every 500ms after spawn with 60s timeout
- Clean up process/ports on readiness failure
- Prevents proxying to backends still loading the model into VRAM
…lifecycle monitoring
- Replace log file redirection with native pipe-based stdout/stderr capture
- Launch dedicated threads (Windows: ReadFile/WriteFile, Linux: read/write) for real-time output forwarding to the parent process
- Add process health checks during backend readiness wait (detect early exits)
- Fix fork() safety: remove LOG_* calls in the child, use raw write() for diagnostics
- Implement proper cleanup: join I/O threads, close pipe handles/fds
- Add verbose progress logging during backend startup (1s intervals)
- Reduce timeouts: 10s readiness, 1s graceful shutdown, 200ms health polls
- Add move semantics to ProcessHandle for proper thread-ownership transfer
This achieves plug-and-play sibling binary execution (llama-router spawns llama-server from the same directory) with full output visibility, matching the behavior of Node.js child_process stdout.pipe(process.stdout).
…dpoints
- Introduce SpawnConfig struct: command, proxy_endpoints, health_endpoint
- Replace vector<string> default_spawn with the full SpawnConfig
- Support per-model spawn override (vLLM, TGI, etc. alongside llama.cpp)
- Implement prefix-based endpoint filtering (simple startswith, no wildcards)
- Health endpoint now configurable per spawn config
- Validate spawn commands and proxy endpoints before execution
The default config enables the /v1/, /health, /slots, and /props endpoints. A single router can now manage heterogeneous inference backends; a config sketch follows.
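A hedged sketch of a per-model spawn override built from the SpawnConfig fields named above (command, proxy_endpoints, health_endpoint); the model name, command line, and exact nesting are illustrative assumptions rather than the PR's actual schema.
  {
    "models": {
      "my-vllm-model": {
        "spawn": {
          "command": ["vllm", "serve", "my-org/my-model"],
          "proxy_endpoints": ["/v1/", "/health"],
          "health_endpoint": "/health"
        }
      }
    }
  }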
- Fix use-after-free: capture request data by value (path, method, body) instead of by reference, to avoid stack-variable invalidation when proxy_request() returns while the upstream thread is still running
- Use shared_ptr<httplib::Client> to ensure lifetime during async ops
- Fix streaming termination: explicitly call sink.done() when upstream completes to signal httplib connection closure (fixes infinite hang)
- Add unlock() before all provider returns to prevent mutex deadlock
- Handle spurious wakeups: pause and retry when the queue is temporarily empty
Auto-rescan models on startup:
- Scan the cache directory and add new .gguf files as 'auto' models
- Remove 'auto' models no longer present in cache
- Never touch 'manual' models (user-managed configuration)
- Preserve custom spawn/group settings for existing models
- New /admin/rescan endpoint for on-demand rescanning
Separate admin endpoints:
- Extract /admin routes to router-admin.cpp/h
- Clean separation: router-endpoints.cpp = public API only
- Add RouterApp::update_config() for live config updates
- Support both Bearer token and X-Admin-Token header auth
Fixes:
- Fix /model/(health|props|slots) path rewriting for backends
- Thread-safe streaming: eliminate parent-scope captures
- Robust JSON parsing for 'stream' field detection
- Simplified signal handlers (remove redundant stop_all)
- Initialize logger before any LOG_* calls
New CLI flag --import-dir <path> recursively scans local directories
and imports GGUF models as manual state (spawn on-demand only)
Features:
- Smart mmproj detection: skips mmproj files as standalone models
- Auto-associates mmproj to models in same directory
- Priority: BF16 > F16 > F32 when multiple mmproj variants exist
- All quants of same model share the same prioritized mmproj
- Idempotent: won't duplicate existing models on re-import
- Manifest-optional: works without HF manifests for local collections
Fixes:
- Robust manifest handling: no crash if manifest JSON missing
- PATH binary check: only validates paths with separators
Example directory structure:
/mnt/models/
├─ unsloth/
│ ├─ Qwen3-VL-32B-Instruct-GGUF/
│ │ ├─ Qwen3-VL-32B-Instruct-Q4_K_M.gguf ─┐
│ │ ├─ Qwen3-VL-32B-Instruct-Q5_K_M.gguf ─┼─> all use mmproj-BF16.gguf
│ │ ├─ Qwen3-VL-32B-Instruct-Q6_K.gguf ─┘
│ │ ├─ mmproj-BF16.gguf <- priority 1 (selected)
│ │ ├─ mmproj-F16.gguf <- priority 2
│ │ └─ mmproj-F32.gguf <- priority 3
│ └── DeepSeek-R1-Distill-Qwen-32B-GGUF/
│ ├─ DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf
│ └─ DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
└── bartowski/
└─ Valkyrie-49B-v2-GGUF/
├─ Valkyrie-49B-v2-Q4_K_M.gguf
└─ Valkyrie-49B-v2-IQ4_NL.gguf
Usage:
llama-router --import-dir /mnt/models/
llama-router --import-dir ~/my-gguf-collection
All imported models are set to manual state (never auto-removed by rescan)
…stem architecture
Poll process_running() at ROUTER_POLL_INTERVAL_MS until exit confirmed. Ensures VRAM freed before hot-swap. Update docs to use constant names
Rename ROUTER_POLL_INTERVAL_MS -> ROUTER_PROCESS_POLL_INTERVAL_MS.
- PROCESS = OS operations (PID, fork, kill)
- BACKEND = HTTP operations (health, readiness)
- Adjust timeouts: 2s shutdown, 60s ready, 100ms polls
There seem to be quite a few TODOs; would you mind switching the PR to "draft"? Besides, I think we should also publicly mention our discussion so far around this: in terms of functionality, this is good and is more likely an extension of #17470; we will ultimately implement the ideas into
Also, I just want to mention one concern: a big PR usually comes with security risks, so we actually want to deliver a limited set of features at a time (this is part of the reason why #17470 is missing these functionalities, even though I technically could add them all right now).
Yes, as we discussed, it works standalone and decoupled (basically a llama-swap in C++ with some llama.cpp and HF integration so it works out of the box), but we can adapt it however you want. For the /admin endpoint, security barriers are needed:
Implement notification sink to stream lifecycle events during model
swaps. Notifications sent via delta.reasoning_content (OpenAI-compatible)
Progress events emitted during ensure_running()
- Unloading previous model(s)
- Loading new model
- Backend ready confirmation
Refactor proxy_request() to handle ensure_running() with optional
sink attachment for streaming feedback
{
  "router": {
    "notify_model_swap": true
  }
}
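As an illustration, a streamed chunk carrying such a notification could look like the line below; only the use of delta.reasoning_content is taken from the commit above, while the event wording and the other chunk fields are assumptions.
  data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"reasoning_content": "[router] unloading previous model, loading new model ..."}, "finish_reason": null}]}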
Implement optional model preloading at router boot via a startup_model field in the configuration. The model is spawned synchronously before the HTTP server starts, ensuring /props and other endpoints work immediately.
Configuration changes:
- Add startup_model field to the RouterConfig struct
- Validate that startup_model exists among the configured models during load_config()
- Fail fast on startup if the configured model cannot be spawned
Runtime behavior:
- If startup_model is empty, retain pure on-demand behavior
- If set, call ensure_running() before server.listen()
- The HTTP server starts only after the model reports ready
A config sketch is shown below.
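A minimal sketch of the startup_model setting; the field name comes from the commit above, while its placement in the router section and the model name are illustrative assumptions.
  {
    "router": {
      "startup_model": "qwen2.5-7b-instruct"
    }
  }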
When downloading a model via -hf and no startup_model is configured, automatically set the downloaded model as startup_model. This provides a plug-and-play experience: the first download preloads automatically, making /props and other endpoints work immediately without manual configuration. Subsequent downloads leave startup_model unchanged to preserve the user's choice.
Remove automatic --model/--port/--host appending in favor of $path, $port, and $host placeholders in spawn commands. All parameters are now visible in the configuration for full transparency and flexibility; a sketch follows.
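A hedged sketch of a spawn command using the placeholders; the $path, $port, and $host names and the default_spawn field come from the commits above, while the exact flags and nesting are illustrative assumptions.
  {
    "default_spawn": {
      "command": ["llama-server", "--model", "$path", "--host", "$host", "--port", "$port"]
    }
  }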
Designed to:
TODO:
Closes #16487