Conversation

Collaborator

ServeurpersoCom commented Nov 30, 2025

Designed to:

  • Work out of the box (runs like llama-server, HF Hub download included)
  • Be fully compatible with existing code, no legacy or technical debt
  • Satisfy advanced users (per-model config, other backends like vLLM)
  • Enable SvelteUI to evolve into an HF Hub Browser and per-model configuration UI (LM Studio style)
  • Offer an alternative approach to the legendary #17470 (server: introduce API for serving / loading / unloading multiple models)
  • Be expandable

TODO:

  • Windows and macOS testing
  • Integrate SvelteUI like llama-server does (to work out of the box)
  • Complete the /admin endpoint (Download / Delete / Config...)
  • Config templates (default settings with runtime per-model overrides)
  • Import the llama-swap config.yaml (migration path)
  • Optional Weight Loading Indicator inside the delta.reasoning_content
  • Reuse existing SSE code from server-http.h?
  • Stress-testing tool covering all process edge cases (or rewrite a real state-machine core, overkill?)
  • ... ?

Closes #16487

Auto-spawns llama-server instances on /v1/chat/completions requests.
Discovers GGUF models in cache, allocates ports dynamically, manages
process lifecycle. Integrates HF download for zero-config deployment.

Makes the WebUI model selector and distributed serving plug-and-play.
Configures the common logger before main() to prevent crashes from
early LOG_*() calls. Console-only, matching llama-server behavior.
Define spawn args and router options in C++ instead of JSON config.
Scan HF manifest files to auto-link mmproj with vision models.

- get_default_spawn() and get_default_router_options() as single source
- Remove orphaned log_level field (logging uses static init)
- Use fs_get_cache_directory() for portability
- Port 8082, base_port 50000 (dynamic range), host 127.0.0.1 (secure default); see the sketch below
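
For illustration, these defaults might map to a config along the following lines (the JSON layout is an assumption; only the values quoted in the list above come from this PR):

{
  "router": {
    "host": "127.0.0.1",
    "port": 8082,
    "base_port": 50000
  }
}
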
Implement an optional 'group' field for models to control cohabitation.
Models without a group default to their own name (unique) for swap exclusivity.
Models with the same explicit group can run concurrently.

- get_model_group() returns group or name as fallback
- ensure_running() kills only models outside target group
- Log group transitions for visibility
- Example: 3 light models with group='concurrent' cohabit, while
  heavy models without a group swap exclusively

Enables flexible VRAM sharing: exclusive heavy models, concurrent
light models, or mixed configurations per user needs. A minimal config
sketch follows.
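
Assuming a top-level models map keyed by model name (the nesting is illustrative; only the group field comes from this PR), the cohabitation example could look like:

{
  "models": {
    "light-model-a": { "group": "concurrent" },
    "light-model-b": { "group": "concurrent" },
    "light-model-c": { "group": "concurrent" },
    "heavy-model": { }
  }
}

The three light models share the 'concurrent' group and can run side by side; heavy-model has no group, so it falls back to its own name and swaps exclusively.
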
Add /props, /slots, /health at root that proxy to last spawned model.
Enables backward compatibility for existing llama-server clients.

- Track last_spawned_model in RouterApp with thread-safe getter
- Shared lambda proxy_last_spawned for /props, /slots, /health
- Return 503 'no models running' if empty
- Modern per-model endpoints (/{model}/props) remain available

WebUI and other llama-server clients work without modification
when running a single model or after the initial spawn.
Add server-sent events streaming for real-time chat completions.
Implement graceful shutdown, admin authentication, and hardened config.

Streaming infrastructure:
- Detect stream requests via the Accept header or JSON stream:true (example below)
- StreamState with mutex/cv for thread-safe chunk queueing
- set_chunked_content_provider for incremental SSE relay
- Dedicated upstream thread prevents blocking on slow clients
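
A streaming request body might look like this (the model name is a placeholder; the OpenAI-compatible shape is standard, and stream:true is the field checked here):

{
  "model": "some-model",
  "stream": true,
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}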

Production hardening:
- Signal handlers (SIGINT/SIGTERM) for clean shutdown
- Separate server thread + atomic shutdown flag
- wait_for_process_exit() prevents port conflicts on model swap
- Admin endpoints require Bearer token or X-Admin-Token header
- Subprocess stdout/stderr captured to log_dir per model

Config validation:
- Port range checks (1-65535) at load time
- Model path existence verification
- Spawn command executability validation (absolute paths)
- Auto-create log_dir with proper error handling

Configurable timeouts:
- connection_timeout_s and read_timeout_s in router section
- Defaults: 5s connect, 600s read (10min for large generations)

Enables zero-downtime deploys via /admin/reload, debugging model
crashes via per-model logs, and a seamless WebUI streaming UX.
Eliminates race conditions from concurrent model lifecycle ops.
These timeout fields are sketched below.
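
A hedged sketch (the JSON layout is assumed; the field names and defaults are the ones listed above):

{
  "router": {
    "connection_timeout_s": 5,
    "read_timeout_s": 600
  }
}
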
A static constructor called router_log_init() before main(), invoking
common_log_main() while its singleton may not yet exist (cross-TU
initialization order is undefined); the explicit call in main() is already present.
…end ready

Resolve llama-server from router binary location instead of PATH.
Capture subprocess stdout/stderr on both platforms. Block spawn
until backend reports /health 200 to prevent client 502 errors.

Binary detection:
- Linux: readlink(/proc/self/exe) + parent + llama-server
- Windows: GetModuleFileNameA() + parent + llama-server.exe
- Fallback to PATH if detection fails

Log capture:
- Windows: STARTUPINFO hStdOutput/hStdError with inherited handles
- Linux: dup2 on child process stdout/stderr

Health check:
- Poll /health every 500ms after spawn with 60s timeout
- Cleanup process/ports on readiness failure
- Prevents proxying to backends still loading model into VRAM
…lifecycle monitoring

- Replace log file redirection with native pipe-based stdout/stderr capture
- Launch dedicated threads (Windows: ReadFile/WriteFile, Linux: read/write)
  for real-time output forwarding to parent process
- Add process health checks during backend readiness wait (detect early exits)
- Fix fork() safety: remove LOG_* calls in child, use raw write() for diagnostics
- Implement proper cleanup: join I/O threads, close pipe handles/fds
- Add verbose progress logging during backend startup (1s intervals)
- Reduce timeouts: 10s readiness, 1s graceful shutdown, 200ms health polls
- Add move semantics to ProcessHandle for proper thread ownership transfer

This achieves plug-and-play sibling binary execution (llama-router spawns
llama-server from the same directory) with full output visibility, matching
the behavior of Node.js child_process stdout.pipe(process.stdout).
…dpoints

- Introduce SpawnConfig struct: command, proxy_endpoints, health_endpoint
- Replace vector<string> default_spawn with full SpawnConfig
- Support per-model spawn override (vLLM, TGI, etc. alongside llama.cpp)
- Implement prefix-based endpoint filtering (simple startswith, no wildcards)
- Health endpoint now configurable per spawn config
- Validate spawn commands and proxy endpoints before execution

The default config enables the /v1/, /health, /slots, and /props endpoints.
A single router can now manage heterogeneous inference backends; a
per-model override is sketched below.
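
The following is illustrative only: the JSON nesting, the 'spawn' key, and the vLLM command line are assumptions, while command, proxy_endpoints, and health_endpoint are the SpawnConfig fields introduced here.

{
  "models": {
    "my-vllm-model": {
      "spawn": {
        "command": ["vllm", "serve", "my-org/my-model"],
        "proxy_endpoints": ["/v1/", "/health"],
        "health_endpoint": "/health"
      }
    }
  }
}
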
- Fix use-after-free: capture request data by value (path, method, body)
  instead of by reference, to avoid stack-variable invalidation when
  proxy_request() returns while the upstream thread is still running
- Use shared_ptr<httplib::Client> to ensure lifetime during async ops
- Fix streaming termination: explicitly call sink.done() when upstream
  completes to signal httplib connection closure (fixes infinite hang)
- Add unlock() before all provider returns to prevent mutex deadlock
- Handle spurious wakeups: pause and retry when the queue is temporarily empty

Auto-rescan models on startup:
- Scan cache directory and add new .gguf files as 'auto' models
- Remove 'auto' models no longer present in cache
- Never touch 'manual' models (user-managed configuration)
- Preserve custom spawn/group settings for existing models
- New /admin/rescan endpoint for on-demand rescanning

Separate admin endpoints:
- Extract /admin routes to router-admin.cpp/h
- Clean separation: router-endpoints.cpp = public API only
- Add RouterApp::update_config() for live config updates
- Support both Bearer token and X-Admin-Token header auth

Fixes:
- Fix /model/(health|props|slots) path rewriting for backends
- Thread-safe streaming: eliminate parent scope captures
- Robust JSON parsing for 'stream' field detection
- Simplified signal handlers (remove redundant stop_all)
- Initialize the logger before any LOG_* calls

New CLI flag --import-dir <path> recursively scans local directories
and imports GGUF models in manual state (spawned on demand only).

Features:
- Smart mmproj detection: skips mmproj files as standalone models
- Auto-associates mmproj to models in same directory
- Priority: BF16 > F16 > F32 when multiple mmproj variants exist
- All quants of same model share the same prioritized mmproj
- Idempotent: won't duplicate existing models on re-import
- Manifest-optional: works without HF manifests for local collections

Fixes:
- Robust manifest handling: no crash if manifest JSON missing
- PATH binary check: only validates paths with separators

Example directory structure:
/mnt/models/
├─ unsloth/
│  ├─ Qwen3-VL-32B-Instruct-GGUF/
│  │  ├─ Qwen3-VL-32B-Instruct-Q4_K_M.gguf ─┐
│  │  ├─ Qwen3-VL-32B-Instruct-Q5_K_M.gguf ─┼─> all use mmproj-BF16.gguf
│  │  ├─ Qwen3-VL-32B-Instruct-Q6_K.gguf   ─┘
│  │  ├─ mmproj-BF16.gguf <- priority 1 (selected)
│  │  ├─ mmproj-F16.gguf  <- priority 2
│  │  └─ mmproj-F32.gguf  <- priority 3
│  └── DeepSeek-R1-Distill-Qwen-32B-GGUF/
│      ├─ DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf
│      └─ DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
└── bartowski/
    └─ Valkyrie-49B-v2-GGUF/
       ├─ Valkyrie-49B-v2-Q4_K_M.gguf
       └─ Valkyrie-49B-v2-IQ4_NL.gguf

Usage:
  llama-router --import-dir /mnt/models/
  llama-router --import-dir ~/my-gguf-collection

All imported models are set to manual state (never auto-removed by rescan).

Poll process_running() at ROUTER_POLL_INTERVAL_MS until exit is confirmed.
Ensures VRAM is freed before hot-swap. Update docs to use the constant names.

Rename ROUTER_POLL_INTERVAL_MS -> ROUTER_PROCESS_POLL_INTERVAL_MS:
PROCESS = OS operations (PID, fork, kill)
BACKEND = HTTP operations (health, readiness)
Adjust timeouts: 2s shutdown, 60s ready, 100ms polls
ServeurpersoCom self-assigned this Nov 30, 2025
ServeurpersoCom added the testing (Everything test related) and need feedback (Testing and feedback with results are needed) labels Nov 30, 2025
Collaborator

ngxson commented Nov 30, 2025

There seem to be quite a few TODOs; would you mind switching the PR to "draft"?

Besides, I think we should also publicly mention our discussion so far around this: in terms of functionality, this is good and is more likely to be an extension of #17470; we will ultimately implement these ideas in llama-server in the future (to avoid duplicated effort).

Also, I just want to mention one concern: a big PR usually comes with security risks, so we actually want to deliver a limited set of features at a time (this is part of the reason why #17470 is missing these functionalities, even though I technically could add them all right now).

ServeurpersoCom marked this pull request as draft November 30, 2025 21:29
Collaborator Author

ServeurpersoCom commented Nov 30, 2025

Yes, as we discussed, it works standalone and decoupled (basically a llama-swap in C++ with some llama.cpp and HF integration to work out of the box), but we can adapt it however you want.

For the /admin endpoint, security barriers are needed:

  • Authentication (prevents DoS via resource allocation)
  • Strict parameter filtering (maintainable whitelist synced with llama-server params)
  • Better alternative to cmdline args? (ongoing refactor?)

Implement a notification sink to stream lifecycle events during model
swaps. Notifications are sent via delta.reasoning_content (OpenAI-compatible).

Progress events emitted during ensure_running():
- Unloading previous model(s)
- Loading new model
- Backend ready confirmation

Refactor proxy_request() to handle ensure_running() with optional
sink attachment for streaming feedback

{
  "router": {
    "notify_model_swap": true
  }
}
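
For illustration, a swap notification relayed to a streaming client might look like the chunk below (the notification text and surrounding fields are assumptions; only the use of delta.reasoning_content is specified above):

{
  "object": "chat.completion.chunk",
  "choices": [
    {
      "index": 0,
      "delta": { "reasoning_content": "Loading new model..." }
    }
  ]
}
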
Implement optional model preloading at router boot via the startup_model field
in the configuration. The model is spawned synchronously before the HTTP server
starts, ensuring /props and other endpoints work immediately (a config sketch
follows the lists below).

Configuration changes:
- Add startup_model field to RouterConfig struct
- Validate startup_model exists in configured models during load_config()
- Fail fast on startup if configured model cannot be spawned

Runtime behavior:
- If startup_model is empty, retain pure on-demand behavior
- If set, call ensure_running() before server.listen()
- HTTP server starts only after model reports ready
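
A minimal sketch of the preload setting, assuming startup_model sits in the router section (the nesting is an assumption) and names one of the configured models:

{
  "router": {
    "startup_model": "my-default-model"
  }
}
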
When downloading a model via -hf and no startup_model is configured,
automatically set the downloaded model as startup_model. This provides a
plug-and-play experience: the first download preloads automatically, making
/props and other endpoints work immediately without manual configuration.

Subsequent downloads leave startup_model unchanged to preserve the user's choice.

github-actions bot added the server label Dec 1, 2025

Remove automatic --model/--port/--host appending in favor of $path,
$port, and $host placeholders in spawn commands. All parameters are now
visible in the configuration for full transparency and flexibility.
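
A sketch of a spawn command using the placeholders (the surrounding flags are illustrative llama-server options; only $path, $port, and $host are defined by this change):

{
  "spawn": {
    "command": ["llama-server", "-m", "$path", "--host", "$host", "--port", "$port"]
  }
}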