Commits
29 commits
07b506d
llama-router: multi-model serving with dynamic backends
ServeurpersoCom Nov 29, 2025
25f1433
llama-router: fix logging init via static constructor
ServeurpersoCom Nov 29, 2025
4cbedbc
llama-router: centralize defaults and add mmproj auto-detection
ServeurpersoCom Nov 29, 2025
cb7c489
llama-router: add process grouping for selective VRAM management
ServeurpersoCom Nov 29, 2025
dbf3250
llama-router: add legacy endpoint support for single-model compat
ServeurpersoCom Nov 29, 2025
70eec73
llama-router: add comprehensive debug logging
ServeurpersoCom Nov 29, 2025
0f090e2
llama-router: implement SSE streaming and production safety features
ServeurpersoCom Nov 29, 2025
dac95e8
llama-router: fix segfault from static initialization order fiasco
ServeurpersoCom Nov 29, 2025
ee94bfc
llama-router: auto-detect sibling binary, capture logs, wait for back…
ServeurpersoCom Nov 29, 2025
e472330
llama-router: implement cross-platform subprocess I/O forwarding and …
ServeurpersoCom Nov 30, 2025
728bccc
llama-router: validate binary before spawn, clean child error handling
ServeurpersoCom Nov 30, 2025
cbcc8a8
llama-router: add multi-engine support with configurable spawn and en…
ServeurpersoCom Nov 30, 2025
635b70d
llama-router: fix SSE streaming termination and use-after-free
ServeurpersoCom Nov 30, 2025
232799a
llama-router: auto-rescan, admin endpoints, and fixes
ServeurpersoCom Nov 30, 2025
7f274d5
llama-router: add --import-dir for custom model collections
ServeurpersoCom Nov 30, 2025
bfb3e62
llama-router: add README with CLI reference and configuration guide
ServeurpersoCom Nov 30, 2025
4bc8f69
llama-router: document KISS philosophy, optimization patterns, and sy…
ServeurpersoCom Nov 30, 2025
b14ea20
llama-router: fix PATH binary support and macOS detection
ServeurpersoCom Nov 30, 2025
c5fdd3a
llama-router: separate quick-start guide from technical architecture …
ServeurpersoCom Nov 30, 2025
cb44f59
llama-router: async polling for process termination after SIGKILL
ServeurpersoCom Nov 30, 2025
85f418d
llama-router: separate PROCESS (OS) and BACKEND (HTTP) polling constants
ServeurpersoCom Nov 30, 2025
41f506a
llama-router: add real-time model swap notifications via SSE
ServeurpersoCom Dec 1, 2025
da65c5f
llama-router: document notify_model_swap feature in README and ARCHIT…
ServeurpersoCom Dec 1, 2025
919e581
llama-router: add embedded WebUI support
ServeurpersoCom Dec 1, 2025
b248838
llama-router: add startup_model configuration option
ServeurpersoCom Dec 1, 2025
6e93322
llama-router: document startup_model in README and ARCHITECTURE
ServeurpersoCom Dec 1, 2025
47408bc
llama-router: auto-configure startup_model on first HF download
ServeurpersoCom Dec 1, 2025
1a014b2
llama-router: add --jinja to default spawn configuration
ServeurpersoCom Dec 1, 2025
d99d952
llama-router: replace implicit arg injection with explicit placeholders
ServeurpersoCom Dec 1, 2025
1 change: 1 addition & 0 deletions tools/CMakeLists.txt
@@ -23,6 +23,7 @@ else()
    add_subdirectory(quantize)
    if (LLAMA_BUILD_SERVER)
        add_subdirectory(server)
        add_subdirectory(router)
    endif()
    add_subdirectory(run)
    add_subdirectory(tokenize)
264 changes: 264 additions & 0 deletions tools/router/ARCHITECTURE.md
@@ -0,0 +1,264 @@
# llama-router Architecture

Technical documentation for developers and contributors.

---

## Design Philosophy

llama-router follows KISS (Keep It Simple, Stupid) principles:

- **Minimal configuration**: Works out-of-box with HF cache scanning
- **Explicit persistence**: Config changes are written explicitly via admin endpoints, never hidden in business logic
- **Separation of concerns**: Core routing logic (`RouterApp`) has zero I/O; persistence is handled by the admin layer
- **Simple endpoint matching**: Prefix-based matching, no complex regex
- **Transparent proxy**: Headers and streaming forwarded as-is
- **On-demand by default**: Models start on first request; `startup_model` can preload a chosen backend at boot
- **Plug-and-play defaults**: Hugging Face downloads set `startup_model` automatically when it is unset
- **Transparent operations**: Optional real-time notifications for swap feedback via SSE

### The auto + default_spawn Workflow

Models discovered from the HuggingFace cache are marked as `auto` and inherit the `default_spawn` configuration. This creates a powerful optimization pattern:

1. **Tune `default_spawn` once** with your preferred parameters (GPU layers, KV cache quantization, context size, etc.)
2. **All `auto` models automatically use these settings** - no per-model configuration needed
3. **Change `default_spawn` and reload** - all `auto` models are instantly updated
4. **Customize individual models** by switching them to the `manual` state first, so rescans do not overwrite your changes

This ensures consistent, optimized behavior across your entire model collection while allowing per-model overrides when needed. **Always set models to `manual` before customizing their spawn parameters** - otherwise your changes will be lost on the next rescan.
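
As a rough illustration of this resolution rule, the sketch below uses nlohmann::json; the field names `state`, `spawn`, and `default_spawn` mirror the description above and are assumptions, not the exact structures in `router-config.h`.

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Pick the spawn configuration that applies to one model entry.
static json effective_spawn(const json & config, const json & model) {
    // "auto" models inherit default_spawn; "manual" models keep their own spawn block.
    if (model.value("state", "auto") == "auto") {
        return config.at("default_spawn");
    }
    return model.at("spawn");
}
```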

## Multi-Engine Support

llama-router is engine-agnostic. Any OpenAI-compatible inference backend can be orchestrated by configuring the appropriate spawn command and endpoints. The router simply:

1. Spawns the command specified in `spawn.command`
2. Polls `health_endpoint` until it returns HTTP 200 (customizable per backend)
3. Proxies requests matching `proxy_endpoints` to the running instance

This design allows you to mix llama.cpp, vLLM, Ollama, Text Generation Inference, or any custom backend in a single router configuration. Set models to `manual` state when using non-llama.cpp backends to prevent automatic cache rescans from removing them.
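
Step 2 above reduces to a simple poll loop. A minimal sketch using cpp-httplib, with the timeout and interval passed in as plain arguments rather than the real `ROUTER_BACKEND_*` constants:

```cpp
#include <httplib.h>
#include <chrono>
#include <string>
#include <thread>

// Poll the configured health endpoint until it answers HTTP 200 or the timeout expires.
static bool wait_for_backend_ready(const std::string & host, int port,
                                   const std::string & health_endpoint,
                                   int timeout_ms, int poll_ms) {
    httplib::Client cli(host, port);
    const auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds(timeout_ms);
    while (std::chrono::steady_clock::now() < deadline) {
        auto res = cli.Get(health_endpoint.c_str());
        if (res && res->status == 200) {
            return true; // backend finished loading and is ready to serve
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(poll_ms));
    }
    return false; // never became ready within the timeout
}
```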

### Future: WebUI Administration (TODO)

The admin API endpoints (`/admin/reload`, `/admin/rescan`) are designed to support hot configuration and model management. A future WebUI will enable:

- **Live model downloads** from HuggingFace directly through the interface
- **Hot reconfiguration** of `default_spawn` and per-model settings without restart
- **Real-time monitoring** of running instances and resource usage
- **Interactive model management** (add, remove, customize spawn parameters)

This aligns with the project philosophy: **everything configurable at runtime, zero downtime required**. The current CLI and JSON-based workflow is production-ready; the WebUI will provide a more accessible interface to the same underlying admin API.

---

## Architecture

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ llama-router │
│ (port 8082) │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Config │ │ Scanner │ │ Process Manager │ │
│ │ Loader │ │ (HF cache) │ │ (spawn/terminate) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ┌─────────────────────────┴────────────────────────────┐ │
│ │ HTTP Proxy │ │
│ │ (streaming support, header forwarding) │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────┬──────────────────────────────────┘
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ llama-server │ │ llama-server │ │ llama-server │
│ (port 50000) │ │ (port 50001) │ │ (port 50002) │
│ Model A │ │ Model B │ │ Model C │
└───────────────┘ └───────────────┘ └───────────────┘
```

### Request Flow

1. Client sends POST to `/v1/chat/completions` with `"model": "ModelA"`
2. Router checks if ModelA is already running
3. If not running, or if a conflicting group is active:
- Terminate conflicting backends
- Spawn new llama-server with assigned port
- Poll `/health` until ready (`ROUTER_BACKEND_READY_TIMEOUT_MS` timeout)
4. Forward request to backend, streaming response back to client
5. Backend remains running for subsequent requests
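
From the client side the whole flow above is a single request. A minimal sketch with cpp-httplib; the model name and port are examples only:

```cpp
#include <httplib.h>
#include <iostream>

int main() {
    httplib::Client cli("127.0.0.1", 8082); // the router, not an individual backend
    const char * body =
        R"({"model": "ModelA", "messages": [{"role": "user", "content": "Hello"}]})";
    // The router spawns or reuses the backend for ModelA before forwarding this request.
    auto res = cli.Post("/v1/chat/completions", body, "application/json");
    if (res) {
        std::cout << res->body << std::endl;
    }
    return 0;
}
```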

### Process Lifecycle

- **Spawn**: `fork()`/`CreateProcess()` with stdout/stderr capture
- **Health polling**: `ROUTER_BACKEND_HEALTH_POLL_MS` intervals, `ROUTER_BACKEND_READY_TIMEOUT_MS` timeout
- **Graceful shutdown**: SIGTERM → wait `ROUTER_PROCESS_SHUTDOWN_TIMEOUT_MS` → SIGKILL → poll every `ROUTER_PROCESS_POLL_INTERVAL_MS` until exit
- **Cleanup**: File descriptors closed, waitpid() called
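
The POSIX half of that sequence could look roughly like the sketch below; the timeouts stand in for the `ROUTER_PROCESS_*` constants and are passed as plain arguments.

```cpp
#include <signal.h>
#include <sys/wait.h>
#include <chrono>
#include <thread>

// SIGTERM first, then SIGKILL after the shutdown timeout, polling until the child exits.
static void terminate_child(pid_t pid, int shutdown_timeout_ms, int poll_interval_ms) {
    kill(pid, SIGTERM);
    const auto deadline = std::chrono::steady_clock::now() +
                          std::chrono::milliseconds(shutdown_timeout_ms);
    while (std::chrono::steady_clock::now() < deadline) {
        if (waitpid(pid, nullptr, WNOHANG) == pid) {
            return; // exited gracefully
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(poll_interval_ms));
    }
    kill(pid, SIGKILL);
    for (;;) {
        const pid_t r = waitpid(pid, nullptr, WNOHANG);
        if (r == pid || r == -1) {
            return; // reaped, or nothing left to wait for
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(poll_interval_ms));
    }
}
```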

---

## File Structure & Separation of Concerns

| Component | Files | Responsibility |
|-----------|-------|----------------|
| **Core** | `router-app.cpp/h` | Model lifecycle, spawn orchestration, group logic, progress notification emission (zero I/O except notifications) |
| **HTTP Endpoints** | `router-endpoints.cpp/h` | Public API routes (`/v1/models`, `/v1/chat/completions`) |
| **Admin** | `router-admin.cpp/h` | Admin routes with explicit config persistence |
| **Proxy** | `router-proxy.cpp/h` | HTTP forwarding, SSE streaming, header management |
| **Process** | `router-process.cpp/h` | Cross-platform subprocess spawning, I/O capture |
| **Config** | `router-config.cpp/h` | JSON load/write, rescan logic, `RescanResult` |
| **Scanner** | `router-scanner.cpp/h` | HF cache discovery, `--import-dir`, mmproj detection |
| **Main** | `router.cpp` | CLI parsing, server setup, signal handlers |
| **Utils** | `logging.cpp/h`, `router-constants.h` | Shared logging and constants |

**Design principles enforced:**
- `router-app`: Pure business logic, no filesystem I/O
- `router-admin`: Owns config persistence, explicit writes only
- `router-proxy`: Streaming & forwarding, value-captured lambdas to avoid use-after-free
- `router-process`: Platform abstraction, child processes never call parent logging functions

---

## Technical Notes

### Cross-Platform Process Management

The router handles subprocess spawning differently per platform:

**Linux/macOS:** Uses `fork()` + `execvp()` with careful attention to post-fork behavior. Child processes **must not** call logging functions that access parent singletons - they write directly to `STDERR_FILENO` instead to avoid use-after-fork crashes.

**Windows:** Uses `CreateProcess()` with separate process information structures and handle management.
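
A minimal sketch of the POSIX path and its post-fork rule; the error message and exit code are illustrative, not the router's exact output:

```cpp
#include <unistd.h>

// Between fork() and execvp() the child only uses async-signal-safe calls such as write();
// it never touches the parent's logging singletons.
static pid_t spawn_backend(char * const argv[]) {
    const pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        // Only reached if exec failed: report directly on STDERR_FILENO and exit.
        const char msg[] = "llama-router: failed to exec backend\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        _exit(127);
    }
    return pid; // parent side; pid < 0 means fork() itself failed
}
```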

### SSE Streaming Implementation

Server-Sent Events streaming required careful lifetime management to avoid use-after-free bugs:

1. **Capture by value**: Lambda captures must copy request data (headers, path, body), not reference stack variables that become invalid after the handler returns
2. **Explicit termination**: Call `sink.done()` followed by `return false` to signal httplib to close the connection properly - without this, streams deliver tokens correctly but never terminate
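
Both rules in a minimal cpp-httplib sketch; the real proxy relays backend chunks instead of the single canned event used here:

```cpp
#include <httplib.h>
#include <string>

static void register_stream_route(httplib::Server & svr) {
    svr.Post("/v1/chat/completions", [](const httplib::Request & req, httplib::Response & res) {
        const std::string body = req.body; // rule 1: copy request data, the handler frame goes away
        res.set_chunked_content_provider("text/event-stream",
            [body](size_t /*offset*/, httplib::DataSink & sink) {
                const std::string chunk =
                    "data: {\"received_bytes\": " + std::to_string(body.size()) + "}\n\n";
                sink.write(chunk.data(), chunk.size());
                sink.done();  // rule 2: explicit termination ...
                return false; // ... then stop the provider so httplib closes the connection
            });
    });
}
```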

### PATH Binary Resolution

Spawn commands support both absolute/relative paths and PATH-based binaries:

- **Paths with separators**: `/usr/bin/llama-server`, `./llama-server`, `C:\llama\server.exe` - existence validated before spawn
- **PATH binaries**: `python`, `vllm`, `ollama`, `llama-server` - no validation, relies on shell PATH resolution

The router only validates file existence for commands containing `/` or `\\` path separators, allowing seamless use of system-installed binaries.
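
The rule reduces to a small check, sketched here with std::filesystem:

```cpp
#include <filesystem>
#include <string>

// Only commands that look like paths are validated; bare names rely on PATH resolution.
static bool spawn_command_exists(const std::string & command) {
    const bool looks_like_path = command.find('/')  != std::string::npos ||
                                 command.find('\\') != std::string::npos;
    return !looks_like_path || std::filesystem::exists(command);
}
```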

### Spawn Command Placeholders

The router expands placeholders in spawn commands:
- `$path` → The model file path from `path` field
- `$port` → Dynamically assigned port (increments from `base_port`)
- `$host` → Always expands to `127.0.0.1` for security
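
A straightforward search-and-replace sketch of this expansion, not the router's exact implementation:

```cpp
#include <string>

static void replace_all(std::string & s, const std::string & from, const std::string & to) {
    for (size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
}

// "llama-server -m $path --host $host --port $port" becomes
// "llama-server -m /models/a.gguf --host 127.0.0.1 --port 50000"
static std::string expand_spawn_command(std::string cmd, const std::string & model_path, int port) {
    replace_all(cmd, "$path", model_path);
    replace_all(cmd, "$port", std::to_string(port));
    replace_all(cmd, "$host", "127.0.0.1"); // always loopback, per the rule above
    return cmd;
}
```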

### Model-Scoped Route Stripping

Routes like `/<model>/health` are router-side aliases for convenience. Before proxying to the backend, the router strips the model prefix:

- User request: `GET /Qwen3-8B-Q4_K_M.gguf/health`
- Forwarded to backend: `GET /health`

Backends remain unaware of model-scoped routing - they expose standard endpoints like `/health`, `/v1/chat/completions`, etc.
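
One way to express the strip, assuming model names never contain `/`:

```cpp
#include <string>

// "/Qwen3-8B-Q4_K_M.gguf/health" with model "Qwen3-8B-Q4_K_M.gguf" becomes "/health".
static std::string strip_model_prefix(const std::string & path, const std::string & model) {
    const std::string prefix = "/" + model;
    if (path.rfind(prefix, 0) == 0) {
        const std::string rest = path.substr(prefix.size());
        return rest.empty() ? "/" : rest;
    }
    return path; // not model-scoped, forward unchanged
}
```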

### HTTP Header Management

The router strips `Content-Length` and `Transfer-Encoding` headers before forwarding requests. This is standard reverse-proxy behavior to handle chunked requests/responses properly and avoid conflicts when the proxy re-chunks data.

All other headers are forwarded transparently to preserve client context (authentication, user-agent, etc.).
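
Sketched with cpp-httplib's `Headers` multimap; a real implementation would compare header names case-insensitively:

```cpp
#include <httplib.h>

// Drop the length/encoding headers, pass everything else through untouched.
static httplib::Headers forward_headers(const httplib::Headers & in) {
    httplib::Headers out;
    for (const auto & [name, value] : in) {
        if (name == "Content-Length" || name == "Transfer-Encoding") {
            continue;
        }
        out.emplace(name, value);
    }
    return out;
}
```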

### Real-Time Swap Notifications

The router implements an opt-in notification system for streaming swap progress to clients:

**Architecture:**
- `NotificationSink`: Function-based callback system in `router-config.h`
- `RouterApp::set/clear_notification_sink()`: Attach/detach sink before/after operations
- Progress emitted at 3 lifecycle points in `ensure_running()`:
* After `terminate_process()` - unload notification
* After `spawn_process()` - load notification
* After `wait_for_backend_ready()` - ready notification

**Implementation:**
The proxy layer owns the full request lifecycle. For streaming requests with `notify_model_swap=true`:
1. Attach sink that enqueues formatted SSE chunks into the stream state
2. Call `ensure_running()` - notifications flow directly into the SSE queue
3. Clear sink before forwarding to backend (prevents backend logs in stream)

Messages use OpenAI-compatible `delta.reasoning_content` field, prefixed with `[llama-router]` to distinguish router operations from model reasoning.

**Design rationale:**
- Sink pattern allows clean separation: RouterApp emits events, proxy consumes them
- Notifications are sent synchronously during operations, so clients perceive accurate timing
- Thread-safe via separate `notification_mutex` to avoid deadlock with main mutex
- Zero overhead when disabled (sink check + early return)
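
A compact sketch of the sink pattern; the alias and method names mirror the description above, but the real signatures in `router-config.h` and `RouterApp` may differ:

```cpp
#include <functional>
#include <mutex>
#include <string>

using NotificationSink = std::function<void(const std::string & message)>;

struct NotificationEmitter {
    void set_notification_sink(NotificationSink s) {
        std::lock_guard<std::mutex> lock(notification_mutex);
        sink = std::move(s);
    }
    void clear_notification_sink() {
        std::lock_guard<std::mutex> lock(notification_mutex);
        sink = nullptr;
    }
    void notify(const std::string & message) {
        std::lock_guard<std::mutex> lock(notification_mutex);
        if (!sink) { return; }                 // zero overhead when disabled
        sink("[llama-router] " + message);     // consumer formats this into an SSE chunk
    }
    std::mutex       notification_mutex;       // separate from the main mutex to avoid deadlock
    NotificationSink sink;
};
```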

### Health Endpoint Purpose

The `health_endpoint` configuration field serves **spawn readiness polling only** - the router uses it to detect when a backend has finished loading and is ready to serve requests.

This is separate from user-facing health routes. Clients can still call `/<model>/health` or `/health` for their own monitoring needs. The backend must expose standard endpoints regardless of which endpoint is configured as `health_endpoint` for polling.

### Multimodal Projector Priority

When importing collections with `--import-dir`, mmproj files are automatically detected with this search priority:

1. `*-bf16.gguf` (selected first)
2. `*-f16.gguf` (selected if BF16 not found)
3. `*-f32.gguf` (selected if neither BF16 nor F16 found)

All quantization variants of a model (Q4_K_M, Q5_K_M, Q6_K, etc.) found in the same directory share the same mmproj file.
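
The priority search can be sketched as a suffix scan over the directory, assuming (as in the cache layout shown under Manifest Robustness) that projector filenames contain `mmproj`:

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <vector>

static bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Return the highest-priority mmproj file in dir, or "" if none is present.
static std::string find_mmproj(const std::filesystem::path & dir) {
    if (!std::filesystem::is_directory(dir)) {
        return "";
    }
    const std::vector<std::string> priority = { "-bf16.gguf", "-f16.gguf", "-f32.gguf" };
    for (const auto & suffix : priority) {
        for (const auto & entry : std::filesystem::directory_iterator(dir)) {
            std::string name = entry.path().filename().string();
            std::transform(name.begin(), name.end(), name.begin(),
                           [](unsigned char c) { return std::tolower(c); });
            if (name.find("mmproj") != std::string::npos && ends_with(name, suffix)) {
                return entry.path().string();
            }
        }
    }
    return "";
}
```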

**For manual models:** mmproj auto-detection applies only during initial import. You can edit `spawn.command` to remove `--mmproj` if unwanted - your changes persist across restarts. Only `auto` models get their spawn configuration regenerated on rescan.

### Manifest Robustness

The HF cache scanner gracefully handles missing or corrupted manifest files:

- If `~/.cache/llama.cpp/` doesn't exist, scanner returns empty mapping
- If individual manifest files are missing, they're silently skipped
- Models without manifest entries load successfully, just without mmproj auto-detection

**Cache structure example:**
```
~/.cache/llama.cpp/
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
├── bartowski_Qwen2.5-1.5B-Instruct-GGUF_Qwen2.5-1.5B-Instruct-Q4_K_M.gguf.etag
├── manifest=bartowski=Qwen2.5-1.5B-Instruct-GGUF=latest.json
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_Qwen3-VL-4B-Instruct-Q6_K.gguf.etag
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf
├── unsloth_Qwen3-VL-4B-Instruct-GGUF_mmproj-F16.gguf.etag
└── manifest=unsloth=Qwen3-VL-4B-Instruct-GGUF=Q6_K.json
```

Manifest files (`manifest=vendor=repo=quant.json`) contain the metadata used for mmproj auto-detection, while model files use underscore separators: `vendor_repo_filename.gguf`.

This ensures the router remains operational even with incomplete cache metadata.

---

## Signals and Shutdown

The router handles graceful shutdown on:
- `SIGINT` (Ctrl+C)
- `SIGTERM`

Shutdown sequence:
1. Stop accepting new connections
2. Terminate all managed llama-server processes
3. Wait for process cleanup
4. Exit
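
A minimal sketch of the handler wiring; the flag name and main-loop details are assumptions, not the exact code in `router.cpp`:

```cpp
#include <atomic>
#include <csignal>

static std::atomic<bool> g_running{true};

static void handle_shutdown_signal(int /*sig*/) {
    g_running = false; // main loop stops accepting connections, then terminates backends
}

int main() {
    std::signal(SIGINT,  handle_shutdown_signal);
    std::signal(SIGTERM, handle_shutdown_signal);
    // ... run the HTTP server until g_running flips, then clean up child processes ...
    return 0;
}
```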

---

## Contributing

llama-router is part of the llama.cpp project. Contributions are welcome via pull request.

## License

MIT License - See llama.cpp repository for details.
35 changes: 35 additions & 0 deletions tools/router/CMakeLists.txt
@@ -0,0 +1,35 @@
set(TARGET llama-router)

set(TARGET_SRCS
    router.cpp
    router-app.cpp
    router-config.cpp
    router-admin.cpp
    router-endpoints.cpp
    logging.cpp
    router-process.cpp
    router-proxy.cpp
    router-scanner.cpp
)

set(PUBLIC_ASSETS
    index.html.gz
)

foreach(asset ${PUBLIC_ASSETS})
    set(input "${PROJECT_SOURCE_DIR}/tools/server/public/${asset}")
    set(output "${CMAKE_CURRENT_BINARY_DIR}/${asset}.hpp")
    list(APPEND TARGET_SRCS ${output})
    add_custom_command(
        DEPENDS "${input}"
        OUTPUT "${output}"
        COMMAND "${CMAKE_COMMAND}" "-DINPUT=${input}" "-DOUTPUT=${output}" -P "${PROJECT_SOURCE_DIR}/scripts/xxd.cmake"
    )
    set_source_files_properties(${output} PROPERTIES GENERATED TRUE)
endforeach()

add_executable(${TARGET} ${TARGET_SRCS})

target_link_libraries(${TARGET} PRIVATE common cpp-httplib ${CMAKE_THREAD_LIBS_INIT})

target_include_directories(${TARGET} PRIVATE ${PROJECT_SOURCE_DIR} ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR})