
Conversation


@ngxson ngxson commented Nov 19, 2025

The feature is implemented using subprocesses: there is one main "router" server whose job is to spawn "child" processes that actually run the inference.

This system was designed and tested against these failure cases:

  • The child process exits unexpectedly due to an error (for example, a GGML_ASSERT)
  • The child process fails to launch (for example, the system cannot start the process)
  • The router process exits unexpectedly due to an error; in this case, the child processes automatically stop themselves

The following steps happen when a user requests the router to launch a model instance (a minimal sketch follows the list):

  1. Check whether the model already has a process; if yes, skip
  2. Construct argv and envp to launch the child process; a random HTTP port is selected for each child process
  3. Start the child process
  4. Create a thread that reads the child's stdout/stderr and forwards it to the main process, prefixed with [port_number]
  5. Inside the child process, notify the router server of its "ready" status, then spawn a thread to monitor its stdin
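
A minimal sketch of steps 2-4 on the router side, assuming the vendored sheredom subprocess.h API (subprocess_create_ex, subprocess_stdout); the helper name spawn_child is illustrative, and the real logic lives in server-models.cpp:

#include <sheredom/subprocess.h>

#include <cstdio>
#include <string>
#include <thread>

// spawn one child instance on a chosen port and forward its logs
static bool spawn_child(const std::string & server_bin, int port, struct subprocess_s & proc) {
    const std::string port_str = std::to_string(port);
    const char * child_argv[] = { server_bin.c_str(), "--port", port_str.c_str(), nullptr };
    const char * child_envp[] = { nullptr }; // the real code forwards the router's environment

    const int options = subprocess_option_combined_stdout_stderr; // merge stderr into stdout
    if (subprocess_create_ex(child_argv, options, child_envp, &proc) != 0) {
        return false; // failure case: the system could not launch the process
    }

    // step 4: forward the child's output, prefixed with the port number;
    // `proc` must outlive this thread (the real code keeps it in a per-model mapping)
    std::thread([&proc, port] {
        char line[4096];
        FILE * child_out = subprocess_stdout(&proc);
        while (fgets(line, sizeof(line), child_out) != nullptr) {
            printf("[%d] %s", port, line);
        }
        // fgets returning NULL means the stream closed: this is how the router
        // detects that the child exited, cleanly or via a crash
    }).detach();
    return true;
}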

If the child process exits, the router server knows as soon as the child's stdout/stderr is closed. Conversely, if the router process exits (or requests the child to stop), it simply closes the child's stdin; the child detects this and internally calls its SIGTERM/SIGINT handler.
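
On the child side, the stdin monitor can be a single detached thread blocking on std::getchar(); a minimal sketch, assuming shutdown_handler is the server's existing SIGTERM/SIGINT handler:

#include <cstdio>
#include <thread>

extern void shutdown_handler(int signal); // assumed: the server's existing signal handler

static void monitor_router_stdin() {
    std::thread([] {
        // getchar() blocks until stdin has data or is closed; EOF means the
        // router closed the pipe, either deliberately or because it crashed
        while (std::getchar() != EOF) {
            // ignore any bytes the router may write
        }
        shutdown_handler(0); // take the same shutdown path as SIGTERM/SIGINT
    }).detach();
}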


Other changes included in the PR:

  • Remove DEFAULT_MODEL_PATH
  • If -m, --model is not specified, common_params_parse_ex returns an error (except for the server)

Summary by CodeRabbit

  • New Features

    • Router mode: manage multiple models with load/unload/status endpoints and automatic routing
    • UI: model selector dropdown, detailed Model Information dialog, and improved audio recording controls (auto-mic)
    • Attachments: better handling/previews for audio, images, PDFs, and text; unified attachment UX
  • Documentation

    • New guide for multi-model router setup and usage
  • Tests

    • Added router streaming endpoint test suite


@ngxson changed the title from "server: add model management and proxy (staging PR)" to "server: add model management and proxy" on Nov 19, 2025
@coderabbitai's comment was marked as resolved.

Comment on lines 5125 to 5136
res->ok({
    // TODO: add support for this on web UI
    {"role", "router"},
    {"max_instances", 4}, // dummy value for testing
    // this is a dummy response to make sure webui doesn't break
    {"model_alias", "llama-server"},
    {"model_path", "none"},
    {"default_generation_settings", {
        {"params", json{}},
        {"n_ctx", 0},
    }},
});
ngxson (Owner Author) commented:

@allozaur this is the data that /props will return if it is a router server (management server)

the model_alias, model_path and default_generation_settings will disappear once web UI has support for managing models


ngxson commented Nov 20, 2025

@coderabbitai draw some sequence diagrams for the lifetime of a "child" server process, and how stdin is used to signal between the 2 processes

@ngxson ngxson marked this pull request as ready for review November 20, 2025 15:29
@coderabbitai's comment was marked as resolved.


ngxson commented Nov 20, 2025

@coderabbitai review


coderabbitai bot commented Nov 20, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server.cpp (1)

4471-4572: Router mode still uses local /metrics, but worker loop never runs

In router mode you never call ctx_server.load_model() or ctx_server.queue_tasks.start_loop(), yet /metrics continues to be wired to routes.get_metrics, which posts a SERVER_TASK_TYPE_METRICS task and then blocks on queue_results.recv(task_id). With no queue_tasks loop, a router server started with --metrics will hang /metrics requests.

Consider explicitly disabling or proxying metrics in router mode, e.g.:

  • In the router branch (after constructing server_routes), override routes.get_metrics with a handler that returns ERROR_TYPE_NOT_SUPPORTED (sketched after this comment), or
  • Implement router-level aggregation and bind /metrics to that instead.

This avoids blocking on an idle queue in router-only processes.

Also applies to: 5662-5735
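
A hedged sketch of the first option, reusing names that appear elsewhere in this PR (server_res_generator, format_error_response, ERROR_TYPE_NOT_SUPPORTED); the exact handler signature is an assumption:

// router branch: stub out /metrics instead of posting to an idle task queue
routes.get_metrics = [&ctx_server](const server_http_req & req) {
    (void) req; // no task is posted, so the request params are unused
    auto res = std::make_unique<server_res_generator>(ctx_server);
    res->error(format_error_response("/metrics is not supported in router mode",
                                     ERROR_TYPE_NOT_SUPPORTED));
    return res;
};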

♻️ Duplicate comments (1)
tools/server/server-models.cpp (1)

302-335: unload_all joins threads correctly but still suffers from destroy race

unload_all() correctly collects inst.th into to_join and joins after releasing the mutex, which avoids deadlocks:

for (auto & [name, inst] : mapping) {
    if (inst.meta.is_active()) {
        subprocess_destroy(inst.subproc.get());
        // status change will be handled by the managing thread
    }
    to_join.push_back(std::move(inst.th));
}
...
for (auto & th : to_join) {
    if (th.joinable()) {
        th.join();
    }
}

However, this still uses subprocess_destroy() from the router thread, with the same race against the management thread’s fgets() / subprocess_join() as noted in the other comment.

Once you centralize join/destroy in the management thread (and switch unload/unload_all to signal shutdown only), this to_join logic should remain, but the explicit subprocess_destroy() here can be dropped.

🧹 Nitpick comments (6)
tools/server/tests/unit/test_router.py (2)

1-4: Replace from utils import * with explicit imports to satisfy linters and clarify dependencies

Ruff is flagging the star import / possibly-undefined names. Importing only what you use keeps the test clearer and avoids F403/F405 without adding complexity.

Consider:

-import pytest
-from utils import *
+import pytest
+from utils import ServerError, ServerPreset, ServerProcess

The rest of the file can stay as-is (the global server: ServerProcess annotation and the ServerError catch will then be fully explicit).

Also applies to: 8-9, 24-25, 42-43


19-50: Router streaming test logic is sound; TODO about pre-warming cache can remain a follow-up

The way the test distinguishes success/failure (via ServerError vs accumulated content) and checks finish_reason/delta fields is aligned with the streaming contract exercised in make_stream_request. The TODO about calling ServerPreset.load_all() to ensure models are cached is reasonable but not required to land this test; it can be handled in a later PR if you start seeing flakiness from cold HF downloads.

tools/server/tests/utils.py (1)

49-50: model_hf_repo/tinygemma3/router preset wiring looks consistent; consider excluding router() from load_all()

  • Making model_hf_repo str | None and using None in ServerPreset.router() cleanly expresses the “no model attached” router process; start() already checks truthiness before adding --hf-repo, so this is safe.
  • The tinygemma3 preset change (model_hf_file = None, model_hf_repo = "ggml-org/tinygemma3-GGUF:Q8_0") matches the CLI -hf user/model:quant style and should simplify mmproj handling.

One minor design point: ServerPreset.load_all() now instantiates and starts every preset except load_all, including router(), even though the router has no model to cache. If you want load_all() to strictly mean “warm model caches,” you could either filter out router there or add a short comment that starting the router as part of preload is intentional.

Also applies to: 517-545, 415-427

tools/server/server-models.h (1)

3-13: Make the header self-contained by including all standard library headers it relies on

This header uses std::string, std::map, std::optional, std::vector, and std::thread (via server_model_meta, mapping, base_args/env, and instance_t/thread) but only explicitly includes a subset of the needed headers. Relying on common.h / server-http.h to transitively provide these can make the header fragile.

Consider adding the missing includes here:

 #include "common.h"
 #include "server-http.h"

 #include <sheredom/subprocess.h>

-#include <queue>
-#include <mutex>
-#include <condition_variable>
-#include <functional>
-#include <memory>
+#include <queue>
+#include <mutex>
+#include <condition_variable>
+#include <functional>
+#include <memory>
+#include <map>
+#include <optional>
+#include <string>
+#include <thread>
+#include <vector>

That way server-models.h can be included on its own without depending on upstream headers to pull these in.

Also applies to: 54-115, 121-143

tools/server/server.cpp (1)

5592-5612: Fix typo in exception log message

The SRV_ERR line in ex_wrapper logs:

SRV_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str());

hanlding should be handling for clearer logs:

- SRV_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str());
+ SRV_ERR("got another exception: %s | while handling exception: %s\n", e.what(), message.c_str());
tools/server/server-models.cpp (1)

432-477: pipe_t and server_http_proxy streaming design is generally solid

The pipe_t<T> abstraction and server_http_proxy implementation nicely decouple the upstream HTTP connection from the downstream child server (a simplified pipe_t sketch follows this comment):

  • Thread-safe queue with writer_closed/reader_closed flags prevents blocking writers when the client disconnects.
  • server_http_proxy::next reads from the pipe using the caller’s should_stop and stops cleanly on EOF or cancellation.
  • The proxy thread uses httplib::ResponseHandler + ContentReceiverWithProgress to stream headers first and then body chunks, mapping them into msg_t pieces via the pipe.

One small enhancement you might consider (not urgent):

  • Check and handle the boolean return from the initial pipe->read(header, should_stop); (e.g., treat false before any header as a transport error and keep the default 500 + minimal diagnostic body), to make failure behavior more explicit.

Otherwise, the streaming proxy looks correct and fits the surrounding server/http abstractions.

Also applies to: 479-559
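
To make the shape of this concrete, a simplified sketch of the pipe_t idea as described above (the PR's real implementation also threads a should_stop callback through read, omitted here):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
struct pipe_t {
    std::queue<T>           q;
    std::mutex              mtx;
    std::condition_variable cv;
    bool writer_closed = false; // producer (proxy thread) is done
    bool reader_closed = false; // consumer (HTTP client) went away

    // returns false if the reader is gone, so writers never block forever
    bool write(T && item) {
        std::lock_guard<std::mutex> lock(mtx);
        if (reader_closed) {
            return false;
        }
        q.push(std::move(item));
        cv.notify_one();
        return true;
    }

    // returns false on EOF: writer closed and the queue is drained
    bool read(T & out) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [&] { return !q.empty() || writer_closed; });
        if (q.empty()) {
            return false;
        }
        out = std::move(q.front());
        q.pop();
        return true;
    }

    void close_writer() { std::lock_guard<std::mutex> lock(mtx); writer_closed = true; cv.notify_all(); }
    void close_reader() { std::lock_guard<std::mutex> lock(mtx); reader_closed = true; cv.notify_all(); }
};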

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c91f26 and 5805ca7.

📒 Files selected for processing (9)
  • README.md (1 hunks)
  • scripts/sync_vendor.py (1 hunks)
  • tools/server/CMakeLists.txt (1 hunks)
  • tools/server/server-models.cpp (1 hunks)
  • tools/server/server-models.h (1 hunks)
  • tools/server/server.cpp (7 hunks)
  • tools/server/tests/unit/test_router.py (1 hunks)
  • tools/server/tests/utils.py (3 hunks)
  • vendor/sheredom/subprocess.h (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{**/CMakeLists.txt,cmake/**/*.cmake}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{**/CMakeLists.txt,cmake/**/*.cmake}: Use CMake (Makefile is deprecated); maintain build configuration via CMake files
Require CMake 3.14+ in build configuration

Files:

  • tools/server/CMakeLists.txt
**/*.{c,cc,cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{c,cc,cpp,h,hpp}: Always format all C/C++ code with the repository’s .clang-format before committing
Use 4-space indentation in C/C++ code
Limit C/C++ lines to 120 columns
Place function-opening braces on the same line in C/C++
Use middle pointer alignment (e.g., void * ptr) in C/C++
Use middle reference alignment (e.g., int & ref) in C/C++
Ensure C/C++ code is compatible with C++17 (project standard)

Files:

  • vendor/sheredom/subprocess.h
  • tools/server/server.cpp
  • tools/server/server-models.cpp
  • tools/server/server-models.h
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Python code should pass flake8 per .flake8 (e.g., max-line-length=125, configured excludes)
Python code should pass pyright type checking per pyrightconfig.json

Files:

  • scripts/sync_vendor.py
  • tools/server/tests/utils.py
  • tools/server/tests/unit/test_router.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-09-18T02:06:49.548Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-18T02:06:49.548Z
Learning: Applies to include/llama.h : Treat changes to the public API header include/llama.h as high-risk and review carefully for API stability

Applied to files:

  • tools/server/CMakeLists.txt
🧬 Code graph analysis (4)
tools/server/server.cpp (3)
tools/server/utils.hpp (2)
  • json_value (44-56)
  • json_value (44-44)
tools/server/server-models.h (2)
  • string (44-52)
  • server_model_status_from_string (30-42)
tools/server/server-models.cpp (2)
  • setup_child_server (388-425)
  • setup_child_server (388-388)
tools/server/tests/unit/test_router.py (1)
tools/server/tests/utils.py (5)
  • ServerProcess (44-408)
  • router (533-545)
  • start (111-257)
  • ServerError (38-41)
  • make_stream_request (297-318)
tools/server/server-models.cpp (2)
tools/server/server-models.h (1)
  • string (44-52)
vendor/sheredom/subprocess.h (4)
  • subprocess_create_ex (488-924)
  • subprocess_stdout (930-932)
  • subprocess_join (942-1000)
  • subprocess_destroy (1002-1039)
tools/server/server-models.h (1)
tools/server/server-models.cpp (1)
  • server_http_proxy (479-559)
🪛 GitHub Actions: EditorConfig Checker
tools/server/server-models.h

[error] 16-16: Trailing whitespace detected by editorconfig-checker.

🪛 Ruff (0.14.5)
tools/server/tests/unit/test_router.py

2-2: from utils import * used; unable to detect undefined names

(F403)


4-4: ServerProcess may be undefined, or defined from star imports

(F405)


9-9: ServerPreset may be undefined, or defined from star imports

(F405)


24-24: ServerError may be undefined, or defined from star imports

(F405)


42-42: ServerError may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: ggml-ci-x64-cpu-low-perf
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: macOS-latest-cmake-x64
🔇 Additional comments (6)
tools/server/CMakeLists.txt (1)

14-21: Wiring server-models into llama-server target looks consistent

Adding server-models.cpp/.h alongside the existing server sources follows the current pattern and should integrate cleanly into the build.

README.md (1)

606-615: Documenting subprocess.h as a dependency is clear and consistent

The new entry cleanly describes the role of subprocess.h and matches its use as a vendor single-header dependency elsewhere in the PR.

scripts/sync_vendor.py (1)

19-22: Pinned vendor mapping for subprocess.h aligns with the new dependency

Adding the fixed-commit URL → vendor/sheredom/subprocess.h entry follows the existing vendor pattern and keeps the dependency reproducible. Just ensure this hash stays in sync with whatever version you expect in vendor/ and the README.

vendor/sheredom/subprocess.h (1)

1-1202: Vendored subprocess header looks consistent and suitable for use

This header appears to be an unmodified sheredom subprocess.h with the expected API surface (subprocess_create_ex, subprocess_join, subprocess_destroy, async read helpers, etc.), and is integrated in a standard way from the call sites.

I would avoid local edits here and keep any lifecycle / concurrency fixes confined to callers like server_models, to stay aligned with upstream behavior.

tools/server/server-models.cpp (2)

388-425: Child server stdin monitor is a neat design; just ensure only stdin closure is used for shutdown

server_models::setup_child_server:

  • Notifies the router that the model is ready via POST /models/status.
  • Spawns a detached thread that blocks on getchar() and invokes shutdown_handler(0) on EOF.

This is a clean mechanism to let the router signal shutdown by closing the child’s stdin, and to detect parent death as well.

Given that, the safest approach (in line with the other comments) is:

  • Use only stdin closure (or subprocess_terminate if you really need a hard kill) for unload semantics.
  • Avoid closing stdout/stderr from the router thread so the monitor thread and the main loop in the child can run to completion and clean up naturally.

With that constraint, this stdin-driven shutdown channel should be robust across platforms.


29-65: Unable to verify review comment — file not found in repository

The file tools/server/server-models.cpp does not exist in the llama.cpp repository (searched master branch and all branches). Only server.cpp and server-http.cpp are present in tools/server/.

This file appears to exist only on the PR branch being reviewed. Without access to the actual code, I cannot verify the specific claims about:

  • exec_path computation and usage
  • proxy_request stale meta.port behavior
  • subprocess_destroy race conditions with management threads
  • get_free_port() error handling
  • ensure_model_loaded() FAILED state handling

Manually verify the code changes in the PR branch against the issues identified in the review comment, particularly:

  1. Whether exec_path is actually used or remains unused in load()
  2. Whether meta.port captured before ensure_model_loaded() creates a stale reference
  3. Whether concurrent subprocess_destroy() calls pose thread-safety risks


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tools/server/server.cpp (3)

5592-5612: Fix typo in error log message: "hanlding" → "handling"

Confirmed typo at line 5610 in tools/server/server.cpp:

-            SRV_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str());
+            SRV_ERR("got another exception: %s | while handling exception: %s\n", e.what(), message.c_str());

5617-5623: Use standard C++ main signature to ensure portability

The three-argument main(int argc, char** argv, char** envp) form used at line 5617 is not part of the C++ standard—it is an implementation-defined extension. While widely supported on POSIX systems, it may fail or behave inconsistently on other platforms like Windows.

Since envp is passed to server_models at line 5669, refactor to one of these approaches:

  • Restore int main(int argc, char** argv) and acquire the environment via platform APIs (extern char **environ on POSIX, GetEnvironmentStringsW on Windows), passing that to server_models; or
  • Move the environment acquisition inside server_models to eliminate the caller's dependency on envp.

Verify that all target compilers accept whichever approach you choose before merging; a POSIX sketch of the first option follows.
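
A minimal sketch of the first option on POSIX (the Windows branch via GetEnvironmentStringsW is omitted, and the server_models wiring is elided):

extern char ** environ; // POSIX: the process environment, no 3-arg main needed

int main(int argc, char ** argv) {
    char ** envp = environ; // same data the non-standard main(argc, argv, envp) would receive
    // ... parse args, then hand envp to server_models as before ...
    (void) argc; (void) argv; (void) envp;
    return 0;
}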


5617-5694: Router-mode detection via params.model.path == DEFAULT_MODEL_PATH is confirmed as fragile

The concern is valid. Router mode detection at line 5666 relies on a string comparison against a real file path ("models/7B/ggml-model-f16.gguf"), not a dedicated sentinel. This creates two problems:

  1. Path collision risk: If a user explicitly passes the same path as DEFAULT_MODEL_PATH, they unexpectedly enter router mode (no local model loaded; all requests proxied to child instances).

  2. Semantic coupling: Any future change to DEFAULT_MODEL_PATH in common/common.h:29 silently alters router detection semantics. The codebase even acknowledges this at line 5665 with // TODO: improve this by changing arg.cpp.

Currently, when is_router_server is true (line 5736), the process skips model loading and acts as a request router; when false (line 5774), it loads the model locally. There is no explicit --router flag to disambiguate the intent.

Replace the implicit path-based detection with an explicit router mode flag (e.g., a dedicated CLI switch or a boolean parameter in common_params) that cannot collide with real model file paths.
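
Illustratively, with a hypothetical router_mode field in common_params (field and flag names are not from the PR):

// common_params, hypothetical addition set by a dedicated --router switch in arg.cpp:
//     bool router_mode = false;

// server.cpp detection then expresses intent directly:
bool is_router_server = params.router_mode;
// instead of the fragile path comparison:
// bool is_router_server = params.model.path == DEFAULT_MODEL_PATH;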

🧹 Nitpick comments (8)
tools/server/server-models.cpp (2)

29-65: get_server_exec_path() result currently unused; consider either wiring it or dropping it

get_server_exec_path() is a fairly involved helper with multiple platform branches, but in load() the exec_path value is computed and then never used:

std::string exec_path = get_server_exec_path().string();
SRV_INF("spawning server instance ...");

Given that child processes are launched via base_args rather than this resolved path, you might either:

  • Use exec_path to overwrite child_args[0] to ensure an absolute, stable path to the current binary (one-liner sketched below), or
  • Drop the call and local variable entirely for now to avoid doing extra filesystem/syscall work on every load.
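
The first option is a one-line change at the spawn site, assuming child_args holds the C strings later passed to subprocess_create_ex and that exec_path outlives the spawn call:

child_args[0] = exec_path.c_str(); // pin the child to the exact binary the router is running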

124-182: Race condition is real; suggest adopting robust port allocation pattern

The review comment is technically accurate. Web search confirms this bind/close/spawn pattern is explicitly listed among patterns to avoid in modern C++ servers. The codebase exhibits the documented race: get_free_port() closes the socket at line 181, then subprocess_create_ex() spawns the child at line 257, creating a window where another process can claim that port before the child binds it.

Verified findings:

  • Router uses multi-process worker spawning (subprocess_create_ex, line 257)
  • Port is allocated at line 219, spawned at line 257—no mitigation bridges this gap
  • No existing synchronization or alternative strategies in place
  • Standard robust alternatives confirmed by web search: parent-creates-and-passes FD, socket-activation, or SO_REUSEPORT (platform-specific)

Under typical usage this may be acceptable (low concurrency, sufficient time before collisions). However, high concurrency could trigger occasional "address already in use" failures. Adopting one of the web-verified patterns (especially FD passing or socket-activation) would eliminate the race entirely and align with modern server practices.
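
For reference, a minimal POSIX sketch of the bind/close pattern under discussion (illustrative, not the PR's exact code); the race window opens at the final close() and lasts until the child binds the port itself:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int get_free_port_racy() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0; // ask the kernel for any free port
    if (bind(fd, (sockaddr *) &addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    socklen_t len = sizeof(addr);
    getsockname(fd, (sockaddr *) &addr, &len); // learn which port was chosen
    int port = ntohs(addr.sin_port);
    close(fd); // the port is free again: another process can grab it before the child does
    return port;
}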

tools/server/server-models.h (2)

30-52: Header hygiene: include the standard headers for types you use

This header uses several standard library types that are not explicitly included here:

  • std::string, std::vector, std::map, std::optional, std::thread.

They likely come in transitively via common.h / server-http.h, but that’s not guaranteed and can break when upstream headers change.

Consider adding the minimal set of direct includes, for example:

 #include "common.h"
 #include "server-http.h"

 #include <sheredom/subprocess.h>

-#include <queue>
-#include <mutex>
-#include <condition_variable>
-#include <functional>
-#include <memory>
+#include <queue>
+#include <map>
+#include <mutex>
+#include <condition_variable>
+#include <functional>
+#include <memory>
+#include <optional>
+#include <string>
+#include <thread>
+#include <vector>

This keeps the header self‑contained and more robust to refactors.


121-143: Consider documenting streaming/cleanup semantics of server_http_proxy

server_http_proxy owns a background thread and exposes a cleanup callback called from the destructor. Since this is part of the public API, it might help future readers/tests if you briefly document:

  • That responses are streamed via next and backed by a detached thread.
  • That cleanup closes the internal pipe to allow the thread to terminate.

No code change needed; just a short comment near the struct definition.
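
One possible shape for that comment (wording illustrative):

// server_http_proxy streams a child server's HTTP response through an internal
// pipe_t: a background thread pushes headers first, then body chunks, and
// next() pops them on the caller's thread. The destructor's cleanup callback
// closes the pipe so the thread can exit even if the client disconnects
// mid-stream.
struct server_http_proxy { /* ... */ };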

tools/server/tests/utils.py (1)

518-531: tinygemma3 HF repo change and router preset behavior

  • Switching tinygemma3 to use the combined HF repo spec ("ggml-org/tinygemma3-GGUF:Q8_0") with model_hf_file = None is consistent with the CLI args construction and should work as long as llama-server expects the revision-qualified repo string on --hf-repo.
  • The new ServerPreset.router() correctly clears all model sources so the process starts in router mode only; the small context/batch/slots values are fine since no local inference happens.
  • Note that ServerPreset.load_all() will now also instantiate the router preset, briefly starting and stopping a router-only server. If you intended load_all() purely as a model cache warmer, you may want to exclude router from that loop to avoid starting an extra control-plane instance during test setup.
tools/server/tests/unit/test_router.py (1)

1-4: Avoid from utils import * in tests

The wildcard import makes it harder to see which test dependencies are actually used and trips static analysis (F403/F405). It’s more robust to import only what you need, e.g.:

-import pytest
-from utils import *
+import pytest
+from utils import ServerProcess, ServerPreset, ServerError

This keeps the namespace explicit while preserving current behavior.

tools/server/server.cpp (2)

5142-5155: Consider validating missing/empty model in proxy_post/proxy_get before delegating

Both proxy_get and proxy_post call models->ensure_model_loaded(name) where name comes from a query parameter or JSON body. If the client omits model (or sends an empty string), this will pass an empty name through to ensure_model_loaded, leaving behavior entirely to server_models.

To make the router API fail fast and more user-friendly, you could treat an empty model name as a client error:

-        std::string name = json_value(body, "model", std::string());
-        models->ensure_model_loaded(name);
-        return models->proxy_request(req, method, name);
+        std::string name = json_value(body, "model", std::string());
+        if (name.empty()) {
+            auto res = std::make_unique<server_res_generator>(ctx_server);
+            res->error(format_error_response("\"model\" must be provided", ERROR_TYPE_INVALID_REQUEST));
+            return res;
+        }
+        models->ensure_model_loaded(name);
+        return models->proxy_request(req, method, name);

A similar guard could be added to proxy_get if you don’t want /... ?model= to fall back to the router’s own identity.


5733-5800: Router vs non-router startup/cleanup flow looks consistent, but duplicate is_ready set

  • The new clean_up lambdas for router and non-router correctly encapsulate the different teardown logic (models->unload_all() vs. queue termination + ctx_http.stop()), and they’re called both on early exit paths and at the end of the main loop.
  • Router mode now starts HTTP without loading a model and skips queue_tasks.start_loop(), while non-router mode preserves the existing path and additionally registers as a child with server_models::setup_child_server when LLAMA_SERVER_ROUTER_PORT is set; this separation is clear and avoids interleaving the two lifecycles.
  • Minor nit: in router mode ctx_http.is_ready.store(true) is called both before ctx_http.start() and again just before joining the HTTP thread; one of these writes is redundant.

Functionally this all hangs together; the only actionable change I’d suggest here is to remove the duplicate is_ready store for clarity.

Also applies to: 5816-5843

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c91f26 and 5805ca7.

📒 Files selected for processing (9)
  • README.md (1 hunks)
  • scripts/sync_vendor.py (1 hunks)
  • tools/server/CMakeLists.txt (1 hunks)
  • tools/server/server-models.cpp (1 hunks)
  • tools/server/server-models.h (1 hunks)
  • tools/server/server.cpp (7 hunks)
  • tools/server/tests/unit/test_router.py (1 hunks)
  • tools/server/tests/utils.py (3 hunks)
  • vendor/sheredom/subprocess.h (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{**/CMakeLists.txt,cmake/**/*.cmake}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

{**/CMakeLists.txt,cmake/**/*.cmake}: Use CMake (Makefile is deprecated); maintain build configuration via CMake files
Require CMake 3.14+ in build configuration

Files:

  • tools/server/CMakeLists.txt
**/*.{c,cc,cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{c,cc,cpp,h,hpp}: Always format all C/C++ code with the repository’s .clang-format before committing
Use 4-space indentation in C/C++ code
Limit C/C++ lines to 120 columns
Place function-opening braces on the same line in C/C++
Use middle pointer alignment (e.g., void * ptr) in C/C++
Use middle reference alignment (e.g., int & ref) in C/C++
Ensure C/C++ code is compatible with C++17 (project standard)

Files:

  • vendor/sheredom/subprocess.h
  • tools/server/server-models.h
  • tools/server/server-models.cpp
  • tools/server/server.cpp
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Python code should pass flake8 per .flake8 (e.g., max-line-length=125, configured excludes)
Python code should pass pyright type checking per pyrightconfig.json

Files:

  • tools/server/tests/unit/test_router.py
  • scripts/sync_vendor.py
  • tools/server/tests/utils.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🪛 GitHub Actions: EditorConfig Checker
tools/server/server-models.h

[error] 16-16: Trailing whitespace found on line 16. Remove trailing whitespace.

🪛 Ruff (0.14.5)
tools/server/tests/unit/test_router.py

2-2: from utils import * used; unable to detect undefined names

(F403)


4-4: ServerProcess may be undefined, or defined from star imports

(F405)


9-9: ServerPreset may be undefined, or defined from star imports

(F405)


24-24: ServerError may be undefined, or defined from star imports

(F405)


42-42: ServerError may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Run WebUI tests
  • GitHub Check: server-windows
  • GitHub Check: ggml-ci-x64-cpu-high-perf
  • GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
  • GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-22-cmake-musa
🔇 Additional comments (6)
README.md (1)

606-615: Documenting subprocess.h dependency looks correct

The new subprocess.h entry is consistent with the vendored header path and usage; no issues from a docs or dependency-discovery standpoint.

scripts/sync_vendor.py (1)

19-22: Pinned subprocess.h vendor mapping is reasonable

Adding subprocess.h to the vendor map with a commit-pinned raw URL and a clear destination path is consistent with the existing pattern and should integrate cleanly with the sync script.

tools/server/CMakeLists.txt (1)

14-21: Including server-models.{cpp,h} in the server target is correct

Wiring server-models.cpp/server-models.h into TARGET_SRCS matches the new include in server.cpp and ensures router/model-management code is actually built and linked.

tools/server/tests/utils.py (1)

45-55: Allowing model_hf_repo to be nullable looks correct

Relaxing model_hf_repo to str | None matches how it’s used in start() (flag only emitted when truthy) and is required for the router() preset; no behavioral regressions apparent.

tools/server/server.cpp (1)

4453-4457: Router model-management endpoints and server_models ownership look structurally sound

  • Storing server_models as a std::unique_ptr inside server_routes keeps router-specific state co-located with the HTTP handlers and avoids extra globals.
  • The router endpoints (get_router_props, get_router_models, /models/load, /models/unload, /models/status) correctly delegate to server_models and use the shared format_error_response machinery for consistent error payloads.
  • get_router_props falling back to proxy_get when model is provided allows /props?model=... to behave like the legacy single-model endpoint, while /props without a model exposes router metadata.
  • All handlers that dereference models are only wired when router mode is active, so null dereferences are avoided as long as server_models construction succeeds.

No functional issues stand out here; behavior mostly depends on the correctness of server_models itself.

Also applies to: 5115-5222

tools/server/tests/unit/test_router.py (1)

6-50: The review comment's primary claim is incorrect

The autouse fixture in test_router.py does not "never stop the server." The conftest.py file contains an autouse fixture stop_server_after_each_test() that runs after each test and calls .stop() on all instances tracked in server_instances. When ServerProcess.start() is called, it adds itself to the server_instances set, and stop() removes it and kills the process.

The conftest cleanup executes between each parametrized test iteration, so the resource leakage and port conflict concerns described in the review do not actually occur—servers are properly cleaned up after each test.

That said, the suggested refactor remains valid as a design improvement. Using global state with autouse fixtures is an anti-pattern even when cleanup is guaranteed elsewhere. A scoped fixture per test (as proposed) would be cleaner and more maintainable. However, the refactor is a stylistic improvement rather than a fix for an actual resource leak.

Likely an incorrect or invalid review comment.

allozaur and others added 29 commits November 29, 2025 02:36
