(staging PR) server: add model management and proxy #39
base: master
Conversation
tools/server/server.cpp
Outdated
```cpp
res->ok({
    // TODO: add support for this on web UI
    {"role", "router"},
    {"max_instances", 4}, // dummy value for testing
    // this is a dummy response to make sure webui doesn't break
    {"model_alias", "llama-server"},
    {"model_path", "none"},
    {"default_generation_settings", {
        {"params", json{}},
        {"n_ctx", 0},
    }},
});
```
@allozaur this is the data that /props will return if it is a router server (management server)
the model_alias, model_path and default_generation_settings will disappear once web UI has support for managing models
@coderabbitai draw some sequence diagrams for the lifetime of a "child" server process, and how stdin is used to signal between the two processes
…' into allozaur/server_model_management_v1_2
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/server/server.cpp (1)
4471-4572: Router mode still uses local `/metrics`, but worker loop never runs

In router mode you never call `ctx_server.load_model()` or `ctx_server.queue_tasks.start_loop()`, yet `/metrics` continues to be wired to `routes.get_metrics`, which posts a `SERVER_TASK_TYPE_METRICS` task and then blocks on `queue_results.recv(task_id)`. With no `queue_tasks` loop, a router server started with `--metrics` will hang `/metrics` requests.

Consider explicitly disabling or proxying metrics in router mode, e.g.:

- In the router branch (after constructing `server_routes`), override `routes.get_metrics` with a handler that returns `ERROR_TYPE_NOT_SUPPORTED`, or
- Implement router-level aggregation and bind `/metrics` to that instead.

This avoids blocking on an idle queue in router-only processes.
Also applies to: 5662-5735
♻️ Duplicate comments (1)
tools/server/server-models.cpp (1)
302-335: `unload_all` joins threads correctly but still suffers from destroy race

`unload_all()` correctly collects `inst.th` into `to_join` and joins after releasing the mutex, which avoids deadlocks:

```cpp
for (auto & [name, inst] : mapping) {
    if (inst.meta.is_active()) {
        subprocess_destroy(inst.subproc.get());
        // status change will be handled by the managing thread
    }
    to_join.push_back(std::move(inst.th));
}
...
for (auto & th : to_join) {
    if (th.joinable()) {
        th.join();
    }
}
```

However, this still uses `subprocess_destroy()` from the router thread, with the same race against the management thread's `fgets()`/`subprocess_join()` as noted in the other comment.

Once you centralize join/destroy in the management thread (and switch unload/unload_all to signal shutdown only), this `to_join` logic should remain, but the explicit `subprocess_destroy()` here can be dropped.
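The "only the management thread destroys; other threads merely signal" shape suggested here can be sketched in a few lines. This is a hedged Python illustration with hypothetical names (the PR's code is C++; `resource_close` stands in for closing the child's stdin or calling `subprocess_destroy`), not the actual implementation:

```python
import threading

class ManagedInstance:
    """Owner-thread lifecycle pattern: only the management thread tears the
    resource down; every other thread merely requests shutdown via an event."""

    def __init__(self, resource_close):
        self._resource_close = resource_close  # e.g. closes the child's stdin
        self._stop = threading.Event()
        self._done = threading.Event()
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        # Management thread: wait for a stop request, then perform the
        # destroy/join itself, so no other thread can race the teardown.
        self._stop.wait()
        self._resource_close()
        self._done.set()

    def request_shutdown(self):
        # Called from router/request threads: signal only, never destroy.
        self._stop.set()

    def join(self):
        self.thread.join()
```

With this shape, `unload`/`unload_all` callers only set the event and join, which is exactly the split the review proposes.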
🧹 Nitpick comments (6)
tools/server/tests/unit/test_router.py (2)
1-4: Replace `from utils import *` with explicit imports to satisfy linters and clarify dependencies

Ruff is flagging the star import / possibly-undefined names. Importing only what you use keeps the test clearer and avoids F403/F405 without adding complexity.

Consider:

```diff
-import pytest
-from utils import *
+import pytest
+from utils import ServerError, ServerPreset, ServerProcess
```

The rest of the file can stay as-is (the global `server: ServerProcess` annotation and the `ServerError` catch will then be fully explicit).

Also applies to: 8-9, 24-25, 42-43
19-50: Router streaming test logic is sound; TODO about pre-warming cache can remain a follow-up

The way the test distinguishes success/failure (via `ServerError` vs accumulated `content`) and checks `finish_reason`/`delta` fields is aligned with the streaming contract exercised in `make_stream_request`. The TODO about calling `ServerPreset.load_all()` to ensure models are cached is reasonable but not required to land this test; it can be handled in a later PR if you start seeing flakiness from cold HF downloads.

tools/server/tests/utils.py (1)
49-50: `model_hf_repo`/tinygemma3/router preset wiring looks consistent; consider excluding `router()` from `load_all()`

- Making `model_hf_repo` `str | None` and using `None` in `ServerPreset.router()` cleanly expresses the "no model attached" router process; `start()` already checks truthiness before adding `--hf-repo`, so this is safe.
- The tinygemma3 preset change (`model_hf_file = None`, `model_hf_repo = "ggml-org/tinygemma3-GGUF:Q8_0"`) matches the CLI `-hf user/model:quant` style and should simplify mmproj handling.

One minor design point: `ServerPreset.load_all()` now instantiates and starts every preset except `load_all`, including `router()`, even though the router has no model to cache. If you want `load_all()` to strictly mean "warm model caches," you could either filter out `router` there or add a short comment that starting the router as part of preload is intentional.

Also applies to: 517-545, 415-427
tools/server/server-models.h (1)
3-13: Make the header self-contained by including all standard library headers it relies on

This header uses `std::string`, `std::map`, `std::optional`, `std::vector`, and `std::thread` (via `server_model_meta`, `mapping`, `base_args`/`env`, and `instance_t`'s thread member) but only explicitly includes a subset of the needed headers. Relying on `common.h`/`server-http.h` to transitively provide these can make the header fragile.

Consider adding the missing includes here:

```diff
 #include "common.h"
 #include "server-http.h"
 #include <sheredom/subprocess.h>
 #include <queue>
 #include <mutex>
 #include <condition_variable>
 #include <functional>
 #include <memory>
+#include <map>
+#include <optional>
+#include <string>
+#include <thread>
+#include <vector>
```

That way `server-models.h` can be included on its own without depending on upstream headers to pull these in.

Also applies to: 54-115, 121-143
tools/server/server.cpp (1)
5592-5612: Fix typo in exception log message

The SRV_ERR line in `ex_wrapper` logs:

```cpp
SRV_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str());
```

`hanlding` should be `handling` for clearer logs:

```diff
-SRV_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str());
+SRV_ERR("got another exception: %s | while handling exception: %s\n", e.what(), message.c_str());
```

tools/server/server-models.cpp (1)
432-477: `pipe_t` and `server_http_proxy` streaming design is generally solid

The `pipe_t<T>` abstraction and `server_http_proxy` implementation nicely decouple the upstream HTTP connection from the downstream child server:

- The thread-safe queue with `writer_closed`/`reader_closed` flags prevents blocking writers when the client disconnects.
- `server_http_proxy::next` reads from the pipe using the caller's `should_stop` and stops cleanly on EOF or cancellation.
- The proxy thread uses `httplib::ResponseHandler` + `ContentReceiverWithProgress` to stream headers first and then body chunks, mapping them into `msg_t` pieces via the pipe.

One small enhancement you might consider (not urgent):

- Check and handle the boolean return from the initial `pipe->read(header, should_stop);` (e.g., treat `false` before any header as a transport error and keep the default 500 + minimal diagnostic body), to make failure behavior more explicit.

Otherwise, the streaming proxy looks correct and fits the surrounding server/http abstractions.

Also applies to: 479-559
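The close-flag discipline described above is easy to get subtly wrong, so here is a minimal single-producer/single-consumer sketch in Python. The flag names follow the review's description of `pipe_t<T>`; everything else is a hypothetical illustration, not the PR's C++ implementation:

```python
import threading
from collections import deque

class Pipe:
    """Thread-safe pipe with explicit close flags, loosely mirroring the
    pipe_t<T> design described in the review."""

    def __init__(self):
        self._queue = deque()
        self._cond = threading.Condition()
        self.writer_closed = False  # producer is done (EOF for the reader)
        self.reader_closed = False  # consumer is gone (writes become no-ops)

    def write(self, item) -> bool:
        with self._cond:
            if self.reader_closed:
                return False  # never block a writer once the client disconnected
            self._queue.append(item)
            self._cond.notify()
            return True

    def read(self):
        """Return the next item, or None on EOF (writer closed, queue drained)."""
        with self._cond:
            while not self._queue and not self.writer_closed:
                self._cond.wait()
            return self._queue.popleft() if self._queue else None

    def close_write(self):
        with self._cond:
            self.writer_closed = True
            self._cond.notify_all()

    def close_read(self):
        with self._cond:
            self.reader_closed = True
            self._cond.notify_all()
```

The key property is the asymmetry: a closed reader turns writes into cheap no-ops (the proxy thread can drain and exit), while a closed writer lets the reader drain remaining chunks before seeing EOF.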
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- README.md (1 hunks)
- scripts/sync_vendor.py (1 hunks)
- tools/server/CMakeLists.txt (1 hunks)
- tools/server/server-models.cpp (1 hunks)
- tools/server/server-models.h (1 hunks)
- tools/server/server.cpp (7 hunks)
- tools/server/tests/unit/test_router.py (1 hunks)
- tools/server/tests/utils.py (3 hunks)
- vendor/sheredom/subprocess.h (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{**/CMakeLists.txt,cmake/**/*.cmake}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{**/CMakeLists.txt,cmake/**/*.cmake}: Use CMake (Makefile is deprecated); maintain build configuration via CMake files
Require CMake 3.14+ in build configuration
Files:
tools/server/CMakeLists.txt
**/*.{c,cc,cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{c,cc,cpp,h,hpp}: Always format all C/C++ code with the repository’s .clang-format before committing
Use 4-space indentation in C/C++ code
Limit C/C++ lines to 120 columns
Place function-opening braces on the same line in C/C++
Use middle pointer alignment (e.g., void * ptr) in C/C++
Use middle reference alignment (e.g., int & ref) in C/C++
Ensure C/C++ code is compatible with C++17 (project standard)
Files:
vendor/sheredom/subprocess.h, tools/server/server.cpp, tools/server/server-models.cpp, tools/server/server-models.h
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Python code should pass flake8 per .flake8 (e.g., max-line-length=125, configured excludes)
Python code should pass pyright type checking per pyrightconfig.json
Files:
scripts/sync_vendor.py, tools/server/tests/utils.py, tools/server/tests/unit/test_router.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-09-18T02:06:49.548Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-18T02:06:49.548Z
Learning: Applies to include/llama.h : Treat changes to the public API header include/llama.h as high-risk and review carefully for API stability
Applied to files:
tools/server/CMakeLists.txt
🧬 Code graph analysis (4)
tools/server/server.cpp (3)
tools/server/utils.hpp (2)
- json_value (44-56)
- json_value (44-44)

tools/server/server-models.h (2)
- string (44-52)
- server_model_status_from_string (30-42)

tools/server/server-models.cpp (2)
- setup_child_server (388-425)
- setup_child_server (388-388)

tools/server/tests/unit/test_router.py (1)

tools/server/tests/utils.py (5)
- ServerProcess (44-408)
- router (533-545)
- start (111-257)
- ServerError (38-41)
- make_stream_request (297-318)

tools/server/server-models.cpp (2)

tools/server/server-models.h (1)
- string (44-52)

vendor/sheredom/subprocess.h (4)
- subprocess_create_ex (488-924)
- subprocess_stdout (930-932)
- subprocess_join (942-1000)
- subprocess_destroy (1002-1039)

tools/server/server-models.h (1)

tools/server/server-models.cpp (1)
- server_http_proxy (479-559)
🪛 GitHub Actions: EditorConfig Checker
tools/server/server-models.h
[error] 16-16: Trailing whitespace detected by editorconfig-checker.
🪛 Ruff (0.14.5)
tools/server/tests/unit/test_router.py
2-2: from utils import * used; unable to detect undefined names
(F403)
4-4: ServerProcess may be undefined, or defined from star imports
(F405)
9-9: ServerPreset may be undefined, or defined from star imports
(F405)
24-24: ServerError may be undefined, or defined from star imports
(F405)
42-42: ServerError may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-x64
🔇 Additional comments (6)
tools/server/CMakeLists.txt (1)
14-21: Wiring `server-models` into the `llama-server` target looks consistent

Adding `server-models.cpp`/`.h` alongside the existing server sources follows the current pattern and should integrate cleanly into the build.

README.md (1)

606-615: Documenting `subprocess.h` as a dependency is clear and consistent

The new entry cleanly describes the role of `subprocess.h` and matches its use as a vendor single-header dependency elsewhere in the PR.

scripts/sync_vendor.py (1)

19-22: Pinned vendor mapping for `subprocess.h` aligns with the new dependency

Adding the fixed-commit URL → `vendor/sheredom/subprocess.h` entry follows the existing vendor pattern and keeps the dependency reproducible. Just ensure this hash stays in sync with whatever version you expect in `vendor/` and the README.

vendor/sheredom/subprocess.h (1)

1-1202: Vendored subprocess header looks consistent and suitable for use

This header appears to be an unmodified sheredom `subprocess.h` with the expected API surface (`subprocess_create_ex`, `subprocess_join`, `subprocess_destroy`, async read helpers, etc.), and is integrated in a standard way from the call sites.

I would avoid local edits here and keep any lifecycle / concurrency fixes confined to callers like `server_models`, to stay aligned with upstream behavior.

tools/server/server-models.cpp (2)
388-425: Child server stdin monitor is a neat design; just ensure only stdin closure is used for shutdown

`server_models::setup_child_server`:

- Notifies the router that the model is ready via `POST /models/status`.
- Spawns a detached thread that blocks on `getchar()` and invokes `shutdown_handler(0)` on EOF.

This is a clean mechanism to let the router signal shutdown by closing the child's stdin, and to detect parent death as well.

Given that, the safest approach (in line with the other comments) is:

- Use only stdin closure (or `subprocess_terminate` if you really need a hard kill) for unload semantics.
- Avoid closing stdout/stderr from the router thread so the monitor thread and the main loop in the child can run to completion and clean up naturally.

With that constraint, this stdin-driven shutdown channel should be robust across platforms.
29-65: Unable to verify review comment: file not found in repository

The file `tools/server/server-models.cpp` does not exist in the llama.cpp repository (searched master branch and all branches). Only `server.cpp` and `server-http.cpp` are present in `tools/server/`.

This file appears to exist only on the PR branch being reviewed. Without access to the actual code, I cannot verify the specific claims about:

- `exec_path` computation and usage
- `proxy_request` stale `meta.port` behavior
- `subprocess_destroy` race conditions with management threads
- `get_free_port()` error handling
- `ensure_model_loaded()` FAILED state handling

Manually verify the code changes in the PR branch against the issues identified in the review comment, particularly:

- Whether `exec_path` is actually used or remains unused in `load()`
- Whether `meta.port` captured before `ensure_model_loaded()` creates a stale reference
- Whether concurrent `subprocess_destroy()` calls pose thread-safety risks
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tools/server/server.cpp (3)
5592-5612: Fix typo in error log message: "hanlding" → "handling"

Confirmed typo at line 5610 in `tools/server/server.cpp`:

```diff
-SRV_ERR("got another exception: %s | while hanlding exception: %s\n", e.what(), message.c_str());
+SRV_ERR("got another exception: %s | while handling exception: %s\n", e.what(), message.c_str());
```
5617-5623: Use standard C++ main signature to ensure portability

The three-argument `main(int argc, char** argv, char** envp)` form used at line 5617 is not part of the C++ standard; it is an implementation-defined extension. While widely supported on POSIX systems, it may fail or behave inconsistently on other platforms like Windows.

Since `envp` is passed to `server_models` at line 5669, refactor to one of these approaches:

- Restore `int main(int argc, char** argv)` and acquire the environment via platform APIs (`extern char **environ` on POSIX, `GetEnvironmentStringsW` on Windows), passing that to `server_models`; or
- Move the environment acquisition inside `server_models` to eliminate the caller's dependency on `envp`.

Verify that all target compilers accept the implementation before merging.
5617-5694: Router-mode detection via `params.model.path == DEFAULT_MODEL_PATH` is confirmed as fragile

The concern is valid. Router mode detection at line 5666 relies on a string comparison against a real file path (`"models/7B/ggml-model-f16.gguf"`), not a dedicated sentinel. This creates two problems:

1. Path collision risk: if a user explicitly passes the same path as `DEFAULT_MODEL_PATH`, they unexpectedly enter router mode (no local model loaded; all requests proxied to child instances).
2. Semantic coupling: any future change to `DEFAULT_MODEL_PATH` in `common/common.h:29` silently alters router detection semantics. The codebase even acknowledges this at line 5665 with `// TODO: improve this by changing arg.cpp`.

Currently, when `is_router_server` is true (line 5736), the process skips model loading and acts as a request router; when false (line 5774), it loads the model locally. There is no explicit `--router` flag to disambiguate the intent.

Replace the implicit path-based detection with an explicit router mode flag (e.g., a dedicated CLI switch or a boolean parameter in `common_params`) that cannot collide with real model file paths.
🧹 Nitpick comments (8)
tools/server/server-models.cpp (2)
29-65: `get_server_exec_path()` result currently unused; consider either wiring it or dropping it

`get_server_exec_path()` is a fairly involved helper with multiple platform branches, but in `load()` the `exec_path` value is computed and then never used:

```cpp
std::string exec_path = get_server_exec_path().string();
SRV_INF("spawning server instance ...");
```

Given that child processes are launched via `base_args` rather than this resolved path, you might either:

- Use `exec_path` to overwrite `child_args[0]` to ensure an absolute, stable path to the current binary, or
- Drop the call and local variable entirely for now to avoid doing extra filesystem/syscall work on every load.
124-182: Race condition is real; suggest adopting a robust port allocation pattern

The review comment is technically accurate. The bind/close/spawn pattern is a known anti-pattern, and the codebase exhibits the documented race: `get_free_port()` closes the socket at line 181, then `subprocess_create_ex()` spawns the child at line 257, creating a window where another process can claim that port before the child binds it.

Verified findings:

- Router uses multi-process worker spawning (`subprocess_create_ex`, line 257)
- Port is allocated at line 219, child spawned at line 257; no mitigation bridges this gap
- No existing synchronization or alternative strategies in place
- Standard robust alternatives: parent-creates-and-passes FD, socket activation, or `SO_REUSEPORT` (platform-specific)

Under typical usage this may be acceptable (low concurrency, sufficient time before collisions). However, high concurrency could trigger occasional "address already in use" failures. Adopting one of these patterns (especially FD passing or socket activation) would eliminate the race entirely and align with modern server practices.
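To make the race concrete, here is a small Python sketch with hypothetical helper names (the PR's helper is C++): the first function reproduces the bind/close pattern the comment flags, where the returned port is only a hint; the second shows the FD-passing alternative that has no close/re-bind window.

```python
import socket

def get_free_port() -> int:
    """Ask the OS for an ephemeral port, then release it.
    Mirrors the racy pattern: between close() and the child's bind(),
    another process may grab the same port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 = let the kernel pick
        return s.getsockname()[1]  # socket closes here; port is now up for grabs

def get_bound_socket() -> socket.socket:
    """Race-free alternative: the parent keeps the bound, listening socket
    open and hands the file descriptor to the child instead of a number."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen()
    return s  # pass s.fileno() to the child; no close/re-bind window exists
```

The same idea carries over to C++: instead of returning an `int` port from `get_free_port()`, the router would keep the listening descriptor and inherit it into the child at `subprocess_create_ex()` time.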
tools/server/server-models.h (2)
30-52: Header hygiene: include the standard headers for types you use

This header uses several standard library types that are not explicitly included here: `std::string`, `std::vector`, `std::map`, `std::optional`, `std::thread`.

They likely come in transitively via `common.h`/`server-http.h`, but that's not guaranteed and can break when upstream headers change.

Consider adding the minimal set of direct includes, for example:

```diff
 #include "common.h"
 #include "server-http.h"
 #include <sheredom/subprocess.h>
 #include <queue>
+#include <map>
 #include <mutex>
 #include <condition_variable>
 #include <functional>
 #include <memory>
+#include <optional>
+#include <string>
+#include <thread>
+#include <vector>
```

This keeps the header self-contained and more robust to refactors.
121-143: Consider documenting streaming/cleanup semantics of `server_http_proxy`

`server_http_proxy` owns a background thread and exposes a `cleanup` callback called from the destructor. Since this is part of the public API, it might help future readers/tests if you briefly document:

- That responses are streamed via `next` and backed by a detached thread.
- That `cleanup` closes the internal pipe to allow the thread to terminate.

No code change needed; just a short comment near the struct definition.
tools/server/tests/utils.py (1)
518-531: tinygemma3 HF repo change and router preset behavior

- Switching tinygemma3 to use the combined HF repo spec (`"ggml-org/tinygemma3-GGUF:Q8_0"`) with `model_hf_file = None` is consistent with the CLI args construction and should work as long as llama-server expects the revision-qualified repo string on `--hf-repo`.
- The new `ServerPreset.router()` correctly clears all model sources so the process starts in router mode only; the small context/batch/slots values are fine since no local inference happens.
- Note that `ServerPreset.load_all()` will now also instantiate the router preset, briefly starting and stopping a router-only server. If you intended `load_all()` purely as a model cache warmer, you may want to exclude `router` from that loop to avoid starting an extra control-plane instance during test setup.

tools/server/tests/unit/test_router.py (1)
1-4: Avoid `from utils import *` in tests

The wildcard import makes it harder to see which test dependencies are actually used and trips static analysis (F403/F405). It's more robust to import only what you need, e.g.:

```diff
-import pytest
-from utils import *
+import pytest
+from utils import ServerProcess, ServerPreset, ServerError
```

This keeps the namespace explicit while preserving current behavior.

tools/server/server.cpp (2)
tools/server/server.cpp (2)
5142-5155: Consider validating missing/emptymodelinproxy_post/proxy_getbefore delegatingBoth
proxy_getandproxy_postcallmodels->ensure_model_loaded(name)wherenamecomes from a query parameter or JSON body. If the client omitsmodel(or sends an empty string), this will pass an empty name through toensure_model_loaded, leaving behavior entirely toserver_models.To make the router API fail fast and more user-friendly, you could treat an empty model name as a client error:
- std::string name = json_value(body, "model", std::string()); - models->ensure_model_loaded(name); - return models->proxy_request(req, method, name); + std::string name = json_value(body, "model", std::string()); + if (name.empty()) { + auto res = std::make_unique<server_res_generator>(ctx_server); + res->error(format_error_response("\"model\" must be provided", ERROR_TYPE_INVALID_REQUEST)); + return res; + } + models->ensure_model_loaded(name); + return models->proxy_request(req, method, name);A similar guard could be added to
proxy_getif you don’t want/... ?model=to fall back to the router’s own identity.
5733-5800: Router vs non-router startup/cleanup flow looks consistent, but duplicate `is_ready` set

- The new `clean_up` lambdas for router and non-router correctly encapsulate the different teardown logic (`models->unload_all()` vs. queue termination + `ctx_http.stop()`), and they're called both on early exit paths and at the end of the main loop.
- Router mode now starts HTTP without loading a model and skips `queue_tasks.start_loop()`, while non-router mode preserves the existing path and additionally registers as a child with `server_models::setup_child_server` when `LLAMA_SERVER_ROUTER_PORT` is set; this separation is clear and avoids interleaving the two lifecycles.
- Minor nit: in router mode `ctx_http.is_ready.store(true)` is called both before `ctx_http.start()` and again just before joining the HTTP thread; one of these writes is redundant.

Functionally this all hangs together; the only actionable change I'd suggest here is to remove the duplicate `is_ready` store for clarity.

Also applies to: 5816-5843
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- README.md (1 hunks)
- scripts/sync_vendor.py (1 hunks)
- tools/server/CMakeLists.txt (1 hunks)
- tools/server/server-models.cpp (1 hunks)
- tools/server/server-models.h (1 hunks)
- tools/server/server.cpp (7 hunks)
- tools/server/tests/unit/test_router.py (1 hunks)
- tools/server/tests/utils.py (3 hunks)
- vendor/sheredom/subprocess.h (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
{**/CMakeLists.txt,cmake/**/*.cmake}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
{**/CMakeLists.txt,cmake/**/*.cmake}: Use CMake (Makefile is deprecated); maintain build configuration via CMake files
Require CMake 3.14+ in build configuration
Files:
tools/server/CMakeLists.txt
**/*.{c,cc,cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{c,cc,cpp,h,hpp}: Always format all C/C++ code with the repository’s .clang-format before committing
Use 4-space indentation in C/C++ code
Limit C/C++ lines to 120 columns
Place function-opening braces on the same line in C/C++
Use middle pointer alignment (e.g., void * ptr) in C/C++
Use middle reference alignment (e.g., int & ref) in C/C++
Ensure C/C++ code is compatible with C++17 (project standard)
Files:
vendor/sheredom/subprocess.h, tools/server/server-models.h, tools/server/server-models.cpp, tools/server/server.cpp
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Python code should pass flake8 per .flake8 (e.g., max-line-length=125, configured excludes)
Python code should pass pyright type checking per pyrightconfig.json
Files:
tools/server/tests/unit/test_router.py, scripts/sync_vendor.py, tools/server/tests/utils.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🪛 GitHub Actions: EditorConfig Checker
tools/server/server-models.h
[error] 16-16: Trailing whitespace found on line 16. Remove trailing whitespace.
🪛 Ruff (0.14.5)
tools/server/tests/unit/test_router.py
2-2: from utils import * used; unable to detect undefined names
(F403)
4-4: ServerProcess may be undefined, or defined from star imports
(F405)
9-9: ServerPreset may be undefined, or defined from star imports
(F405)
24-24: ServerError may be undefined, or defined from star imports
(F405)
42-42: ServerError may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Run WebUI tests
- GitHub Check: server-windows
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
- GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ios-xcode-build
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-musa
🔇 Additional comments (6)
README.md (1)
606-615: Documenting subprocess.h dependency looks correctThe new subprocess.h entry is consistent with the vendored header path and usage; no issues from a docs or dependency-discovery standpoint.
scripts/sync_vendor.py (1)
19-22: Pinned subprocess.h vendor mapping is reasonableAdding subprocess.h to the vendor map with a commit-pinned raw URL and a clear destination path is consistent with the existing pattern and should integrate cleanly with the sync script.
tools/server/CMakeLists.txt (1)
14-21: Including server-models.{cpp,h} in the server target is correctWiring server-models.cpp/server-models.h into TARGET_SRCS matches the new include in server.cpp and ensures router/model-management code is actually built and linked.
tools/server/tests/utils.py (1)
45-55: Allowing `model_hf_repo` to be nullable looks correct

Relaxing `model_hf_repo` to `str | None` matches how it's used in `start()` (the flag is only emitted when truthy) and is required for the `router()` preset; no behavioral regressions apparent.

tools/server/server.cpp (1)
4453-4457: Router model-management endpoints and `server_models` ownership look structurally sound

- Storing `server_models` as a `std::unique_ptr` inside `server_routes` keeps router-specific state co-located with the HTTP handlers and avoids extra globals.
- The router endpoints (`get_router_props`, `get_router_models`, `/models/load`, `/models/unload`, `/models/status`) correctly delegate to `server_models` and use the shared `format_error_response` machinery for consistent error payloads.
- `get_router_props` falling back to `proxy_get` when `model` is provided allows `/props?model=...` to behave like the legacy single-model endpoint, while `/props` without a model exposes router metadata.
- All handlers that dereference `models` are only wired when router mode is active, so null dereferences are avoided as long as `server_models` construction succeeds.

No functional issues stand out here; behavior mostly depends on the correctness of `server_models` itself.

Also applies to: 5115-5222
tools/server/tests/unit/test_router.py (1)
6-50: The review comment's primary claim is incorrect

The autouse fixture in `test_router.py` does not "never stop the server." The `conftest.py` file contains an autouse fixture `stop_server_after_each_test()` that runs after each test and calls `.stop()` on all instances tracked in `server_instances`. When `ServerProcess.start()` is called, it adds itself to the `server_instances` set, and `stop()` removes it and kills the process.

The conftest cleanup executes between each parametrized test iteration, so the resource leakage and port conflict concerns described in the review do not actually occur; servers are properly cleaned up after each test.
That said, the suggested refactor remains valid as a design improvement. Using global state with autouse fixtures is an anti-pattern even when cleanup is guaranteed elsewhere. A scoped fixture per test (as proposed) would be cleaner and more maintainable. However, the refactor is a stylistic improvement rather than a fix for an actual resource leak.
Likely an incorrect or invalid review comment.
…_v1_2' into allozaur/server_model_management_v1_2
…v1_2' into xsn/server_model_management_v1_2
…' into xsn/server_model_management_v1_2
The feature is implemented using subprocesses: there is one main "router" server whose job is to create other "child" processes that actually run the inference.
This system was designed and tested against these unexpected cases:

GGML_ASSERT)

These steps happen when the user requests the router to launch a model instance:
[port_number]

If the child process exits, the router server knows as soon as the child's stdout/stderr is closed. In reverse, if the router process exits (or the router asks the child process to stop), it simply closes the stdin of the child. The child detects this and internally calls its SIGTERM/SIGINT handler.
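This lifetime signalling can be demonstrated with a small, self-contained sketch. It uses hypothetical Python stand-ins for the C++ router and child (not the PR's code): the parent closes the child's stdin, and the child treats EOF on stdin as a shutdown request, analogous to the child's `getchar()` monitor thread.

```python
import subprocess
import sys

# Child body: block reading stdin; EOF (the parent closed the pipe, or the
# parent died and the pipe collapsed) means "shut down cleanly".
CHILD = r"""
import sys
sys.stdin.buffer.read()   # returns only at EOF
print("child: stdin closed, shutting down", flush=True)
"""

def demo() -> str:
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdin.close()        # router -> child: please stop
    out = child.stdout.read()  # child prints its farewell, then exits (EOF here)
    child.wait(timeout=10)
    return out.strip()
```

The symmetry is the point: the parent watches the child's stdout/stderr for EOF to learn the child died, and the child watches its own stdin for EOF to learn the parent is gone or wants it stopped; no signals or sockets are needed for either direction.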
Other changes included in the PR:
- `DEFAULT_MODEL_PATH`: if `-m, --model` is not specified, `common_params_parse_ex` will return an error (except for server)

Summary by CodeRabbit
New Features
Documentation
Tests