server: introduce API for serving / loading / unloading multiple models #17470
Conversation
ggerganov left a comment
Very cool stuff!
WebUI Changes

The WebUI was refactored to support both single-model (MODEL) and multi-model (ROUTER) server modes.

High-Level Architecture

```mermaid
flowchart TB
subgraph Routes["📍 Routes"]
R1["/ (Welcome)"]
R2["/chat/[id]"]
RL["+layout.svelte"]
end
subgraph Components["🧩 Components"]
C_Sidebar["ChatSidebar"]
C_Screen["ChatScreen"]
C_Form["ChatForm"]
C_Messages["ChatMessages"]
C_ModelsSelector["ModelsSelector"]
C_Settings["ChatSettings"]
end
subgraph Hooks["🪝 Hooks"]
H1["useModelChangeValidation"]
H2["useProcessingState"]
end
subgraph Stores["🗄️ Stores"]
S1["chatStore<br/><i>Chat interactions & streaming</i>"]
S2["conversationsStore<br/><i>Conversation data & messages</i>"]
S3["modelsStore<br/><i>Model selection & loading</i>"]
S4["serverStore<br/><i>Server props & role detection</i>"]
S5["settingsStore<br/><i>User configuration</i>"]
end
subgraph Services["⚙️ Services"]
SV1["ChatService"]
SV2["ModelsService"]
SV3["PropsService"]
SV4["DatabaseService"]
SV5["ParameterSyncService"]
end
subgraph Storage["💾 Storage"]
ST1["IndexedDB<br/><i>conversations, messages</i>"]
ST2["LocalStorage<br/><i>config, userOverrides</i>"]
end
subgraph APIs["🌐 llama-server API"]
API1["/v1/chat/completions"]
API2["/props"]
API3["/models/*"]
API4["/v1/models"]
end
R1 & R2 --> C_Screen
RL --> C_Sidebar
C_Screen --> C_Form & C_Messages & C_Settings
C_Form & C_Messages --> C_ModelsSelector
C_Form & C_Messages --> H1 & H2
H1 --> S3 & S4
H2 --> S1 & S5
C_Screen --> S1 & S2
C_Sidebar --> S2
C_ModelsSelector --> S3 & S4
C_Settings --> S5
S1 --> SV1 & SV4
S2 --> SV4
S3 --> SV2 & SV3
S4 --> SV3
S5 --> SV5
SV4 --> ST1
SV5 --> ST2
SV1 --> API1
SV2 --> API3 & API4
SV3 --> API2
```
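To make the store/service wiring above concrete, here is a minimal TypeScript sketch of role detection and model listing against the endpoints shown in the diagram (/props, /v1/models, /models). The function names, response field names, and types are illustrative assumptions, not the actual WebUI code.

```ts
// Illustrative sketch only; field names and response shapes are assumptions,
// not the actual WebUI service code.
type ServerRole = 'model' | 'router';

interface ServerProps {
  role?: ServerRole;     // per the MODEL/ROUTER responses in the diagram
  modalities?: string[]; // e.g. ['vision', 'audio']
}

interface ModelEntry {
  id: string;
  status?: 'loaded' | 'available'; // status values assumed from the diagram
}

const BASE = 'http://localhost:8080'; // assumed llama-server address

async function getServerProps(model?: string): Promise<ServerProps> {
  const url = model ? `${BASE}/props?model=${encodeURIComponent(model)}` : `${BASE}/props`;
  const res = await fetch(url);
  return res.json();
}

async function listModels(role: ServerRole): Promise<ModelEntry[]> {
  // MODEL mode exposes the single auto-selected model via the OAI-compat /v1/models;
  // ROUTER mode exposes all models, with their load status, via /models.
  const res = await fetch(role === 'router' ? `${BASE}/models` : `${BASE}/v1/models`);
  const body = await res.json();
  return body.models ?? body.data ?? []; // response envelope is an assumption
}
```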
MODEL vs ROUTER Mode Data Flow Comparison

```mermaid
sequenceDiagram
participant User as 👤 User
participant UI as 🧩 UI
participant Stores as 🗄️ Stores
participant DB as 💾 IndexedDB
participant API as 🌐 llama-server
Note over User,API: 🚀 Initialization
UI->>Stores: initialize()
Stores->>DB: load conversations
Stores->>API: GET /props
alt MODEL mode
API-->>Stores: {role: "model", modalities: [...]}
Stores->>API: GET /v1/models
API-->>Stores: single model (auto-selected)
else ROUTER mode
API-->>Stores: {role: "router"}
Stores->>API: GET /models
API-->>Stores: models[] with status (loaded/available)
loop each loaded model
Stores->>API: GET /props?model=X
API-->>Stores: modalities (vision/audio)
end
end
Note over User,API: 🔄 Model Selection
alt MODEL mode
Note right of Stores: Model auto-selected<br/>No user action needed
else ROUTER mode
User->>UI: select model
alt model not loaded
Stores->>API: POST /models/load
loop poll status
Stores->>API: GET /models
end
Stores->>API: GET /props?model=X
end
Stores->>Stores: validate modalities vs conversation
alt invalid modalities
Stores->>API: POST /models/unload
UI->>User: error toast
end
end
Note over User,API: 💬 Chat Flow
User->>UI: send message
Stores->>DB: save user message
alt MODEL mode
Stores->>API: POST /v1/chat/completions
loop streaming
API-->>Stores: SSE chunks
Stores-->>UI: reactive update
end
else ROUTER mode
Stores->>API: POST /v1/chat/completions {model: X}
Note right of API: router forwards to model
loop streaming
API-->>Stores: SSE chunks + model info
Stores-->>UI: reactive update
end
Stores->>DB: save model used in response
end
Stores->>DB: save assistant message
```
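As a rough illustration of the chat flow above, the following TypeScript sketch posts a streamed /v1/chat/completions request with the "model" field set (as in ROUTER mode) and prints the SSE delta chunks. The chunk handling follows the usual OAI-compat streaming shape; details such as error handling are simplified assumptions.

```ts
// Illustrative sketch of the streamed chat flow; not the actual chatStore/ChatService code.
async function streamChat(model: string, messages: { role: string; content: string }[]) {
  const res = await fetch('http://localhost:8080/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // In ROUTER mode the router forwards the request to the model named here;
    // per the PR description, a model that is not loaded yet is loaded on demand.
    body: JSON.stringify({ model, messages, stream: true }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffered = '';

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });

    // SSE events are separated by blank lines; each "data:" line carries a JSON chunk.
    const events = buffered.split('\n\n');
    buffered = events.pop() ?? ''; // keep any trailing partial event for the next read
    for (const event of events) {
      const data = event.replace(/^data: ?/, '').trim();
      if (!data || data === '[DONE]') continue;
      const chunk = JSON.parse(data);
      process.stdout.write(chunk.choices?.[0]?.delta?.content ?? '');
    }
  }
}
```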
Summary of Changes

🏗️ Architecture Refactoring

Services (API layer):
Stores (Reactive state):
Code organization:
🎯 Model Management (main feature)

New components:
Business logic:
Modality detection:
🧩 New UI Components

Badges & statistics:
Attachments:
Other:
🗑️ Removed Components
⚙️ Settings Changes
📁 Types and Enums
🧪 Tests Reorganization
Nice, thanks! Merging this once CI is all green. (Note: I'm running the same CI in my fork to skip the long waiting line.)
allozaur left a comment
All is looking good on my end, let's ship it 🚀
Self-note: this also needs to be included in the server's changelog.
Close #16487
Close #16256
Close #17556
For more details on the WebUI changes, please refer to this comment from @allozaur: #17470 (comment)
This PR introduces the ability to use multiple models and to load/unload them on the fly in llama-server.

The API was designed to take advantage of the OAI-compat /v1/models endpoint, as well as the "model" field in the body payload of POST requests such as /v1/chat/completions. By default, if the requested model is not yet loaded, it will be loaded automatically on demand.

This is the first version of the feature and aims to be experimental. Here is the list of capabilities:
- selecting the target model per request via the "model" field

Other features like downloading new models, deleting cached models, real-time events, etc. are planned for the next iteration.
Example commands:
For the full documentation, please refer to the "Using multiple models" section of the new documentation.
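Below is a hedged TypeScript sketch of how a client might drive the new model-management endpoints: load a model via POST /models/load, poll GET /models until it reports a loaded status, then unload it via POST /models/unload. The request and response body shapes are assumptions; consult the "Using multiple models" documentation for the authoritative format.

```ts
// Sketch only: the /models/load and /models/unload payloads shown here are
// assumptions based on the sequence diagram, not a documented contract.
const SERVER = 'http://localhost:8080';

async function loadModel(model: string): Promise<void> {
  await fetch(`${SERVER}/models/load`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model }),
  });

  // Poll GET /models until the requested model reports a loaded status.
  for (;;) {
    const res = await fetch(`${SERVER}/models`);
    const body = await res.json();
    const entry = (body.models ?? body.data ?? []).find((m: any) => m.id === model);
    if (entry?.status === 'loaded') return; // 'status' field name is an assumption
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
}

async function unloadModel(model: string): Promise<void> {
  await fetch(`${SERVER}/models/unload`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model }),
  });
}
```

This polling loop corresponds to the "poll status" loop in the WebUI sequence diagram; a real client would add timeouts and error handling.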
Note: waiting for further webui changes from @allozaur
Screen.Recording.2025-11-24.at.15.20.05.mp4
Implementation
The feature was implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.
Most of the implementation is confined inside tools/server/server-models.cpp.

There is one main "router" server whose job is to create the "child" processes that actually run the inference.
This system was designed and tested against unexpected cases such as a child process crashing (for example, hitting a GGML_ASSERT).

These steps happen when the user requests the router to launch a model instance:
- the router picks a free [port_number] and spawns a child process on it
- the child loads the model and reports its ready status back to the router via the API
- if the child process exits, the router server knows about it as soon as the child's stdout/stderr is closed
In reverse, from the router server:
- the router can request a shutdown by sending an exit command via the child's stdin
- if the router dies and the child's stdin is closed, the child calls exit(1), which causes the child process to exit immediately

```mermaid
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end
```
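The real router/child management is implemented in C++ inside tools/server/server-models.cpp; purely to illustrate the pipe-based lifecycle protocol described above, here is a Node.js/TypeScript sketch in which the parent spawns a child, treats a closed stdout/stderr pipe as "child exited", requests shutdown by writing an exit command to stdin, and the child force-exits when its stdin closes. The command line, messages, and helper names are illustrative assumptions.

```ts
// Illustrative Node.js sketch of the lifecycle protocol only;
// the actual implementation is C++ (tools/server/server-models.cpp).
import { spawn } from 'node:child_process';

function launchChild(modelPath: string, port: number) {
  // Hypothetical command line; the real router passes its own set of args.
  const child = spawn('llama-server', ['-m', modelPath, '--port', String(port)], {
    stdio: ['pipe', 'pipe', 'inherit'],
  });

  // The router notices a dead child as soon as its stdout pipe is closed.
  child.stdout?.on('data', (buf) => console.log(`[child:${port}]`, buf.toString().trim()));
  child.stdout?.on('close', () => console.log(`child on port ${port} exited`));

  return {
    // Graceful shutdown: ask the child to exit by writing a command to its stdin.
    shutdown: () => child.stdin?.write('exit\n'),
  };
}

// Child side, conceptually: if stdin closes, the router is gone -- exit immediately.
//   process.stdin.on('end',  () => process.exit(1));
//   process.stdin.on('data', (d) => { if (d.toString().trim() === 'exit') shutdownAndExit(); });
```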
Other changes included in the PR:
- DEFAULT_MODEL_PATH is removed; if -m, --model is not specified, common_params_parse_ex will return an error (except for the server)

AI usage disclosure: Most of the code here is human-written, except for:
- the pipe_t implementation used by server_http_proxy
- the get_free_port() function