
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Nov 24, 2025

Close #16487
Close #16256
Close #17556

For more details on the WebUI changes, please refer to this comment from @allozaur: #17470 (comment)

This PR introduces the ability to use multiple models in llama-server and to load/unload them on the fly.

The API was designed to take advantage of the OAI-compatible /v1/models endpoint, as well as the "model" field in the body of POST requests like /v1/chat/completions. By default, if the requested model is not yet loaded, it will be loaded automatically on-demand.
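
As an illustration, a routed request could look like the following; the model name is a placeholder and the payload is a standard OAI-compatible chat completion:

# "model" selects which instance handles the request; if it is not loaded yet,
# the router loads it on-demand (with default settings)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-3b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'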

This is the first version of the feature and is intended to be experimental. Here is the list of capabilities:

  • API for listing, loading and unloading models (see the curl sketch after this list)
  • Routing requests based on the "model" field
  • Limiting the maximum number of models loaded at the same time
  • Allowing models to be loaded from a local directory
  • (Advanced) allowing custom per-model config to be specified via the API
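
A rough sketch of the model-management API (the endpoint paths are the ones referenced elsewhere in this PR; the exact request/response shapes may differ):

# list models (OAI-compatible), and list models with llama-server-specific status
curl http://localhost:8080/v1/models
curl http://localhost:8080/models

# explicitly load / unload a model by name (normally loading happens on-demand);
# the {"model": ...} payload shape is an assumption for illustration
curl http://localhost:8080/models/load -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-3b-instruct"}'
curl http://localhost:8080/models/unload -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-3b-instruct"}'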

Other features like downloading new models, deleting cached models, real-time events, etc. are planned for the next iteration.

Example commands:

# start the server as router (using models in cache)
llama-server

# use GGUFs from a local directory - see directory structure in README.md
llama-server --models-dir ./my_models

# specify default arguments to be passed to models
llama-server -n 128 -c 8192 -ngl 4

# allow setting the arguments per-model via the API (warning: only use this in a trusted network)
llama-server --models-allow-extra-args

For full details, please refer to the "Using multiple models" section of the new documentation.

Note: waiting for further webui changes from @allozaur

Screen.Recording.2025-11-24.at.15.20.05.mp4

Implementation

The feature was implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.

Most of the implementation is confined inside tools/server/server-models.cpp

There is one main "router" server whose job is to create "child" processes that actually run the inference.

This system was designed and tested against these unexpected cases:

  • The child process suddenly exits due to an error (for example, a GGML_ASSERT)
  • The child process fails to load (for example, the system cannot launch the process)
  • The router process suddenly exits due to an error. In this case, the child processes automatically stop themselves.

These steps happen when the user requests the router to launch a model instance:

  1. Check if the model already has a running process; if yes, skip
  2. Construct argv and envp to launch the child process; a random HTTP port is selected for each child process
  3. Start the child process
  4. Create a thread to read the child's stdout/stderr and forward it to the main process, prefixed with [port_number]
  5. Inside the child process, notify the router server of its "ready" status, then spawn a thread to monitor stdin

If the child process exits, the router server knows as soon as its stdout/stderr is closed.

In the other direction, from the router server:

  • If the router server sends a special command via stdin, the child process detects this command, calls its cleanup function and exits gracefully
  • If the router server crashes, stdin is closed. This triggers exit(1), which causes the child process to exit immediately
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end

Other changes included in the PR:

  • Added subprocess.h as a new vendored dependency
  • Removed DEFAULT_MODEL_PATH
  • If -m, --model is not specified, common_params_parse_ex will return an error (except for the server)

AI usage disclosure: most of the code here is human-written, except for:

  • pipe_t implementation used by server_http_proxy
  • get_free_port() function

Member

@ggerganov ggerganov left a comment

Very cool stuff!

@allozaur
Collaborator

allozaur commented Dec 1, 2025

WebUI Changes

The WebUI was refactored to support both single-model (MODEL mode) and multi-model (ROUTER mode) operation.

High-Level Architecture

flowchart TB
    subgraph Routes["📍 Routes"]
        R1["/ (Welcome)"]
        R2["/chat/[id]"]
        RL["+layout.svelte"]
    end

    subgraph Components["🧩 Components"]
        C_Sidebar["ChatSidebar"]
        C_Screen["ChatScreen"]
        C_Form["ChatForm"]
        C_Messages["ChatMessages"]
        C_ModelsSelector["ModelsSelector"]
        C_Settings["ChatSettings"]
    end

    subgraph Hooks["🪝 Hooks"]
        H1["useModelChangeValidation"]
        H2["useProcessingState"]
    end

    subgraph Stores["🗄️ Stores"]
        S1["chatStore<br/><i>Chat interactions & streaming</i>"]
        S2["conversationsStore<br/><i>Conversation data & messages</i>"]
        S3["modelsStore<br/><i>Model selection & loading</i>"]
        S4["serverStore<br/><i>Server props & role detection</i>"]
        S5["settingsStore<br/><i>User configuration</i>"]
    end

    subgraph Services["⚙️ Services"]
        SV1["ChatService"]
        SV2["ModelsService"]
        SV3["PropsService"]
        SV4["DatabaseService"]
        SV5["ParameterSyncService"]
    end

    subgraph Storage["💾 Storage"]
        ST1["IndexedDB<br/><i>conversations, messages</i>"]
        ST2["LocalStorage<br/><i>config, userOverrides</i>"]
    end

    subgraph APIs["🌐 llama-server API"]
        API1["/v1/chat/completions"]
        API2["/props"]
        API3["/models/*"]
        API4["/v1/models"]
    end

    R1 & R2 --> C_Screen
    RL --> C_Sidebar
    C_Screen --> C_Form & C_Messages & C_Settings
    C_Form & C_Messages --> C_ModelsSelector
    C_Form & C_Messages --> H1 & H2
    H1 --> S3 & S4
    H2 --> S1 & S5
    C_Screen --> S1 & S2
    C_Sidebar --> S2
    C_ModelsSelector --> S3 & S4
    C_Settings --> S5
    S1 --> SV1 & SV4
    S2 --> SV4
    S3 --> SV2 & SV3
    S4 --> SV3
    S5 --> SV5
    SV4 --> ST1
    SV5 --> ST2
    SV1 --> API1
    SV2 --> API3 & API4
    SV3 --> API2

MODEL vs ROUTER Mode Data Flow Comparison

sequenceDiagram
    participant User as 👤 User
    participant UI as 🧩 UI
    participant Stores as 🗄️ Stores
    participant DB as 💾 IndexedDB
    participant API as 🌐 llama-server

    Note over User,API: 🚀 Initialization

    UI->>Stores: initialize()
    Stores->>DB: load conversations
    Stores->>API: GET /props

    alt MODEL mode
        API-->>Stores: {role: "model", modalities: [...]}
        Stores->>API: GET /v1/models
        API-->>Stores: single model (auto-selected)
    else ROUTER mode
        API-->>Stores: {role: "router"}
        Stores->>API: GET /models
        API-->>Stores: models[] with status (loaded/available)
        loop each loaded model
            Stores->>API: GET /props?model=X
            API-->>Stores: modalities (vision/audio)
        end
    end

    Note over User,API: 🔄 Model Selection

    alt MODEL mode
        Note right of Stores: Model auto-selected<br/>No user action needed
    else ROUTER mode
        User->>UI: select model
        alt model not loaded
            Stores->>API: POST /models/load
            loop poll status
                Stores->>API: GET /models
            end
            Stores->>API: GET /props?model=X
        end
        Stores->>Stores: validate modalities vs conversation
        alt invalid modalities
            Stores->>API: POST /models/unload
            UI->>User: error toast
        end
    end

    Note over User,API: 💬 Chat Flow

    User->>UI: send message
    Stores->>DB: save user message

    alt MODEL mode
        Stores->>API: POST /v1/chat/completions
        loop streaming
            API-->>Stores: SSE chunks
            Stores-->>UI: reactive update
        end
    else ROUTER mode
        Stores->>API: POST /v1/chat/completions {model: X}
        Note right of API: router forwards to model
        loop streaming
            API-->>Stores: SSE chunks + model info
            Stores-->>UI: reactive update
        end
        Stores->>DB: save model used in response
    end

    Stores->>DB: save assistant message

Summary of Changes

🏗️ Architecture Refactoring

Services (API layer):

  • Removed SlotsService → logic moved to reactive state in chatStore
  • Moved DatabaseStore → DatabaseService (from /stores to /services)
  • New PropsService for /props and /props?model=
  • Extended ModelsService with /models/load, /models/unload
  • New api-headers.ts for centralized Bearer auth

Stores (Reactive state):

  • New conversationsStore extracted from chatStore
  • chatStore: abort controllers per-conversation, streaming state per-conversation
  • serverStore: ROUTER vs MODEL detection from /props.role
  • modelsStore: multi-model support, per-model props cache, per-model loading states
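
For reference, the mode detection and per-model props lookup can be reproduced outside the WebUI roughly like this (field names follow the sequence diagrams above; jq is used only for readability, and the model name is a placeholder):

# "router" when llama-server acts as a multi-model router, "model" otherwise
curl -s http://localhost:8080/props | jq '.role'

# per-model props (modalities, etc.) for a loaded model
curl -s 'http://localhost:8080/props?model=qwen2.5-3b-instruct' | jq '.modalities'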

Code organization:

  • Section categories in stores/services (State, Lifecycle, Utilities, etc.)
  • Centralized exports: types/index.ts, utils/index.ts, enums/index.ts

🎯 Model Management (main feature)

New components:

  • ModelsSelector with "Loaded"/"Available" groups
  • ModelBadge with status icon
  • DialogModelInformation (context, modalities, parameters)
  • DialogModelNotAvailable for unavailable model handling

Business logic:

  • Auto-select model from last assistant response (model field in message)
  • Automatic model loading on selection (ROUTER mode)
  • Modality validation via useModelChangeValidation hook
  • Filter available models by conversation's required modalities
  • Polling model status after load/unload (API workaround)
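
The polling workaround from the last bullet corresponds to something like the following sketch (the JSON shape of the /models response and the lowercase "loaded" status string are assumptions based on this description):

# ask the router to load a model, then poll /models until it reports "loaded"
curl -s http://localhost:8080/models/load -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-3b-instruct"}'

while true; do
  # NOTE: .models[].name / .status is an assumed response shape
  status=$(curl -s http://localhost:8080/models \
    | jq -r '.models[] | select(.name == "qwen2.5-3b-instruct") | .status')
  [ "$status" = "loaded" ] && break
  sleep 1
done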

Modality detection:

  • getModelModalities() - fetch from /props?model=
  • modelSupportsVision(), modelSupportsAudio() - query helpers
  • Per-model props cache with invalidation on status change

🧩 New UI Components

Badges & statistics:

  • BadgeChatStatistic - generation stats (tokens, time, t/s)
  • BadgeInfo, BadgeModality
  • ChatMessageStatistics - stats section below message

Attachments:

  • ChatAttachmentPreview - unified preview component
  • ChatAttachmentThumbnailFile, ChatAttachmentThumbnailImage
  • DialogChatAttachmentPreview - fullscreen preview
  • DialogChatAttachmentsViewAll - attachment grid dialog

Other:

  • SyntaxHighlightedCode with language detection
  • CopyToClipboardIcon - reusable copy button
  • Renamed dialogs to Dialog* pattern
  • ShadCN table/* and alert/* components

🗑️ Removed Components

  • ServerInfo → replaced by ModelsSelector + ModelBadge
  • ChatFormModelSelector → replaced by ModelsSelector
  • SlotsService → logic moved to stores
  • Various dialog components → renamed to Dialog* pattern

⚙️ Settings Changes

  • Added: autoMicOnEmpty, disableAutoScroll, enableContinueGeneration
  • Removed: showTokensPerSecond, showStatistics, showModelInfo, modelSelectorEnabled
  • Changed: pdfAsImage - automatic fallback to text for non-vision models

📁 Types and Enums

  • ServerRole: MODEL | ROUTER
  • ServerModelStatus: LOADED | UNLOADED | LOADING | FAILED
  • AttachmentType, ModelSource
  • Extended API types for /models responses

🧪 Tests Reorganization

  • Moved tests to /tests/ (from /src/)
    • /tests/client/ - component tests
    • /tests/e2e/ - E2E tests
    • /tests/stories/ - Storybook stories

@allozaur
Collaborator

allozaur commented Dec 1, 2025

> For vis, there is still a small bug on the webui where the model gets loaded automatically even when the user doesn't send any messages.
>
> Waiting for the fix from @allozaur, then I guess we can merge!

Just pushed the last fixes 😄 @ngxson

@ngxson
Collaborator Author

ngxson commented Dec 1, 2025

Nice, thanks! Merging this once all CI checks are green.

(Note: I'm running the same CI in my fork to skip the long queue)

l2k36hk previously approved these changes Dec 1, 2025
@allozaur allozaur dismissed l2k36hk’s stale review December 1, 2025 17:07

Not a review from a maintainer

Collaborator

@allozaur allozaur left a comment

All is looking good on my end, let's ship it 🚀

@allozaur allozaur linked an issue Dec 1, 2025 that may be closed by this pull request
@allozaur
Collaborator

allozaur commented Dec 1, 2025

Also, the changes added in this PR should fix #17556. @pwilkin please check and let me know if the problem is resolved on your end :)

@ngxson ngxson merged commit ec18edf into ggml-org:master Dec 1, 2025
66 of 69 checks passed
@ngxson
Collaborator Author

ngxson commented Dec 1, 2025

Self-note: this also needs to be included in the server's changelog
