Misc. bug: --no-warmup failing in llama-server.exe for some vision models #17676

@SmartestWashingMachine

Description

Name and Version

load_backend: loaded CUDA backend from C:\Users\metal\OneDrive\Desktop\mar20\src\models\llamacpp_gpu\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\metal\OneDrive\Desktop\mar20\src\models\llamacpp_gpu\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\metal\OneDrive\Desktop\mar20\src\models\llamacpp_gpu\ggml-cpu-alderlake.dll
version: 7222 (746f9ee)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe -m qwen3vl2b.gguf --mmproj q41_mmproj.gguf --no-warmup -c 4000

Problem description & steps to reproduce

Trying out the manual warmup added in #17652, but it fails on the first request (my warmup call) with the error: ggml_new_object: not enough space in the context's memory pool (needed 330192, available 16)

(I can't check with llama-mtmd-cli.exe, as --no-warmup is an invalid argument there.)

Not sure if this is an actual issue or if I just misunderstood the feature. If I understand it correctly, I need to:

  1. Start up server.
  2. Make a dummy "warmup" chat call with the desired image warmup size.

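For reference, the dummy warmup call in step 2 would just be any OpenAI-compatible chat request carrying an image at the intended warmup resolution. A minimal sketch in Python of how such a payload can be built (the server URL, port, and image path here are assumptions, not part of the report):

```python
import base64
import json

# Hypothetical values -- adjust to your own setup.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"
IMAGE_PATH = "warmup.png"  # a blank image at the desired warmup size

def build_warmup_payload(image_path: str) -> dict:
    """Build an OpenAI-compatible chat payload with one inline image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "warmup"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
        # Only the prefill/image-encode pass matters for warmup,
        # so generate as little as possible.
        "max_tokens": 1,
    }

# To actually send it, e.g. with the standard library:
# import urllib.request
# req = urllib.request.Request(
#     SERVER_URL,
#     data=json.dumps(build_warmup_payload(IMAGE_PATH)).encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

This is the same shape of request the webui sends; whether plain text plus one image is sufficient to reserve the encoder buffers is exactly what this issue is probing.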
Reproducing it:

  1. Start the server with --no-warmup. I used Qwen3-VL/LFM2-VL (llama-server.exe --model "qwen3vl2b.gguf" --mmproj "q41_mmproj.gguf" -c 4000 --no-warmup)
  2. Use the webui, or make an OpenAI-API-compliant chat call to it with an image

Dummy image: [image attachment]

Removing --no-warmup makes it work normally. But it'd be nice to be able to use --no-warmup so that I can specify my own "ideal" image warmup size and reduce reserved memory. (Tested on both the CPU and CUDA backends.)

First Bad Commit

#17652

Relevant log output

slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 20224, n_keep = 0, task.n_tokens = 266
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.015038
slot update_slots: id  3 | task 0 | n_tokens = 4, memory_seq_rm [4, end)
srv  process_chun: processing image...
encoding image slice...
ggml_new_object: not enough space in the context's memory pool (needed 330192, available 16)

Labels

bug (Something isn't working), mtmd (Related to multimodal functionality (video/image/audio))
