Name and Version
load_backend: loaded CUDA backend from C:\Users\metal\OneDrive\Desktop\mar20\src\models\llamacpp_gpu\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\metal\OneDrive\Desktop\mar20\src\models\llamacpp_gpu\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\metal\OneDrive\Desktop\mar20\src\models\llamacpp_gpu\ggml-cpu-alderlake.dll
version: 7222 (746f9ee)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server.exe -m qwen3vl2b.gguf --mmproj q41_mmproj.gguf --no-warmup -c 4000
Problem description & steps to reproduce
Trying out manual warmup with #17652, but it fails on the first request (my warmup call) with the error: ggml_new_object: not enough space in the context's memory pool (needed 330192, available 16)
(I can't check with llama-mtmd-cli.exe, as --no-warmup is not a valid argument there.)
I'm not sure if this is an actual issue or if I just misunderstood the feature. If I understand correctly, I need to:
- Start up the server.
- Make a dummy "warmup" chat call with the desired warmup image size (see the sketch below).
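For concreteness, a minimal sketch of the kind of dummy chat call I mean, assuming the server is listening on its default http://localhost:8080 and that "dummy.png" stands in for the warmup image (the file name and payload values are illustrative; the shape follows the OpenAI-compatible chat completions format):

```python
import base64
import json
import urllib.request

# "dummy.png" is a placeholder name; any image of the desired warmup size should do.
with open("dummy.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "warmup"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{img_b64}"},
                },
            ],
        }
    ],
    "max_tokens": 1,  # only prompt/image processing matters here, not generation
}

# Default llama-server host/port assumed.
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```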
Reproducing it:
- Start up the server with --no-warmup. I used Qwen3-VL/LFM2-VL (llama-server.exe --model "qwen3vl2b.gguf" --mmproj "q41_mmproj.gguf" -c 4000 --no-warmup).
- Use the webui or make an OpenAI-API-compliant chat call to it with an image (like the sketch above).
Dummy image:
Removing --no-warmup makes it work normally. But it would be nice to be able to use --no-warmup so that I can specify my own "ideal" warmup image size for reduced reserved memory. (Tested on the CPU and CUDA backends.)
First Bad Commit
Relevant log output
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 20224, n_keep = 0, task.n_tokens = 266
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.015038
slot update_slots: id 3 | task 0 | n_tokens = 4, memory_seq_rm [4, end)
srv process_chun: processing image...
encoding image slice...
ggml_new_object: not enough space in the context's memory pool (needed 330192, available 16)