Name and Version
version: 8559 (59d8402)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 2060 (laptop version)
Models
Qwen3.5 4B (same happens with other models in this family)
Problem description & steps to reproduce
Despite not running backend sampling (the server is started without --backend-sampling), I hit the following rbudget assert:
llama.cpp/common/sampling.cpp, lines 535 to 542 in 08f2145:

    {
        id = llama_get_sampled_token_ith(ctx, idx);

        if (id != LLAMA_TOKEN_NULL) {
            LOG_DBG("%s: Backend sampler selected token: '%d'. Will not run any CPU samplers\n", __func__, id);

            GGML_ASSERT(!gsmpl->grmr && "using grammar in combination with backend sampling is not supported");
            GGML_ASSERT(!gsmpl->rbudget && "using reasoning budget in combination with backend sampling is not supported");

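
Read as plain logic, the quoted branch aborts only when two things hold at once: llama_get_sampled_token_ith returned a real token, and the sampler has a grammar or a reasoning budget attached. The following is a minimal, self-contained sketch of that condition. It is not the actual llama.cpp code; sampler_state, TOKEN_NULL, sample_path, and classify are hypothetical names introduced purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; not the real llama.cpp types.
using token_id = std::int32_t;
constexpr token_id TOKEN_NULL = -1; // plays the role of LLAMA_TOKEN_NULL

struct sampler_state {
    bool has_grammar;          // stands in for gsmpl->grmr being non-null
    bool has_reasoning_budget; // stands in for gsmpl->rbudget being non-null
};

enum class sample_path { cpu, backend_ok, backend_conflict };

// Classify what the quoted branch would do for a given backend result.
sample_path classify(const sampler_state & s, token_id backend_token) {
    if (backend_token == TOKEN_NULL) {
        return sample_path::cpu; // branch not taken, CPU samplers would run
    }
    if (s.has_grammar || s.has_reasoning_budget) {
        return sample_path::backend_conflict; // where a GGML_ASSERT would abort
    }
    return sample_path::backend_ok; // backend token accepted as-is
}
```

Under this reading, hitting the assert in this report means both conditions held: a non-null token came back from llama_get_sampled_token_ith and rbudget was non-null, even though --backend-sampling was not passed and reasoning = off was configured.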
Run ./llama-server --models-preset models.ini --port 9090 --host 0.0.0.0 with the following models.ini file:
models.ini
version = 1
[*]
no-mmap = true
# note the below may not be desirable if subagents are used
np = 1
# ctk = q8_0
# ctv = q8_0
[Qwen3.5-4B:Instruct-General]
model = /mnt/disk/llms/Qwen3.5-4B-UD-Q4_K_XL.gguf
c = 64000
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
n-predict = 32768
reasoning = off
First Bad Commit
59d8402
Relevant log output
Logs
[47365] llama.cpp/common/sampling.cpp:542: GGML_ASSERT(!gsmpl->rbudget && "using reasoning budget in combination with backend sampling is not supported") failed
[47365] No symbol table is loaded. Use the "file" command.
[47365] Make breakpoint pending on future shared library load? (y or [n]) [answered N; input not from terminal]
[47365] [Thread debugging using libthread_db enabled]
[47365] Using host libthread_db library "/usr/lib/libthread_db.so.1".
[47365] 0x00007f8dbe2adf32 in ?? () from /usr/lib/libc.so.6
[47365] #0 0x00007f8dbe2adf32 in ?? () from /usr/lib/libc.so.6
[47365] #1 0x00007f8dbe2a239c in ?? () from /usr/lib/libc.so.6
[47365] #2 0x00007f8dbe2a23e4 in ?? () from /usr/lib/libc.so.6
[47365] #3 0x00007f8dbe31267f in wait4 () from /usr/lib/libc.so.6
[47365] #4 0x00007f8dc69bc3ab in ggml_print_backtrace () from /home/kris/llama.cpp/build/bin/libggml-base.so.0
[47365] #5 0x00007f8dc69bc510 in ggml_abort () from /home/kris/llama.cpp/build/bin/libggml-base.so.0
[47365] #6 0x0000564ee8b9cd86 in common_sampler_sample(common_sampler*, llama_context*, int, bool) ()
[47365] #7 0x0000564ee89ff938 in server_context_impl::update_slots() ()
[47365] #8 0x0000564ee8a8e45f in server_queue::start_loop(long) ()
[47365] #9 0x0000564ee894d3e2 in main ()
[47365] [Inferior 1 (process 98221) detached]
Additional server startup logs (no --backend-sampling)
srv load: spawning server instance with name=Qwen3.5-4B:Instruct-General on port 47365
srv load: spawning server instance with args:
srv load: llama.cpp/build/bin/llama-server
srv load: --host
srv load: 127.0.0.1
srv load: --min-p
srv load: 0.0
srv load: --no-mmap
srv load: --port
srv load: 47365
srv load: --presence-penalty
srv load: 1.5
srv load: --repeat-penalty
srv load: 1.0
srv load: --temperature
srv load: 0.7
srv load: --top-k
srv load: 20
srv load: --top-p
srv load: 0.8
srv load: --alias
srv load: Qwen3.5-4B:Instruct-General
srv load: --ctx-size
srv load: 64000
srv load: --model
srv load: /mnt/disk/llms/Qwen3.5-4B-UD-Q4_K_XL.gguf
srv load: --n-predict
srv load: 32768
srv load: --parallel
srv load: 1
srv load: --reasoning
srv load: off
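
The startup log above shows how the preset keys end up as long flags on the spawned instance (c becomes --ctx-size, np becomes --parallel, temp becomes --temperature, and no-mmap = true becomes a bare --no-mmap). Below is a rough sketch of that expansion, useful for checking that nothing in this models.ini could produce a --backend-sampling flag. It is an assumption-laden illustration, not llama-server's actual implementation: preset_to_args is a hypothetical helper, and the alias table contains only the three short-to-long pairs observable in the log.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: expands preset key/value pairs into CLI arguments.
// Only the aliases visible in the startup log are modelled here.
std::vector<std::string> preset_to_args(
        const std::vector<std::pair<std::string, std::string>> & preset) {
    static const std::map<std::string, std::string> aliases = {
        {"c", "ctx-size"}, {"np", "parallel"}, {"temp", "temperature"},
    };
    std::vector<std::string> args;
    for (const auto & kv : preset) {
        const auto it = aliases.find(kv.first);
        const std::string name = (it != aliases.end()) ? it->second : kv.first;
        args.push_back("--" + name);
        if (kv.second != "true") { // boolean switches such as no-mmap carry no value
            args.push_back(kv.second);
        }
    }
    return args;
}
```

Feeding it the keys from the preset yields only flags derived from those keys, which matches the log: reasoning = off arrives as --reasoning off, and no --backend-sampling argument appears.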