Name and Version
.\llamacpp\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5228 (44cd8d91)
built with MSVC 19.29.30159.0 for
Operating systems
Windows
GGML backends
CUDA
Hardware
- CPU: Ryzen 7900X
- CUDA0 RTX 4090 @ x16 - primary GPU for video output
- CUDA1 RTX 3090 @ x4
- CUDA2 RTX 3090 @ x1
- RAM: 64 GB DDR5 @ 6000 MT/s
Models
Qwen3-30B-A3B-Q6_K from https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF
Problem description & steps to reproduce
Using the CUDA backend I get only 40-50 t/s generation speed.
Here are the parameters:
./llamacpp/llama-server.exe
--jinja
--flash-attn
--no-mmap
--no-warmup
--host 0.0.0.0
--port 5107
--metrics
--slots
-m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
-ngl 99
--ctx-size 65536
-ctk q8_0
-ctv q8_0
-dev 'CUDA1,CUDA2'
-ts 100,100
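The t/s numbers come from the per-request timings llama-server prints (see the log output below); a single request against the OpenAI-compatible endpoint is enough to reproduce them. A minimal sketch, assuming PowerShell and that the server is reached on localhost; the prompt and max_tokens are placeholders:
Invoke-RestMethod -Method Post -Uri http://localhost:5107/v1/chat/completions `
  -ContentType 'application/json' `
  -Body '{"messages":[{"role":"user","content":"Write a short essay about GPUs."}],"max_tokens":512}'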
With the Vulkan backend I get 80-90 t/s generation speed with:
./llamacpp/vulkan/llama-server.exe
--jinja
--flash-attn
--no-mmap
--no-warmup
--host 0.0.0.0
--port 5107
--metrics
--slots
-m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
-ngl 99
--ctx-size 65536
-ctk q8_0
-ctv q8_0
-dev 'VULKAN1,VULKAN2'
-ts 100,100
-b 384
-ub 512
However, with a batch size larger than 384 I get an error about an incorrect size and a BSOD pointing to video memory issues, which never happens with CUDA. I've tested the VRAM with memtest_vulkan-v0.5.0
and everything was fine.
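For anyone reproducing this, the comparison can also be run without the server layer via llama-bench. A sketch, assuming llama-bench.exe sits next to each llama-server.exe build, and that -ts 0/100/100 keeps the load off the 4090 the same way -dev 'CUDA1,CUDA2' / 'VULKAN1,VULKAN2' does; the -p/-n values are arbitrary:
.\llamacpp\llama-bench.exe -m .\models\Qwen3-30B-A3B-128K-Q6_K.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -ts 0/100/100 -p 512 -n 128
.\llamacpp\vulkan\llama-bench.exe -m .\models\Qwen3-30B-A3B-128K-Q6_K.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -ts 0/100/100 -p 512 -n 128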
First Bad Commit
No response
Relevant log output
CUDA
main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 1219
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 1219, n_tokens = 1219, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 1219, n_tokens = 1219
slot release: id 0 | task 0 | stop processing: n_past = 1738, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2903.47 ms / 1219 tokens ( 2.38 ms per token, 419.84 tokens per second)
eval time = 11284.06 ms / 520 tokens ( 21.70 ms per token, 46.08 tokens per second)
total time = 14187.52 ms / 1739 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
VULKAN
[llama-swap] 192.168.1.5 [2025-04-30 16:00:35] "POST /v1/chat/completions HTTP/1.1" 200 117331 "Python/3.11 aiohttp/3.11.11" 27.9262686s
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 492 | processing task
slot update_slots: id 0 | task 492 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 349
slot update_slots: id 0 | task 492 | kv cache rm [3, end)
slot update_slots: id 0 | task 492 | prompt processing progress, n_past = 349, n_tokens = 346, progress = 0.991404
slot update_slots: id 0 | task 492 | prompt done, n_past = 349, n_tokens = 346
slot release: id 0 | task 492 | stop processing: n_past = 9757, truncated = 0
slot print_timing: id 0 | task 492 |
prompt eval time = 358.06 ms / 346 tokens ( 1.03 ms per token, 966.33 tokens per second)
eval time = 135226.91 ms / 9409 tokens ( 14.37 ms per token, 69.58 tokens per second)
total time = 135584.97 ms / 9755 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200