Eval bug: Qwen3 30B A3B is slow with CUDA #13211

Open
@Nepherpitou

Description


Name and Version

.\llamacpp\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5228 (44cd8d91)
built with MSVC 19.29.30159.0 for

Operating systems

Windows

GGML backends

CUDA

Hardware

  • CPU: Ryzen 7900X
  • CUDA0 RTX 4090 @ x16 - primary GPU for video output
  • CUDA1 RTX 3090 @ x4
  • CUDA2 RTX 3090 @ x1
  • RAM: 64 GB DDR5 @ 6000 MT/s

Models

Qwen3-30B-A3B-Q6_K from https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF

Problem description & steps to reproduce

With the CUDA backend I get only 40-50 t/s generation speed.
Here are the parameters:

      ./llamacpp/llama-server.exe
      --jinja
      --flash-attn
      --no-mmap
      --no-warmup
      --host 0.0.0.0
      --port 5107
      --metrics
      --slots
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      --ctx-size 65536
      -ctk q8_0
      -ctv q8_0
      -dev 'CUDA1,CUDA2'
      -ts 100,100

With the Vulkan backend I get 80-90 t/s generation speed with:

      ./llamacpp/vulkan/llama-server.exe
      --jinja
      --flash-attn
      --no-mmap
      --no-warmup
      --host 0.0.0.0
      --port 5107
      --metrics
      --slots
      -m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
      -ngl 99
      --ctx-size 65536
      -ctk q8_0
      -ctv q8_0
      -dev 'VULKAN1,VULKAN2'
      -ts 100,100
      -b 384
      -ub 512

But! With a batch size greater than 384 I get an incorrect-size error and a BSOD pointing to video memory issues, which never happens with CUDA. I've tested the VRAM with memtest_vulkan-v0.5.0 and everything was fine.

First Bad Commit

No response

Relevant log output

CUDA

main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 1219
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1219, n_tokens = 1219, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1219, n_tokens = 1219
slot      release: id  0 | task 0 | stop processing: n_past = 1738, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    2903.47 ms /  1219 tokens (    2.38 ms per token,   419.84 tokens per second)
       eval time =   11284.06 ms /   520 tokens (   21.70 ms per token,    46.08 tokens per second)
      total time =   14187.52 ms /  1739 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200


VULKAN

[llama-swap] 192.168.1.5 [2025-04-30 16:00:35] "POST /v1/chat/completions HTTP/1.1" 200 117331 "Python/3.11 aiohttp/3.11.11" 27.9262686s
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 492 | processing task
slot update_slots: id  0 | task 492 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 349
slot update_slots: id  0 | task 492 | kv cache rm [3, end)
slot update_slots: id  0 | task 492 | prompt processing progress, n_past = 349, n_tokens = 346, progress = 0.991404
slot update_slots: id  0 | task 492 | prompt done, n_past = 349, n_tokens = 346
slot      release: id  0 | task 492 | stop processing: n_past = 9757, truncated = 0
slot print_timing: id  0 | task 492 |
prompt eval time =     358.06 ms /   346 tokens (    1.03 ms per token,   966.33 tokens per second)
       eval time =  135226.91 ms /  9409 tokens (   14.37 ms per token,    69.58 tokens per second)
      total time =  135584.97 ms /  9755 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
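
For reference, the tokens-per-second figures in the two logs follow directly from the token counts and elapsed times printed on the `eval time` lines. A quick sanity check (values copied from the logs above):

```python
# Sanity check: reproduce the tokens-per-second figures from the server logs.
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/s given a token count and elapsed time in ms."""
    return n_tokens / (elapsed_ms / 1000.0)

# CUDA eval:   520 tokens in 11284.06 ms
cuda_tps = tokens_per_second(520, 11284.06)
# Vulkan eval: 9409 tokens in 135226.91 ms
vulkan_tps = tokens_per_second(9409, 135226.91)

print(f"CUDA:   {cuda_tps:.2f} t/s")   # ~46.08 t/s
print(f"Vulkan: {vulkan_tps:.2f} t/s") # ~69.58 t/s
```

So even with the longer 9409-token generation on Vulkan, the Vulkan backend is roughly 1.5x faster at token generation than CUDA on the same two RTX 3090s.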
