Name and Version
.\llamacpp\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5228 (44cd8d91)
built with MSVC 19.29.30159.0 for
Operating systems
Windows
GGML backends
CUDA
Hardware
- CPU: Ryzen 7900X
- CUDA0 RTX 4090 @ x16 - primary GPU for video output
- CUDA1 RTX 3090 @ x4
- CUDA2 RTX 3090 @ x1
- RAM: 64 GB DDR5 @ 6000 MT/s
Models
Qwen3-30B-A3B-Q6_K from https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF
Problem description & steps to reproduce
Using the CUDA backend I get only 40-50 t/s generation speed.
Here are the parameters:
./llamacpp/llama-server.exe
--jinja
--flash-attn
--no-mmap
--no-warmup
--host 0.0.0.0
--port 5107
--metrics
--slots
-m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
-ngl 99
--ctx-size 65536
-ctk q8_0
-ctv q8_0
-dev 'CUDA1,CUDA2'
-ts 100,100
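The t/s numbers come from the per-request timings llama-server prints (see the log output below); a single request against the OpenAI-compatible endpoint is enough to reproduce them. A minimal sketch, assuming PowerShell and that the server is reached on localhost; the prompt and max_tokens are placeholders:
Invoke-RestMethod -Method Post -Uri http://localhost:5107/v1/chat/completions `
  -ContentType 'application/json' `
  -Body '{"messages":[{"role":"user","content":"Write a short essay about GPUs."}],"max_tokens":512}'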
With the Vulkan backend I get 80-90 t/s generation speed with:
./llamacpp/vulkan/llama-server.exe
--jinja
--flash-attn
--no-mmap
--no-warmup
--host 0.0.0.0
--port 5107
--metrics
--slots
-m ./models/Qwen3-30B-A3B-128K-Q6_K.gguf
-ngl 99
--ctx-size 65536
-ctk q8_0
-ctv q8_0
-dev 'VULKAN1,VULKAN2'
-ts 100,100
-b 384
-ub 512
However, with a batch size larger than 384 I get an error about an incorrect size and a BSOD pointing to video memory issues, which never happens with CUDA. I've tested the VRAM with memtest_vulkan-v0.5.0
and everything was fine.
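For anyone reproducing this, the comparison can also be run without the server layer via llama-bench. A sketch, assuming llama-bench.exe sits next to each llama-server.exe build, and that -ts 0/100/100 keeps the load off the 4090 the same way -dev 'CUDA1,CUDA2' / 'VULKAN1,VULKAN2' does; the -p/-n values are arbitrary:
.\llamacpp\llama-bench.exe -m .\models\Qwen3-30B-A3B-128K-Q6_K.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -ts 0/100/100 -p 512 -n 128
.\llamacpp\vulkan\llama-bench.exe -m .\models\Qwen3-30B-A3B-128K-Q6_K.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -ts 0/100/100 -p 512 -n 128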
First Bad Commit
No response
Relevant log output
CUDA
main: server is listening on http://0.0.0.0:5107 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 1219
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 1219, n_tokens = 1219, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 1219, n_tokens = 1219
slot release: id 0 | task 0 | stop processing: n_past = 1738, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2903.47 ms / 1219 tokens ( 2.38 ms per token, 419.84 tokens per second)
eval time = 11284.06 ms / 520 tokens ( 21.70 ms per token, 46.08 tokens per second)
total time = 14187.52 ms / 1739 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
VULKAN
[llama-swap] 192.168.1.5 [2025-04-30 16:00:35] "POST /v1/chat/completions HTTP/1.1" 200 117331 "Python/3.11 aiohttp/3.11.11" 27.9262686s
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 492 | processing task
slot update_slots: id 0 | task 492 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 349
slot update_slots: id 0 | task 492 | kv cache rm [3, end)
slot update_slots: id 0 | task 492 | prompt processing progress, n_past = 349, n_tokens = 346, progress = 0.991404
slot update_slots: id 0 | task 492 | prompt done, n_past = 349, n_tokens = 346
slot release: id 0 | task 492 | stop processing: n_past = 9757, truncated = 0
slot print_timing: id 0 | task 492 |
prompt eval time = 358.06 ms / 346 tokens ( 1.03 ms per token, 966.33 tokens per second)
eval time = 135226.91 ms / 9409 tokens ( 14.37 ms per token, 69.58 tokens per second)
total time = 135584.97 ms / 9755 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200