Eval bug: Request for Qwen3-Next-80B-A3B Vulkan Inference Optimization #17751

### Name and Version

<details><summary>Details</summary>


llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 7243 (13628d8bd)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu


</details> 

### Operating systems

Linux

### GGML backends

Vulkan

### Hardware
<details><summary>Details</summary>


Device Name AMD Radeon Graphics
PCI (domain:bus:dev.func) 0000:03:00.0
DeviceID:RevID 0x15E7.0xC1
OpenGL Driver Version Mesa 25.3.0 - kisak-mesa PPA
gfx_target_version gfx90c

GPU Type APU
Family Raven (RV)
ASIC Name Renoir
Chip Class GFX9
Shader Engine (SE) 1
Shader Array (SA/SH) per SE 1
CU per SA 8
Total CU 8
RenderBackendPlus (RB+) 2 (16 ROPs)
Peak Pixel Fill-Rate 32 GP/s
GPU Clock 200-2000 MHz
Peak FP32 2048 GFLOPS

VRAM Type DDR4
VRAM Bit Width 128-bit
VRAM Vendor Unknown
VRAM Size 16384 MiB
Memory Clock 400-1333 MHz
ResizableBAR Enabled
ECC Memory Not Supported

L1 Cache (per CU) 16 KiB
L2 Cache 1024 KiB (4 Banks)

Supported Power Profiles[
 "3D_FULL_SCREEN",
 "VIDEO",
 "VR",
 "COMPUTE",
 "CUSTOM",
]


</details> 
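As a sanity check on the hardware dump above, the listed peak FP32 figure can be reproduced from the CU count and clock, and a theoretical memory bandwidth can be estimated from the bus width and memory clock. This is only a rough sketch: the 64 FP32 lanes per CU, the 2 FLOPs per lane per clock (one fused multiply-add), and the double-data-rate factor are assumptions that are not stated in the dump, and the function names are made up for illustration.

```python
# Values copied from the Hardware section of this issue.
TOTAL_CU = 8
GPU_CLOCK_MHZ = 2000      # upper end of "GPU Clock 200-2000 MHz"
MEM_CLOCK_MHZ = 1333      # upper end of "Memory Clock 400-1333 MHz"
BUS_WIDTH_BITS = 128

# Assumptions (NOT stated in the issue):
LANES_PER_CU = 64         # FP32 lanes per GCN compute unit
FLOPS_PER_LANE = 2        # one fused multiply-add per clock = 2 FLOPs
TRANSFERS_PER_CLOCK = 2   # DDR memory: two transfers per memory clock


def peak_fp32_gflops(cus, clock_mhz):
    """Theoretical FP32 peak in GFLOPS for the given CU count and clock."""
    return cus * LANES_PER_CU * FLOPS_PER_LANE * clock_mhz / 1000


def peak_bandwidth_gb_s(mem_clock_mhz, bus_width_bits):
    """Theoretical memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * TRANSFERS_PER_CLOCK * bytes_per_transfer / 1e9


if __name__ == "__main__":
    print(f"peak FP32: {peak_fp32_gflops(TOTAL_CU, GPU_CLOCK_MHZ):.0f} GFLOPS")
    print(f"peak bandwidth: {peak_bandwidth_gb_s(MEM_CLOCK_MHZ, BUS_WIDTH_BITS):.1f} GB/s")
```

Under these assumptions the FP32 figure matches the 2048 GFLOPS listed above. The bandwidth number is a theoretical ceiling only; since this is a UMA APU, the GPU shares that memory with the CPU.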

### Models

Qwen3-Next-80B-A3B

### Problem description & steps to reproduce

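The current Vulkan implementation of Qwen3-Next-80B-A3B does not appear to be optimized: it is much slower than other Qwen A3B models on the same hardware and build. In the llama-bench runs below, Qwen3-30B-A3B-Thinking-2507 reaches about 89 t/s (pp512, ubatch 512) and about 21 t/s (tg128), while Qwen3-Next-80B-A3B-Instruct reaches about 35 t/s and about 8.8 t/s. Over the coming weeks or months, as your time permits, please consider optimizing it.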

### First Bad Commit

N/A 

### Relevant log output

**Qwen3-Next-80B-A3B-Instruct llama-bench**
<details><summary>Details</summary>


```shell
bash  llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Next/Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 --ubatch-size 128,512 --batch-size 2048 --mmap 0 -fa 0,1 --prio 3
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 0 | 0 | pp512 | 32.24 ± 0.45 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 0 | 0 | tg128 | 8.79 ± 0.00 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 1 | 0 | pp512 | 32.28 ± 0.51 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 1 | 0 | tg128 | 8.82 ± 0.01 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 0 | 0 | pp512 | 35.20 ± 0.23 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 0 | 0 | tg128 | 8.80 ± 0.01 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | pp512 | 35.16 ± 0.23 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | tg128 | 8.79 ± 0.01 |

build: 13628d8bd (7243)
```


</details> 


**Qwen3-30B-A3B-Thinking-2507 llama-bench: more than double the speed for pp (at ubatch 512) and tg**


<details><summary>Details</summary>


```shell
bash  llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Think-A3B-GGUF/Qwen3-30B-A3B-Thinking-2507-UD-Q4_K_XL.gguf -ngl 99 --ubatch-size 128,512 --batch-size 2048 --mmap 0 -fa 0,1 --prio 3
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 0 | 0 | pp512 | 55.86 ± 0.91 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 0 | 0 | tg128 | 20.83 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 1 | 0 | pp512 | 53.52 ± 0.67 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 1 | 0 | tg128 | 20.72 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 0 | 0 | pp512 | 89.25 ± 0.20 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 0 | 0 | tg128 | 20.87 ± 0.01 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 1 | 0 | pp512 | 84.70 ± 0.53 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 1 | 0 | tg128 | 20.75 ± 0.13 |

build: 13628d8bd (7243)
```


</details>
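To make the comparison between the two runs concrete, here is a small sketch that computes the throughput ratio per configuration. The t/s means are copied from the `fa = 0` rows of the two tables above; the dictionary layout and the `speedups` helper are just for illustration and are not part of llama.cpp.

```python
# Mean t/s copied from the fa=0 rows of the two llama-bench tables
# in this issue (build 13628d8bd). Keys are (n_ubatch, test).
QWEN3_NEXT_80B = {
    (128, "pp512"): 32.24,
    (128, "tg128"): 8.79,
    (512, "pp512"): 35.20,
    (512, "tg128"): 8.80,
}
QWEN3_30B = {
    (128, "pp512"): 55.86,
    (128, "tg128"): 20.83,
    (512, "pp512"): 89.25,
    (512, "tg128"): 20.87,
}


def speedups(fast, slow):
    """Return the fast/slow throughput ratio for every key present in both."""
    return {key: fast[key] / slow[key] for key in fast if key in slow}


if __name__ == "__main__":
    ratios = speedups(QWEN3_30B, QWEN3_NEXT_80B)
    for (n_ubatch, test), ratio in sorted(ratios.items()):
        print(f"n_ubatch={n_ubatch:<4} {test}: {ratio:.2f}x")
```

With these figures, Qwen3-30B-A3B is roughly 1.7x (ubatch 128) to 2.5x (ubatch 512) faster on pp512 and about 2.4x faster on tg128, even though both models activate about 3B parameters per token.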
