Description
Name and Version
llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 7243 (13628d8)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
Device Name AMD Radeon Graphics
PCI (domain:bus:dev.func) 0000:03:00.0
DeviceID:RevID 0x15E7.0xC1
OpenGL Driver Version Mesa 25.3.0 - kisak-mesa PPA
gfx_target_version gfx90c
GPU Type APU
Family Raven (RV)
ASIC Name Renoir
Chip Class GFX9
Shader Engine (SE) 1
Shader Array (SA/SH) per SE 1
CU per SA 8
Total CU 8
RenderBackendPlus (RB+) 2 (16 ROPs)
Peak Pixel Fill-Rate 32 GP/s
GPU Clock 200-2000 MHz
Peak FP32 2048 GFLOPS
VRAM Type DDR4
VRAM Bit Width 128-bit
VRAM Vendor Unknown
VRAM Size 16384 MiB
Memory Clock 400-1333 MHz
ResizableBAR Enabled
ECC Memory Not Supported
L1 Cache (per CU) 16 KiB
L2 Cache 1024 KiB (4 Banks)
Supported Power Profiles 3D_FULL_SCREEN, VIDEO, VR, COMPUTE, CUSTOM
Models
Qwen3-Next-80B-A3B
Problem description & steps to reproduce
The current Qwen3-Next-80B-A3B implementation is not optimized: it is much slower than the other Qwen A3B models on the same hardware and backend (see the llama-bench results below). When you have time in the coming weeks or months, please consider optimizing it.
First Bad Commit
N/A
Relevant log output
Qwen3-Next-80B-A3B-Instruct llama-bench
llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Next/Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 --ubatch-size 128,512 --batch-size 2048 --mmap 0 -fa 0,1 --prio 3
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 0 | 0 | pp512 | 32.24 ± 0.45 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 0 | 0 | tg128 | 8.79 ± 0.00 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 1 | 0 | pp512 | 32.28 ± 0.51 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 128 | 1 | 0 | tg128 | 8.82 ± 0.01 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 0 | 0 | pp512 | 35.20 ± 0.23 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 0 | 0 | tg128 | 8.80 ± 0.01 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | pp512 | 35.16 ± 0.23 |
| qwen3next ?B Q4_K - Medium | 42.01 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | tg128 | 8.79 ± 0.01 |
build: 13628d8bd (7243)
Qwen3-30B-A3B-Thinking-2507 llama-bench
For comparison, this model is more than twice as fast on token generation (tg128) and roughly 1.7x to 2.5x as fast on prompt processing (pp512), depending on the ubatch size.
llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Think-A3B-GGUF/Qwen3-30B-A3B-Thinking-2507-UD-Q4_K_XL.gguf -ngl 99 --ubatch-size 128,512 --batch-size 2048 --mmap 0 -fa 0,1 --prio 3
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 0 | 0 | pp512 | 55.86 ± 0.91 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 0 | 0 | tg128 | 20.83 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 1 | 0 | pp512 | 53.52 ± 0.67 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 128 | 1 | 0 | tg128 | 20.72 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 0 | 0 | pp512 | 89.25 ± 0.20 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 0 | 0 | tg128 | 20.87 ± 0.01 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 1 | 0 | pp512 | 84.70 ± 0.53 |
| qwen3moe 30B.A3B Q4_K - Medium | 16.49 GiB | 30.53 B | Vulkan | 99 | 512 | 1 | 0 | tg128 | 20.75 ± 0.13 |
build: 13628d8bd (7243)
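To make the gap between the two runs easier to read, here is a small sketch that computes the per-configuration speedup of Qwen3-30B-A3B over Qwen3-Next-80B-A3B. The t/s values are copied from the two tables above; the dictionary and function names (`qwen3next`, `qwen3moe`, `speedups`) are only illustrative and are not part of llama.cpp.

```python
# t/s means copied from the two llama-bench tables in this report.
# Keys are (n_ubatch, fa, test). All names here are illustrative only.
qwen3next = {
    (128, 0, "pp512"): 32.24, (128, 0, "tg128"): 8.79,
    (128, 1, "pp512"): 32.28, (128, 1, "tg128"): 8.82,
    (512, 0, "pp512"): 35.20, (512, 0, "tg128"): 8.80,
    (512, 1, "pp512"): 35.16, (512, 1, "tg128"): 8.79,
}
qwen3moe = {
    (128, 0, "pp512"): 55.86, (128, 0, "tg128"): 20.83,
    (128, 1, "pp512"): 53.52, (128, 1, "tg128"): 20.72,
    (512, 0, "pp512"): 89.25, (512, 0, "tg128"): 20.87,
    (512, 1, "pp512"): 84.70, (512, 1, "tg128"): 20.75,
}


def speedups(fast, slow):
    """Return fast/slow throughput ratio for every shared configuration."""
    return {key: fast[key] / slow[key] for key in slow if key in fast}


ratios = speedups(qwen3moe, qwen3next)
for (ubatch, fa, test), ratio in sorted(ratios.items()):
    print(f"ubatch={ubatch:<4} fa={fa} {test}: {ratio:.2f}x")
```

On these numbers, every tg128 configuration and both pp512 runs at ubatch 512 come out above 2x; only pp512 at ubatch 128 falls below 2x. The ratios ignore the reported standard deviations, so treat them as rough indicators rather than precise measurements.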