Skip to content

Conversation

@zhang-hui-yulo
Copy link

@zhang-hui-yulo zhang-hui-yulo commented Nov 7, 2025

Add RDNA4 tensor core support for MMF, honestly the performance is lower than expectation. The model is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF

Model Microbatch size Test t/s master t/s 672492fc Speedup
qwen3 8B Q8_0 1 pp512 46.48 54.61 1.18
qwen3 8B Q8_0 2 pp512 89.96 85.92 0.96
qwen3 8B Q8_0 3 pp512 132.92 126.23 0.95
qwen3 8B Q8_0 4 pp512 176.06 166.12 0.94
qwen3 8B Q8_0 5 pp512 212.00 197.77 0.93
qwen3 8B Q8_0 6 pp512 252.54 233.83 0.93
qwen3 8B Q8_0 7 pp512 289.87 266.58 0.92
qwen3 8B Q8_0 8 pp512 318.56 290.63 0.91
qwen3 8B Q8_0 9 pp512 344.41 314.93 0.91
qwen3 8B Q8_0 10 pp512 377.97 342.75 0.91
qwen3 8B Q8_0 11 pp512 416.42 373.85 0.90
qwen3 8B Q8_0 12 pp512 447.61 398.83 0.89
qwen3 8B Q8_0 13 pp512 486.83 429.74 0.88
qwen3 8B Q8_0 14 pp512 525.24 458.88 0.87
qwen3 8B Q8_0 15 pp512 555.91 482.08 0.87
qwen3 8B Q8_0 16 pp512 580.07 512.47 0.88

@JohannesGaessler, looks like that #16988 changes mmf.cu to

for (size_t i = 0; i < GGML_MAX_DIMS; ++i) {
    if (src0_nb[i] % (2*ts) != 0) {
        return false;
    }
}

then native mmf won't be excised on my RDNA4, it always uses hipblas path.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Nov 7, 2025
@JohannesGaessler
Copy link
Collaborator

honestly the performance is lower than expectation.

On RDNA the WMMA instructions do to my knowledge not increase peak FLOPS, they only reduce I/O and register usage.

then native mmf won't be excised on my RDNA4, it always uses hipblas path.

Yes sorry, that was a bug that I introduced.

@zhang-hui-yulo
Copy link
Author

zhang-hui-yulo commented Nov 8, 2025

honestly the performance is lower than expectation.

On RDNA the WMMA instructions do to my knowledge not increase peak FLOPS, they only reduce I/O and register usage.

then native mmf won't be excised on my RDNA4, it always uses hipblas path.

Yes sorry, that was a bug that I introduced.

Thank you for the tip, AFAIK, tensor core on RDNA3 uses the same silicon of vector instructions, RDNA4 redesigns the tensor core and makes it more like CDNA.

But at least it shall not be slower than hipblas, I shall spend sometime to find out the root cause, at least I know that hip compiler doesn't acquire register very well.

@JohannesGaessler
Copy link
Collaborator

Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts or whatever you would call it for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated ldmatrix instruction can be used to load 4 bytes per thread with groups of 4 threads making a single 16 byte load from shared memory. If you just load 4 byte chunks with the regular indices provided by get_i and get_j you end up with the memory accesses going to only 16 out of (I think) the 32 shared memory banks and you only get 50% of the memory bandwidth.

@zhang-hui-yulo
Copy link
Author

Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts or whatever you would call it for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated ldmatrix instruction can be used to load 4 bytes per thread with groups of 4 threads making a single 16 byte load from shared memory. If you just load 4 byte chunks with the regular indices provided by get_i and get_j you end up with the memory accesses going to only 16 out of (I think) the 32 shared memory banks and you only get 50% of the memory bandwidth.

Thank you for the tips, there is little info for AMD bank layout, based on the limited document I have, RDNA3 has 32 banks in CU mode and 54 banks in WGP mode, WGP mode is the default one, the bank width is DWORD, I don't have any doc for RDNA4 so I assume it shall be similar as before, so I don't change any code logic in mmf, just adapter the wmma instruction.

Based on the wmma layout of RDNA4, I just keep the old ldmatrix logic and use vectorized load in load_generic, honestly I'm not sure if shared memory bank conflict is the root cause.

wmma_f16_16x16x16_f16_w32_gfx12

@JohannesGaessler
Copy link
Collaborator

Are you aware of the AMD ISA documentation?

@zhang-hui-yulo
Copy link
Author

Are you aware of the AMD ISA documentation?

Honestly, not very much as it isn't friendly for software developer, based on the gemm benchmark on my modified nvidia cute for RDNA3, the bank layout is same as nvidia ampere.

Just have a check on RDNA4 ISA

1.2.2.1. Local Data Share (LDS)

...
Each work-group processor (WGP) has a 128kB memory space that enables low-latency communication
 between work-items within a work-group, or the work-items within a wave; this is the local data share (LDS).
 This memory is configured with 64 banks, each with 512 entries of 4 bytes. 
12.1. Overview
 There are 128kB of memory per work-group processor split up into 64 banks of DWORD-wide RAMs. These 64
 banks are further sub-divided into two sets of 32-banks each where 32 of the banks are affiliated with a pair of
 SIMD32’s, and the other 32 banks are affiliated with the other pair of SIMD32’s within the WGP. Each bank is a
 512x32 two-port RAM (1R/1W per clock cycle). DWORDs are placed in the banks serially, but all banks can
 execute a store or load simultaneously. One work-group can request up to 64kB memory.

So, it's 32 banks in CU mode and 64 banks in WGP mode, the bank width is 4 bytes.

@zhang-hui-yulo
Copy link
Author

I think I've found the root cause, mat_mmf_f is slower than hipblas, mat_mmf_ids is better than hipblas.

MUL_MAT

Backend GGML op Op parameters TFLOPS master TFLOPS mmf_wmma_rdna4 Speedup
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.61 0.61 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.20 1.20 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.65 1.65 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.98 1.79 0.90
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.50 2.18 0.87
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 92.19 93.01 1.01
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.92 3.22 0.82
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],k_v=32832,o=1 1.37 1.37 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],k_v=0,o=1 0.34 0.33 0.98
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.61 0.61 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.21 1.21 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.77 1.77 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.27 2.27 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.67 2.67 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 95.79 95.64 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.03 3.25 0.81
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.31 0.31 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.63 0.63 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.94 0.94 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.25 1.25 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.54 1.54 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.47 3.47 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.30 2.30 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.64 3.65 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.94 5.93 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.00 6.97 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.84 7.83 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.19 8.26 1.01
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.72 74.74 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.37 9.34 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.13 4.13 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.92 6.90 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.43 7.40 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.22 8.21 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.84 8.91 1.01
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 75.09 75.23 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.03 8.91 0.99
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.61 1.60 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.85 2.84 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.97 3.97 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.80 4.80 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.09 5.08 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.15 74.19 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.53 6.52 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.20 2.20 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.85 3.84 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.05 5.05 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.88 5.87 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.62 6.61 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.19 74.25 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.79 7.79 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.66 1.66 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.99 2.99 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.16 4.16 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.04 5.04 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.82 5.82 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.48 74.52 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.90 6.90 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.56 1.56 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.88 2.87 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.04 4.04 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.02 5.01 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.86 5.84 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.34 73.21 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.06 7.04 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.14 2.13 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.79 3.79 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.08 5.08 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.81 5.80 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.85 6.84 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.37 73.46 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.69 7.68 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.79 3.79 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.48 5.48 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.24 7.21 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.77 8.70 0.99
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.54 8.47 0.99
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.52 72.51 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.25 9.22 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.76 3.76 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.54 6.52 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.94 8.93 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.69 9.66 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 10.25 10.21 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.60 72.56 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 10.13 10.11 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.56 3.56 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.08 5.05 0.99
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.15 7.12 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.53 8.54 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.72 8.70 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.92 73.95 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.13 9.13 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.86 2.85 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.69 3.68 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.02 4.00 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.20 4.19 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.32 4.30 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.18 72.15 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.36 4.36 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.71 1.71 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.82 2.81 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.41 3.41 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.84 3.83 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.10 4.09 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 68.74 68.69 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.41 4.41 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.02 4.02 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.47 5.47 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.35 7.33 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.74 8.73 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.64 8.60 0.99
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.55 72.71 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.50 9.44 0.99
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.95 3.95 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.88 6.86 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.59 7.57 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.02 9.00 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.84 8.68 0.98
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.18 72.13 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.76 9.74 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.70 2.70 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.69 3.68 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.05 4.05 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.24 4.23 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.37 4.35 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.00 71.93 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.57 4.57 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.11 3.10 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.22 5.21 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.46 6.46 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.80 7.79 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.18 8.08 0.99
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 62.94 62.84 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.89 8.88 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.36 3.35 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.59 5.59 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.04 7.02 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.05 8.05 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.77 8.75 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 63.41 63.49 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.54 9.54 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.53 2.53 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.53 3.53 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.90 3.89 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.13 4.12 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.29 4.29 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 69.41 69.38 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.50 4.49 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.84 1.83 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.91 2.90 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.66 3.66 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.15 4.15 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.51 4.50 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 69.82 69.80 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.16 5.16 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.52 2.53 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.63 4.63 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.44 6.43 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.39 7.38 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.80 7.77 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 68.38 68.34 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.03 8.04 1.00

MUL_MAT_ID

Backend GGML op Op parameters TFLOPS master TFLOPS 0ec241d Speedup
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 0.77 0.77 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.46 4.32 2.96
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 2.75 4.59 1.67
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 0.38 1.37 3.60
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 0.13 0.66 4.90
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 5.19 5.34 1.03
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 0.73 2.41 3.29
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 0.16 0.80 4.88
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 0.78 0.77 0.99
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 4.68 5.85 1.25
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 8.87 8.80 0.99
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 1.29 2.45 1.89
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 0.36 0.75 2.08
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 13.96 13.94 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.51 4.32 1.72
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 0.44 0.93 2.13
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 0.67 0.68 1.01
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 0.66 0.65 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 0.90 0.90 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 0.29 0.30 1.03
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 0.13 0.13 0.97
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 1.65 1.65 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 0.47 0.47 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 0.16 0.16 1.04
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 0.73 0.73 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 1.62 1.61 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 2.73 2.70 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 0.90 0.90 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 0.25 0.27 1.06
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 2.48 2.57 1.04
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 1.09 1.05 0.96
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 0.35 0.38 1.09
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.49 1.48 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.30 1.25 0.96
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.59 1.57 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.48 1.53 1.04
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 0.96 0.97 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.40 2.39 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.18 1.20 1.02
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.24 1.32 1.06
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.60 1.60 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.70 3.73 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 5.57 5.60 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 2.62 2.64 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.30 1.14 0.88
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 9.29 9.31 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.70 2.76 1.02
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 1.46 1.61 1.10
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880,o=1 2.87 2.86 1.00
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880,o=1 1.80 1.38 0.77
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880,o=1 15.45 15.13 0.98
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880,o=1 1.92 2.02 1.05
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.13 1.13 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.51 1.44 0.95
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.85 1.85 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 2.00 2.03 1.02
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.54 1.76 1.14
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.88 2.86 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.23 1.27 1.03
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 2.02 2.23 1.10
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 2.15 2.15 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 4.03 3.95 0.98
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 6.40 6.37 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 3.61 3.68 1.02
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.79 1.69 0.94
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 9.89 10.01 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.95 2.83 0.96
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 2.85 2.60 0.91
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.02 1.02 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.25 1.23 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.73 1.73 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.63 1.72 1.06
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.64 1.50 0.92
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.25 2.30 1.02
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 0.98 0.99 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.90 1.98 1.04
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.80 1.80 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.71 3.77 1.02
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 5.07 4.91 0.97
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 2.87 2.92 1.02
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.96 1.90 0.97
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 8.40 7.68 0.91
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.46 2.40 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 2.23 2.54 1.14
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.29 1.29 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.21 1.25 1.03
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.43 1.41 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.21 1.20 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.11 1.12 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 1.73 1.71 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.05 1.08 1.03
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.42 1.40 0.98
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.40 1.39 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.22 3.21 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 3.86 3.89 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 2.18 2.15 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.30 1.35 1.04
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 5.08 5.00 0.98
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.33 2.18 0.94
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 1.67 1.89 1.13
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.06 1.06 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.40 1.48 1.05
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.83 1.80 0.98
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.75 1.91 1.09
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.39 1.32 0.95
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.75 2.75 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.13 1.09 0.97
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.24 1.21 0.98
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.81 1.81 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.74 3.69 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 5.90 5.82 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 3.06 3.10 1.01
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.36 1.55 1.13
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 9.40 9.28 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.77 2.66 0.96
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 1.31 2.01 1.53

@JohannesGaessler
Copy link
Collaborator

How about this: for now we move towards merging this PR but only enable it for MUL_MAT_ID where it's already faster, if in the future it also becomes faster for MUL_MAT we can then enable it for that as well.

@zhang-hui-yulo
Copy link
Author

zhang-hui-yulo commented Nov 9, 2025

How about this: for now we move towards merging this PR but only enable it for MUL_MAT_ID where it's already faster, if in the future it also becomes faster for MUL_MAT we can then enable it for that as well.

Thank you for the support, this is also what I'm thinking, just disable mul_mat_f on RDNA4 first and try to rewrite a RDNA4 optimized version in the future.

Also I presume that hip compiler would generate better code on RDNA3 than RDNA4, I will have a test on my 7900XTX next week.

Anyway, could youplease review it first? One thing that hip compiler cannot handle early return code, it will still compile the code after return

if  constexpr (rdna_not_supported) {
    NO_DEVICE_CODE;
    return;
}

// hip compiler will still compile this
rdna unsupported code like tile<16, 8, float> tile_A

Also, I don't see performance improvement with real model like llama3-8b-fp16 and deepseek-r1-8b-fp16 with batch 1~16, looks like that I need to do the test with batch 512, right?

@zhang-hui-yulo zhang-hui-yulo marked this pull request as ready for review November 9, 2025 10:58
@JohannesGaessler
Copy link
Collaborator

@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.

@zhang-hui-yulo
Copy link
Author

zhang-hui-yulo commented Nov 10, 2025

@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.

Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp internal code, I think my schedule shall be

  1. porting MMF to RDNA3, keep the original logic to see if the performance is good enough.
  2. porting FA to RDNA4, keep the original logic to see if the performance is good enough.
  3. better MMF or FA for RDNA4 or RDNA3.

I will start them once this PR is approved.

Also I suggest you to put FA on RDNA3 to low priority as RDNA3 wmma isn't suitable for gemm fusion, you need shared memory to rearrange the layout for D matrix of QK.

@JohannesGaessler
Copy link
Collaborator

@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support so I'd suggest we do it like this: first I make some changes to the MMA FA kernel during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist where in the code it's necessary to make changes due to the different data layout.

@zhang-hui-yulo
Copy link
Author

zhang-hui-yulo commented Nov 11, 2025

@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support so I'd suggest we do it like this: first I make some changes to the MMA FA kernel during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist where in the code it's necessary to make changes due to the different data layout.

I agree, please move forward first, I will do it based on your changes, anyway I still need sometime to find a good way to support C = B * A for RDNA4 and CDNA3, maybe add a new tile class is enough

template <int I, int J, typename T, bool trans>
class tle : tile <I, J, T> {
    int get_i () {
        if (!trans) {
            return tile <I, J, T>::get_i();
        } else {
            return tile <I, J, T>::get_j();
        }
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants