gemm performance downgrade for small size M and big size N&K #520

abcdrm · 2024-01-10T08:40:11Z

Hi, I ran CLBlast Hgemm and Hgemv on a Qualcomm Adreno 730 GPU with matrix shape M = 8, N = 3200, K = 3200.
Hgemm took 8.05401 ms (average over 1000 runs), which is much slower than calling Hgemv 8 times in a for loop (0.500931 ms per call, about 4 ms in total).
Here is the code calling gemm and gemv:
CLBlastHgemm(CLBlastLayout::CLBlastLayoutRowMajor, CLBlastTranspose::CLBlastTransposeNo, CLBlastTranspose::CLBlastTransposeNo, M, N, K, alpha, A_mat(), 0, K, B_mat(), 0, N, beta, C_mat(), 0, N, &command_queue(), nullptr);

CLBlastHgemv(CLBlastLayout::CLBlastLayoutRowMajor, CLBlastTranspose::CLBlastTransposeYes, K, N, alpha, B_mat(), b_offset * i, N, A_mat(), 0, 1, beta, C_mat(), 0, 1, &command_queue(), nullptr);
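For reference, the equivalence the comparison relies on can be sketched on the CPU: a row-major GEMM C = A * B with M rows gives the same result as M transposed GEMV calls y_i = B^T * x_i, where x_i is row i of A and y_i is row i of C. This is only a minimal sketch in plain C++ with float instead of half; the function names are hypothetical and it does not use CLBlast. In this decomposition the per-row offsets apply to the x and y buffers (A and C) while B is read from offset 0 each time; whether the posted loop intends something different with `b_offset * i` is not clear from the snippet.

```cpp
#include <cstddef>
#include <vector>

// Row-major reference GEMM: C (M x N) = A (M x K) * B (K x N).
std::vector<float> gemm_ref(const std::vector<float>& A, const std::vector<float>& B,
                            std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}

// Transposed GEMV on a row-major K x N matrix: y (length N) = B^T * x (length K).
// x and y point at the start of one row of A and of C respectively.
void gemv_t_ref(const float* B, const float* x, float* y, std::size_t K, std::size_t N) {
    for (std::size_t j = 0; j < N; ++j) y[j] = 0.0f;
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t j = 0; j < N; ++j)
            y[j] += x[k] * B[k * N + j];
}

// GEMM expressed as M GEMV calls; only the x and y offsets change per row.
std::vector<float> gemm_via_gemv(const std::vector<float>& A, const std::vector<float>& B,
                                 std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        gemv_t_ref(B.data(), A.data() + i * K, C.data() + i * N, K, N);
    return C;
}
```

Calling `gemm_via_gemv(A, B, M, N, K)` and `gemm_ref(A, B, M, N, K)` on the same inputs should return identical matrices, which makes this a convenient correctness check before timing the two GPU paths against each other.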

CNugteren · 2024-01-21T10:06:55Z

Thank you for reporting this. Yes, for unusual shapes such as this one, running GEMV multiple times can be faster than running GEMM once. I'm not sure there is much we can do here, since the optimization space is vast.

You should try to run the tuner specifically for these matrix sizes, but most likely it won't be faster than your current GEMV solution.
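When comparing a tuned GEMM against the GEMV loop, it can help to convert the measured times into throughput so that runs are compared on the same scale. The following is a small sketch under the usual assumption of 2 * M * N * K floating-point operations per GEMM; the helper names are hypothetical and not part of CLBlast.

```cpp
#include <cstddef>

// Floating-point operations in one GEMM: one multiply and one add per (i, j, k).
double gemm_flops(std::size_t M, std::size_t N, std::size_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K);
}

// Throughput in GFLOP/s for a run that took `ms` milliseconds.
double gflops(double flops, double ms) {
    return flops / (ms * 1e-3) / 1e9;
}
```

With the timings reported above, `gflops(gemm_flops(8, 3200, 3200), 8.05401)` comes out at roughly 20 GFLOP/s for Hgemm, and `gflops(gemm_flops(8, 3200, 3200), 8 * 0.500931)` at roughly 41 GFLOP/s for the eight Hgemv calls, which perform the same total number of operations.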

CNugteren added the performance label Jan 11, 2024
