Parallelization along K-dimension (Parallel Reduction) for GEMM with small M/N and large K

Hi OpenBLAS team,

I noticed that zgemm (and other GEMM functions) falls back to single-threaded execution when M and N are small (e.g., 32) but K is extremely large (e.g., 1,000,000).

On my many-core system, this leaves most cores idle. With K this large, parallelizing the K-loop (a parallel reduction over partial products) should in theory give a significant speedup. As a workaround, I currently partition the matrices along K myself and call zgemm on the slices from multiple threads, but the performance of this approach is only average.
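To make the workaround concrete, here is a minimal sketch of the K-split I mean, written in plain Python so it is self-contained. It is not OpenBLAS code: `partial_gemm` and `gemm_split_k` are hypothetical names, and `partial_gemm` is only a stand-in for one single-threaded zgemm call on a slice of K. The idea is that C = A·B equals the sum over slices of A[:, k0:k1]·B[k0:k1, :], so each slice can be computed independently into its own M×N buffer and the buffers summed at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gemm(A, B, k0, k1):
    """Return A[:, k0:k1] @ B[k0:k1, :] as a new M x N list of lists.

    Hypothetical stand-in for one single-threaded zgemm call on a K-slice.
    """
    m, n = len(A), len(B[0])
    C = [[0j] * n for _ in range(m)]
    for i in range(m):
        row_a, row_c = A[i], C[i]
        for k in range(k0, k1):
            a, row_b = row_a[k], B[k]
            for j in range(n):
                row_c[j] += a * row_b[j]
    return C


def gemm_split_k(A, B, num_chunks):
    """Compute C = A @ B by splitting K into slices and reducing the partials.

    Assumes A is M x K and B is K x N with K >= 1.
    """
    m, K, n = len(A), len(B), len(B[0])
    num_chunks = max(1, min(num_chunks, K))
    # Slice boundaries: chunk c covers k in [bounds[c], bounds[c + 1]).
    bounds = [K * c // num_chunks for c in range(num_chunks + 1)]

    def work(c):
        return partial_gemm(A, B, bounds[c], bounds[c + 1])

    # Each worker writes only to its own private M x N buffer.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(work, range(num_chunks)))

    # Reduction step: sum the private buffers into the final C.
    C = [[0j] * n for _ in range(m)]
    for P in partials:
        for i in range(m):
            for j in range(n):
                C[i][j] += P[i][j]
    return C
```

Because of the Python GIL this sketch shows only the decomposition, not the speedup. In the real case each `partial_gemm` would be a zgemm call, and the extra cost is one M×N buffer per slice plus the final sum, which is small when M and N are around 32.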

Questions:

1. Does OpenBLAS currently support threading along the K-dimension for this shape?
2. If not, are there any plans to implement parallel reduction for large K?

My Machine Info:

<img width="2244" height="942" alt="Image" src="https://github.com/user-attachments/assets/07a4a434-4b0e-4e62-a868-f385b8b40dd7" />

<img width="1104" height="850" alt="Image" src="https://github.com/user-attachments/assets/e9d98f65-71c7-44de-9cb2-e322cab03f5a" />

Thanks!
