Hi OpenBLAS team,
I noticed that zgemm (and other GEMM functions) falls back to single-threaded execution when M and N are small (e.g., 32) but K is extremely large (e.g., 1,000,000).
On my many-core system, this leaves most cores idle. Since K is so large, parallelizing the K loop (a parallel reduction over partial products) should in theory give a significant speedup. As a workaround, I partition the K dimension myself and call zgemm on each slice from multiple threads, but the resulting performance is only mediocre.
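To make the idea concrete, here is a minimal sketch of the split-K scheme I have in mind. It is an illustration only, not my exact code and not anything taken from OpenBLAS. The names `gemm_slice` and `split_k_gemm` are hypothetical, the layout is assumed to be row-major, and the naive inner kernel merely stands in for a per-slice `cblas_zgemm` call so that the example compiles without linking a BLAS. The OpenMP pragma is optional; without `-fopenmp` the loop simply runs serially.

```c
#include <assert.h>
#include <complex.h>
#include <stdlib.h>

/* Hypothetical stand-in for one zgemm call on a K-slice (row-major):
 * Cpart = A[:, k0:k1] * B[k0:k1, :].
 * With a real BLAS this would be a cblas_zgemm call using A + k0 with
 * lda = K and B + k0*N with ldb = N. */
static void gemm_slice(int M, int N, int K, int k0, int k1,
                       const double complex *A, const double complex *B,
                       double complex *Cpart)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            double complex acc = 0.0;
            for (int k = k0; k < k1; ++k)
                acc += A[(size_t)i * K + k] * B[(size_t)k * N + j];
            Cpart[(size_t)i * N + j] = acc;
        }
    }
}

/* Split-K GEMM: C = A * B, computed as the sum of nchunks partial products.
 * Each chunk writes to its own M x N buffer, so the chunks are independent
 * and can run on separate threads. Returns 0 on success, -1 on allocation
 * failure. */
int split_k_gemm(int M, int N, int K, int nchunks,
                 const double complex *A, const double complex *B,
                 double complex *C)
{
    assert(M > 0 && N > 0 && K > 0 && nchunks > 0);

    size_t mn = (size_t)M * N;
    double complex *partial = calloc((size_t)nchunks * mn, sizeof *partial);
    if (!partial)
        return -1;

    /* Step 1: one partial product per K-slice (parallel if OpenMP is on). */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < nchunks; ++c) {
        int k0 = (int)((long long)K * c / nchunks);
        int k1 = (int)((long long)K * (c + 1) / nchunks);
        gemm_slice(M, N, K, k0, k1, A, B, partial + (size_t)c * mn);
    }

    /* Step 2: reduce the partial products into C. */
    for (size_t e = 0; e < mn; ++e) {
        double complex sum = 0.0;
        for (int c = 0; c < nchunks; ++c)
            sum += partial[(size_t)c * mn + e];
        C[e] = sum;
    }

    free(partial);
    return 0;
}
```

With M = N = 32 the reduction step touches only `nchunks * 1024` elements, so it should be cheap compared with the K = 1,000,000 products. That is why I expected this shape to scale better than it does in practice.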
Questions:
Does OpenBLAS currently support threading along the K-dimension for this shape?
If not, are there any plans to implement parallel reduction for large K?
My Machine Info:
Thanks!