Description
My application performs many small DGEMMs, each invoked from a separate thread (via a task pool). As recommended, I compiled OpenBLAS 0.3.10 with USE_THREAD=0 and USE_LOCKING=1. This is on a Cavium ThunderX2 with gcc 9.2.0.
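For reference, the build was along these lines (USE_THREAD and USE_LOCKING are the flags from OpenBLAS's Makefile.rule; the TARGET value is the ThunderX2 entry from TargetList.txt, and the install PREFIX is just an example):

```shell
# Single-threaded OpenBLAS with locking enabled, so that concurrent
# callers from my own thread pool are safe:
#   USE_THREAD=0  - disable OpenBLAS's internal threading
#   USE_LOCKING=1 - add mutexes around the shared static buffers
make USE_THREAD=0 USE_LOCKING=1 TARGET=THUNDERX2T99 CC=gcc
make PREFIX=/opt/openblas-serial install   # example install location
```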
The scaling with threads is really bad, presumably because of the locking. Replacing the OpenBLAS dgemm call with a hand-written kernel using NEON intrinsics gives comparable single-thread performance, but with 30 threads pinned to the cores of a single socket (which has 32 physical cores) the hand-written code is about 6x faster, entirely due to superior thread scaling.
The dominant DGEMM sizes are (12,144)^T (12,12) and (16,256)^T (16,16).
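For concreteness, the first of those shapes is C = A^T B with A stored 12x144 and B 12x12, yielding a 144x12 result. A plain-C reference of the semantics (not the optimized kernel, just what each per-thread call computes) would be:

```c
/* Reference for the dominant shape: C = A^T * B, row-major.
   A is stored 12x144, so A^T is 144x12; B is 12x12; C is 144x12.
   These bounds are the (12,144)^T (12,12) case from the report. */
enum { K = 12, M = 144, N = 12 };

static void dgemm_at_b(const double A[K][M], const double B[K][N],
                       double C[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int k = 0; k < K; ++k)
                s += A[k][i] * B[k][j];   /* transposed access into A */
            C[i][j] = s;
        }
}
```

Each task-pool thread issues many independent calls of exactly this shape, which is why any serialization inside the library dominates the runtime.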
I wonder if there is a lower-level interface to your optimized small-matrix kernels that I could invoke to bypass the static memory buffers that need the lock protection?
Also, please note that the default Red Hat EL 8.3 openblas_serial package is not compiled with USE_LOCKING and so produces incorrect results in this use case.
Finally, many thanks for OpenBLAS ... it is a tremendously valuable tool and I appreciate the effort it takes to make it happen.
Thanks
Robert