Description
My application performs many small DGEMMs, each invoked from a separate thread (via a task pool). As recommended, I compiled OpenBLAS 0.3.10 with USE_THREAD=0 and USE_LOCKING=1. This is on a Cavium ThunderX2 with gcc 9.2.0.
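For reference, the build was along these lines (USE_THREAD and USE_LOCKING are the flags from OpenBLAS's Makefile.rule; the TARGET value is the ThunderX2 entry from TargetList.txt, and the install PREFIX is just an example):

```shell
# Single-threaded OpenBLAS with locking enabled, so that concurrent
# callers from my own thread pool are safe:
#   USE_THREAD=0  - disable OpenBLAS's internal threading
#   USE_LOCKING=1 - add mutexes around the shared static buffers
make USE_THREAD=0 USE_LOCKING=1 TARGET=THUNDERX2T99 CC=gcc
make PREFIX=/opt/openblas-serial install   # example install location
```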
The scaling with threads is really bad, presumably because of the locking. Replacing the OpenBLAS dgemm call with a hand-written kernel using NEON intrinsics gives comparable single-thread performance, but with 30 threads pinned to the cores of a single socket (which has 32 physical cores) the hand-written code is about 6x faster, entirely due to superior thread scaling.
The dominant DGEMM sizes are (12,144)^T (12,12) and (16,256)^T (16,16).
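For concreteness, the first of those shapes is C = A^T B with A stored 12x144 and B 12x12, yielding a 144x12 result. A plain-C reference of the semantics (not the optimized kernel, just what each per-thread call computes) would be:

```c
/* Reference for the dominant shape: C = A^T * B, row-major.
   A is stored 12x144, so A^T is 144x12; B is 12x12; C is 144x12.
   These bounds are the (12,144)^T (12,12) case from the report. */
enum { K = 12, M = 144, N = 12 };

static void dgemm_at_b(const double A[K][M], const double B[K][N],
                       double C[M][N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int k = 0; k < K; ++k)
                s += A[k][i] * B[k][j];   /* transposed access into A */
            C[i][j] = s;
        }
}
```

Each task-pool thread issues many independent calls of exactly this shape, which is why any serialization inside the library dominates the runtime.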
I wonder if there is a lower-level interface to your optimized small-matrix kernels that I could invoke to bypass the static memory buffers that need the lock protection?
Also, please note that the default Red Hat EL 8.3 openblas_serial package is not compiled with USE_LOCKING and so produces incorrect results in this use case.
Finally, many thanks for OpenBLAS ... it is a tremendously valuable tool and I appreciate the effort it takes to make it happen.
Thanks
Robert