
Thread scaling of concurrent small dgemm operations #2712

@robertjharrison

Description

My app performs many small dgemms, each invoked from a separate thread (via a task pool). As recommended, I compiled OpenBLAS 0.3.10 with USE_THREAD=0 and USE_LOCKING=1. This is on a Cavium ThunderX2 with GCC 9.2.0.

Thread scaling is poor, presumably because of the locking. Replacing the OpenBLAS dgemm call with a hand-written kernel using NEON intrinsics gives comparable single-thread performance, but with 30 threads pinned to the cores of a single socket (which has 32 physical cores) the hand-written code is about 6x faster, entirely due to its superior thread scaling.
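
For reference, here is a minimal sketch of the calling pattern (not my actual application code): each worker thread repeatedly calls cblas_dgemm on its own small, thread-private matrices. The 16x16 size, repeat count, and raw-pthread driver are placeholders standing in for the real task pool.

```c
/*
 * Sketch of the concurrent small-dgemm pattern.  OpenBLAS is assumed to
 * have been built with USE_THREAD=0 USE_LOCKING=1 as described above.
 * Build (roughly): gcc -O2 sketch.c -lopenblas -lpthread
 */
#include <pthread.h>
#include <stdlib.h>
#include <cblas.h>

enum { DIM = 16, REPS = 100000, NTHREADS = 30 };

static void *worker(void *arg)
{
    (void)arg;
    /* Each thread owns its buffers; the only shared state is whatever
       OpenBLAS keeps internally (hence the need for locking). */
    double *a = calloc((size_t)DIM * DIM, sizeof(double));
    double *b = calloc((size_t)DIM * DIM, sizeof(double));
    double *c = calloc((size_t)DIM * DIM, sizeof(double));
    for (int r = 0; r < REPS; ++r)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    DIM, DIM, DIM, 1.0, a, DIM, b, DIM, 0.0, c, DIM);
    free(a); free(b); free(c);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```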

The dominant dgemm shapes are (12,144)^T (12,12) and (16,256)^T (16,16).
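
For concreteness, here is how I read the first of those shapes as a cblas_dgemm call (column-major, A passed transposed); this is illustrative rather than an exact excerpt from my code:

```c
#include <cblas.h>

/* C(144x12) = A(12x144)^T * B(12x12), column-major storage. */
void small_dgemm_example(const double *A, const double *B, double *C)
{
    const int m = 144, n = 12, k = 12;
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,   /* A stored k x m, lda = k */
                     B, k,   /* B stored k x n, ldb = k */
                0.0, C, m);  /* C stored m x n, ldc = m */
}
```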

Is there a lower-level interface to your optimized small-matrix kernels that I could call directly, bypassing the static memory buffers that require the lock protection?

Also, please note that the default Red Hat EL 8.3 openblas_serial package is not compiled with USE_LOCKING and so produces incorrect results in this use case.

Finally, many thanks for OpenBLAS ... it is a tremendously valuable tool and I appreciate the effort it takes to make it happen.

Thanks

Robert
