Solve bank conflict #8

@yofufufufu

In my opinion, when data is loaded from global memory into shared memory (i.e. when shared memory is written) with vectorized accesses, the transposition can make threads within a warp write to the same column of shared memory.
For example, thread 0 reads A[0][0] to A[0][3] and thread 1 reads A[0][4] to A[0][7], so thread 0 writes As[0][0] to As[3][0] and thread 1 writes As[4][0] to As[7][0]. For an As tile of size BM (=128) * BK (=8), As[0][0] and As[4][0] fall in the same bank: their offsets differ by 4 * BM = 512 floats, which is a multiple of the 32 banks. This causes a bank conflict.
So I think bank conflicts can only occur when writing As, not Bs. But in kernels v7 and v8, it seems that you are trying to optimize the writes to Bs:

tmp = reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 0] = tmp.x;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 1] = tmp.y;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 2] = tmp.z;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 3] = tmp.w;

Did I misunderstand something?
