In my opinion, when loading data from global memory into shared memory (i.e. writing shared memory) with vectorized access, threads within a warp may write to the same column of shared memory because of the transposition.
For example, thread 0 reads `A[0][0]` to `A[0][3]` and thread 1 reads `A[0][4]` to `A[0][7]`. So thread 0 writes `As[0][0]` to `As[3][0]`, and thread 1 writes `As[4][0]` to `As[7][0]`. For an `As` tile of size `BM` (=128) * `BK` (=8), `As[0][0]` and `As[4][0]` fall in the same bank (their word offsets differ by 4 * 128 = 512, a multiple of 32), causing a bank conflict.
So I think bank conflicts will only occur when writing `As`, not `Bs`. But in kernels v7 and v8, it seems like you try to optimize writing to `Bs`:
```cuda
tmp = reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 0] = tmp.x;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 1] = tmp.y;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 2] = tmp.z;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 3] = tmp.w;
```
Did I understand something wrong?
Snippet source: `SGEMM_CUDA/src/kernels/8_kernel_bank_extra_col.cuh`, lines 56 to 60 in 60cba6f.