Solve bank conflict

As I understand it, when data is loaded from global memory into shared memory (i.e. written to shared memory) with vectorized access, the transposition can make threads within a warp write to the same column of shared memory.
For example, thread 0 reads `A[0][0]` to `A[0][3]` and thread 1 reads `A[0][4]` to `A[0][7]`. After the transpose, thread 0 writes `As[0][0]` to `As[3][0]` and thread 1 writes `As[4][0]` to `As[7][0]`. For an `As` of size `BM(=128) * BK(=8)`, `As[0][0]` and `As[4][0]` fall in the same bank (assuming 32 four-byte banks, their offsets differ by a multiple of 32 floats), which causes a bank conflict.
So I think bank conflicts can only occur when writing `As`, not `Bs`. But in kernels v7 and v8, it looks like you are trying to optimize the writes to `Bs`:
https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/8_kernel_bank_extra_col.cuh#L56-L60
Did I understand something wrong?

	tmp = reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
	Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 0] = tmp.x;
	Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 1] = tmp.y;
	Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 2] = tmp.z;
	Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 3] = tmp.w;
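
To make the `As` example above checkable, here is a minimal host-side C++ sketch (no GPU needed) that counts how many distinct addresses one warp-wide store sends to the same bank. It rests on several assumptions that are not stated in this issue: 32 banks of 4 bytes each, a warp of 32 threads, `As` stored transposed with a row stride of `BM` floats, and the thread mapping `innerRowA = tid / (BK / 4)`, `innerColA = tid % (BK / 4)`. The names `conflictDegree`, `asWriteOffset` and `printAsWriteConflicts` are hypothetical helpers, not part of the repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <set>
#include <vector>

// Assumptions (not taken from the kernel source): 32 four-byte banks,
// 32 threads per warp, BM = 128, BK = 8.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;
constexpr int BM = 128;
constexpr int BK = 8;

// For one store instruction issued by a warp, return the largest number of
// distinct float offsets that land in the same bank. 1 means conflict-free.
int conflictDegree(const std::function<int(int)>& offsetOf) {
  std::vector<std::set<int>> perBank(kBanks);
  for (int tid = 0; tid < kWarpSize; ++tid) {
    int off = offsetOf(tid);
    perBank[off % kBanks].insert(off);  // same offset = broadcast, counted once
  }
  std::size_t worst = 0;
  for (const auto& offsets : perBank) worst = std::max(worst, offsets.size());
  return static_cast<int>(worst);
}

// Float offset written by thread `tid` for component `c` (0..3) of its float4,
// assuming the transposed layout As[(innerColA * 4 + c) * BM + innerRowA].
int asWriteOffset(int tid, int c) {
  int innerRowA = tid / (BK / 4);
  int innerColA = tid % (BK / 4);
  return (innerColA * 4 + c) * BM + innerRowA;
}

// Hypothetical driver: print the conflict degree of each of the four scalar
// stores that make up the transposed write to As.
void printAsWriteConflicts() {
  for (int c = 0; c < 4; ++c) {
    int degree = conflictDegree([c](int tid) { return asWriteOffset(tid, c); });
    std::printf("As write, component %d: %d-way\n", c, degree);
  }
}
```

Calling `printAsWriteConflicts()` from `main` reports the degree for each of the four stores. The same `conflictDegree` helper can be given the `Bs` index expression from the snippet above, with and without `extraCols`, to compare the two cases under the same assumptions.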
