
A question about CUDA thread size #7081

Closed
@chengduoZH


In the ForRange struct, the thread block size appears to be set to an arbitrary value: when limit_ is smaller than num_threads, block_size is simply limit_, which is not necessarily a multiple of the warp size.
From what I have read and heard, the number of threads per block should always be a multiple of the warp size (32). Otherwise the remaining lanes of the last warp go unused, and performance may also drop because of poor memory coalescing. However, I have not found a comparative experiment on this.

    constexpr size_t num_threads = 1024;
    int block_size = limit_ <= num_threads ? limit_ : num_threads;
    int grid_size = (limit_ + num_threads - 1) / num_threads;
    if (grid_size == 1) {
      ForRangeElemwiseOpGridIsOne<<<1, block_size, 0, dev_ctx_.stream()>>>(
          func);
    } else {
      ForRangeElemwiseOp<<<grid_size, block_size, 0, dev_ctx_.stream()>>>(
          func, limit_);
    }
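
For comparison, here is a minimal host-side sketch of what rounding the block size up to a multiple of the warp size could look like. It is not Paddle code: `kWarpSize`, `kMaxThreads`, `LaunchConfig`, `RoundUpToWarp` and `MakeLaunchConfig` are hypothetical names introduced only for illustration, and the warp size of 32 is assumed as stated above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; none of these names exist in Paddle.
constexpr std::size_t kWarpSize = 32;      // assumed warp size, as stated in the issue
constexpr std::size_t kMaxThreads = 1024;  // same cap as num_threads in the snippet above

struct LaunchConfig {
  std::size_t grid_size;
  std::size_t block_size;
};

// Round n up to the next multiple of kWarpSize (e.g. 100 becomes 128).
constexpr std::size_t RoundUpToWarp(std::size_t n) {
  return (n + kWarpSize - 1) / kWarpSize * kWarpSize;
}

// Same grid computation as the original snippet, but the block size is
// padded to a whole number of warps when limit is below the cap.
inline LaunchConfig MakeLaunchConfig(std::size_t limit) {
  assert(limit > 0);
  const std::size_t block_size =
      limit <= kMaxThreads ? RoundUpToWarp(limit) : kMaxThreads;
  const std::size_t grid_size = (limit + kMaxThreads - 1) / kMaxThreads;
  return {grid_size, block_size};
}
```

With a padded block size, a single-block launch can contain more threads than there are elements, so the kernel would need an index check against the limit. The ForRangeElemwiseOpGridIsOne call above passes only func, so it would presumably need the limit as an extra argument for this to be safe. Whether the padding makes any measurable difference is exactly the open question here, since the padded lanes still do no useful work.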
