Some question about cuda thread size.

In [ForRange](https://github.com/PaddlePaddle/Paddle/blob/7bf47ea40c2561feaf11da7d233cd2528eb72cc5/paddle/platform/for_range.h#L59) struct,  thread size seems to be assigned arbitrary value, the value is not multiple of the warp size.
As I read and heard that the thread size assigned to a block should be always multiple of the warp size(32), otherwise not only the remaining part of the warp goes unused and the performance is dropped too since bad memory coalescing. But I didn't find a comparative experiment on this. 

https://github.com/PaddlePaddle/Paddle/blob/7bf47ea40c2561feaf11da7d233cd2528eb72cc5/paddle/platform/for_range.h#L65-L75

	constexpr size_t num_threads = 1024;
	int block_size = limit_ <= num_threads ? limit_ : num_threads;
	int grid_size = (limit_ + num_threads - 1) / num_threads;

	if (grid_size == 1) {
	ForRangeElemwiseOpGridIsOne<<<1, block_size, 0, dev_ctx_.stream()>>>(
	func);
	} else {
	ForRangeElemwiseOp<<<grid_size, block_size, 0, dev_ctx_.stream()>>>(
	func, limit_);
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some question about cuda thread size. #7081

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Some question about cuda thread size. #7081

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions