There are some other kernels in segment_pooling.cu, such as SegmentMeanKernel, that share the same launch config. Could other kernel functions that have the same launch config cause the same problem?
There might be, but currently no other kernels encounter the same problem on V100. NVCC doesn't know the runtime launch config, so it doesn't limit register usage. For example, this kernel can be run with 128/256/512 threads per block; if NVCC limited register usage, it might reduce the performance of those configurations.
BTW, in my experience 1024 threads per block results in lower performance than 128/256 threads per block. CUDA Best Practices also says that
However, it would take a large effort to run performance benchmarks and verification on every op that uses this launch config.