[Cuda Codegen] Emit launch bounds #526
base: master
Did you check what happens if somebody manually maps to `.mapToThreads(32,0,0)`?
CUDA functions can be annotated with launch bounds, i.e. the maximum number of threads per block (a minimum number of blocks per multiprocessor can also be specified). nvrtc/nvcc use this information during register allocation (and probably in other phases as well).
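For illustration, here is a minimal sketch of how a code generator could emit such an annotation from the block extents. The helper names `launchBoundsAnnotation` and `kernelSignature` and the `float* out` parameter are hypothetical and are not taken from this PR. Only the `__launch_bounds__(maxThreadsPerBlock)` qualifier itself is standard CUDA.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not the PR's code): builds the launch-bounds
// qualifier for a kernel launched with bx * by * bz threads per block.
std::string launchBoundsAnnotation(size_t bx, size_t by, size_t bz) {
  std::stringstream ss;
  ss << "__launch_bounds__(" << bx * by * bz << ")";
  return ss.str();
}

// Hypothetical helper: assembles the kernel signature in the form
// "__global__ void __launch_bounds__(N) name(...)".
std::string kernelSignature(
    const std::string& name, size_t bx, size_t by, size_t bz) {
  return "__global__ void " + launchBoundsAnnotation(bx, by, bz) + " " +
      name + "(float* out)";
}
```

With extents 32, 8 and 1, `kernelSignature("kernel", 32, 8, 1)` yields `__global__ void __launch_bounds__(256) kernel(float* out)`.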
4ec077e to f6a78dc
Fixed.
I'm just putting a block on this because it fails for me with
[ RUN ] TensorDot_32_512_8_2_28_28.BaseCorrect
unknown file: Failure
C++ exception with description "Error at: /home/skimo/git/c2isl/tc/core/cuda/cuda_rtc.cc:188: CUDA_ERROR_INVALID_VALUE" thrown in the test body.
[ FAILED ] TensorDot_32_512_8_2_28_28.BaseCorrect (540 ms)
On Tue, Jun 19, 2018 at 08:09:18AM -0700, ftynse wrote:
did you check what happens if somebody manually maps to `.mapToThreads(32,0,0)` ?
Does that make any sense?
Surely, the kernel is not going to run at all in that case,
so why bother with special cases for this situation?
skimo
Oh, this test is failing for me as well. However, if I dump the CUDA code and compile it with nvcc, then I see no error.
auto b1 = block.view[1];
b1 = b1 == 0 ? 1 : b1;
auto b2 = block.view[2];
b1 = b2 == 0 ? 1 : b2;
This should be `b2` instead of `b1`.
However, I would suggest you remove this special handling of 0.
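To make the two options concrete, here is a rough sketch of both variants. It assumes the block extents are available as a three-element array and that the first extent needs no clamping. Both function names are hypothetical, and neither is the code that ended up in the PR.

```cpp
#include <array>
#include <cstddef>

// Variant with the typo fixed: each trailing extent is clamped
// independently, so a 0 extent is treated as 1.
size_t threadsPerBlockClamped(const std::array<size_t, 3>& view) {
  auto b0 = view[0]; // assumption: the first extent is used as-is
  auto b1 = view[1] == 0 ? 1 : view[1];
  auto b2 = view[2] == 0 ? 1 : view[2];
  return b0 * b1 * b2;
}

// Variant following the suggestion above: no special handling of 0,
// so a mapping such as (32, 0, 0) yields 0 threads per block.
size_t threadsPerBlock(const std::array<size_t, 3>& view) {
  return view[0] * view[1] * view[2];
}
```

For extents (32, 0, 0), the clamped variant returns 32 while the plain product returns 0. This matches the point made earlier in the thread that such a kernel would not run anyway.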
trying to unsubscribe, don't see a way other than approving