-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OKL][Feature] @max_inner_dims
attribute
#531
Conversation
warnings from clang.
Codecov Report
@@ Coverage Diff @@
## main #531 +/- ##
==========================================
- Coverage 76.59% 76.54% -0.06%
==========================================
Files 263 264 +1
Lines 19472 19550 +78
==========================================
+ Hits 14915 14964 +49
- Misses 4557 4586 +29
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works wonderfully. I've played around with it in the cuda lang test (tests/src/internal/lang/modes/cuda.cpp) by changing the multi-kernel source test to:
parseAndPrintSource(
"const int var[10];\n"
"void foo() {}\n"
"int bar(int i) {}\n"
"@kernel void kernel(@restrict int * arg, const int bar) {\n"
" @max_inner_dims(256)\n"
" for (int o1 = 0; o1 < O1; ++o1; @outer) {\n"
" for (int o0 = 0; o0 < O0; ++o0; @outer) {\n"
" @shared int shr[3];\n"
" @exclusive int excl;\n"
" if (true) {\n"
" for (int i1 = 10; i1 < (I1 + 4); i1 += 3; @inner) {\n"
" for (int i0 = 0; i0 < I0; ++i0; @inner) {\n"
" for (;;) {\n"
" excl = i0;\n"
" }\n"
" for (;;) {\n"
" excl = i0;\n"
" }\n"
" }\n"
" }\n"
" }\n"
" }\n"
" }\n"
" for (int o1 = 0; o1 < O1; ++o1; @outer(0)) {\n"
" for (int o0 = 0; o0 < O0; ++o0; @outer(1)) {\n"
" @shared int shr[3];\n"
" @exclusive int excl;\n"
" if (true) {\n"
" for (int i1 = 10; i1 < (I1 + 4); i1 += 3; @inner(1)) {\n"
" for (int i0 = 0; i0 < I0; ++i0; @inner(0)) {\n"
" for (;;) {\n"
" excl = i0;\n"
" }\n"
" for (;;) {\n"
" excl = i0;\n"
" }\n"
" }\n"
" }\n"
" }\n"
" }\n"
" }\n"
" @max_inner_dims(1024)\n"
" for (int ib = 0; ib < entries; ib += 16; @outer) {\n"
" for (int it = 0; it < 16; ++it; @inner) {\n"
" const int i = ib + it;\n"
" if (i < entries) {\n"
" ab[i] = a[i] + b[i];\n"
" }\n"
" }\n"
" }\n"
"}\n"
);
and it works a treat, with the following device code generated:
__constant__ int var[10];
__device__ void foo() {}
__device__ int bar(int i) {}
extern "C" __global__ __launch_bounds__(256) void _occa_kernel_0(int * __restrict__ arg,
const int bar) {
{
int o1 = 0 + blockIdx.y;
{
int o0 = 0 + blockIdx.x;
__shared__ int shr[3];
int excl;
if (true) {
{
int i1 = 10 + (3 * threadIdx.y);
{
int i0 = 0 + threadIdx.x;
for (; ; ) {
excl = i0;
}
for (; ; ) {
excl = i0;
}
}
}
}
}
}
}
extern "C" __global__ void _occa_kernel_1(int * __restrict__ arg,
const int bar) {
{
int o1 = 0 + blockIdx.x;
{
int o0 = 0 + blockIdx.y;
__shared__ int shr[3];
int excl;
if (true) {
{
int i1 = 10 + (3 * threadIdx.y);
{
int i0 = 0 + threadIdx.x;
for (; ; ) {
excl = i0;
}
for (; ; ) {
excl = i0;
}
}
}
}
}
}
}
extern "C" __global__ __launch_bounds__(1024) void _occa_kernel_2(int * __restrict__ arg,
const int bar) {
{
int ib = 0 + (16 * blockIdx.x);
{
int it = 0 + threadIdx.x;
const int i = ib + it;
if (i < entries) {
ab[i] = a[i] + b[i];
}
}
}
}
The addVectors
kernel in this test also has the launch bounds added automatically.
I'm not certain of all the tests being run and whether they adequately cover this. I'll leave that to @dmed256 to judge.
A small stylistic comment, the kernel name lines with the launch_bounds are getting pretty long. Should we consider adding line breaks after __global__
and __launch_bounds__
just to help with readability?
Description
Closes #307
Introduces a new loop attribute
@max_inner_dims(X, Y, Z)
, whereX
,Y
, andZ
are compile-time constants.Example usage:
When the range of the
@inner
loops within an@outer
block can be determined at compile-time (e.g., because they are constant), this attribute is added automatically.Details
If the outermost
@outer
-loop of a for-loop block is decorated with this OKL attribute, the following C++ attributes are added to the resulting backend kernel definitions:__launch_bounds(X*Y*Z)__
__attribute__((reqd_work_group_size(X, Y, Z)))
[[sycl::reqd_work_group_size(X,Y,Z)]]
The number of arguments given should be greater than or equal to the number of
@inner
loops in the block.In the case where
@max_inner_dims
is specified and the the range of the@inner
loops can be determined, the former takes precedence—overriding the kernel launch bounds that would be added automatically.Limitations
This attribute only affects the behaviour of the "launched" backends.
Currently CUDA, HIP, OpenCL, and DPC++ are supported: the Metal backend still needs an implementation.