[OKL][Feature] `@max_inner_dims` attribute #531

kris-rowe · 2021-10-22T04:14:02Z

Description

Closes #307

Introduces a new loop attribute @max_inner_dims(X, Y, Z), where X, Y, and Z are compile-time constants.

Example usage:

@max_inner_dims(16,32)
for(int ib=0; ib < BLOCK_M; ++ib; @outer) {
  for(int jb=0; jb < BLOCK_N; ++jb; @outer) {
    ...
    for(int it=0; it < 32; ++it; @inner) {
      for(int jt=0; jt < 16; ++jt; @inner) {
      ...
      }
    }
  }
}

When the range of the @inner loops within an @outer block can be determined at compile-time (e.g., because they are constant), this attribute is added automatically.

Details

If the outermost @outer-loop of a for-loop block is decorated with this OKL attribute, the following C++ attributes are added to the resulting backend kernel definitions:

CUDA/HIP: __launch_bounds(X*Y*Z)__
OpenCL: __attribute__((reqd_work_group_size(X, Y, Z)))
DPC++: [[sycl::reqd_work_group_size(X,Y,Z)]]

The number of arguments given should be greater than or equal to the number of @inner loops in the block.

In the case where @max_inner_dims is specified and the the range of the @inner loops can be determined, the former takes precedence—overriding the kernel launch bounds that would be added automatically.

Limitations

This attribute only affects the behaviour of the "launched" backends.
Currently CUDA, HIP, OpenCL, and DPC++ are supported: the Metal backend still needs an implementation.

time.

depricated.

warnings from clang.

codecov · 2021-10-22T04:41:16Z

Codecov Report

Merging #531 (6e70742) into main (b07ec0b) will decrease coverage by 0.05%.
The diff coverage is 63.85%.

@@            Coverage Diff             @@
##             main     #531      +/-   ##
==========================================
- Coverage   76.59%   76.54%   -0.06%     
==========================================
  Files         263      264       +1     
  Lines       19472    19550      +78     
==========================================
+ Hits        14915    14964      +49     
- Misses       4557     4586      +29

Impacted Files	Coverage Δ
...internal/lang/builtins/attributes/maxInnerDims.cpp	`13.79% <13.79%> (ø)`
src/occa/internal/lang/modes/withLauncher.cpp	`95.12% <81.48%> (-1.23%)`	⬇️
src/occa/internal/lang/expr/dpcppAtomicNode.cpp	`77.77% <100.00%> (ø)`
src/occa/internal/lang/modes/cuda.cpp	`66.12% <100.00%> (+1.12%)`	⬆️
src/occa/internal/lang/modes/dpcpp.cpp	`80.37% <100.00%> (+0.90%)`	⬆️
src/occa/internal/lang/modes/metal.cpp	`98.97% <100.00%> (+0.02%)`	⬆️
src/occa/internal/lang/modes/okl.cpp	`93.08% <100.00%> (+0.03%)`	⬆️
src/occa/internal/lang/modes/opencl.cpp	`95.58% <100.00%> (+0.17%)`	⬆️
src/occa/internal/lang/qualifier.cpp	`82.17% <0.00%> (+1.17%)`	⬆️

noelchalmers

This works wonderfully. I've played around with it in the cuda lang test (tests/src/internal/lang/modes/cuda.cpp) by changing the multi-kernel source test to:

parseAndPrintSource(
    "const int var[10];\n"
    "void foo() {}\n"
    "int bar(int i) {}\n"
    "@kernel void kernel(@restrict int * arg, const int bar) {\n"
    "  @max_inner_dims(256)\n"
    "  for (int o1 = 0; o1 < O1; ++o1; @outer) {\n"
    "    for (int o0 = 0; o0 < O0; ++o0; @outer) {\n"
    "      @shared int shr[3];\n"
    "      @exclusive int excl;\n"
    "      if (true) {\n"
    "        for (int i1 = 10; i1 < (I1 + 4); i1 += 3; @inner) {\n"
    "          for (int i0 = 0; i0 < I0; ++i0; @inner) {\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "          }\n"
    "        }\n"
    "      }\n"
    "    }\n"
    "  }\n"
    "  for (int o1 = 0; o1 < O1; ++o1; @outer(0)) {\n"
    "    for (int o0 = 0; o0 < O0; ++o0; @outer(1)) {\n"
    "      @shared int shr[3];\n"
    "      @exclusive int excl;\n"
    "      if (true) {\n"
    "        for (int i1 = 10; i1 < (I1 + 4); i1 += 3; @inner(1)) {\n"
    "          for (int i0 = 0; i0 < I0; ++i0; @inner(0)) {\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "          }\n"
    "        }\n"
    "      }\n"
    "    }\n"
    "  }\n"
    "  @max_inner_dims(1024)\n"
    "  for (int ib = 0; ib < entries; ib += 16; @outer) {\n"
    "    for (int it = 0; it < 16; ++it; @inner) {\n"
    "      const int i = ib + it;\n"
    "      if (i < entries) {\n"
    "        ab[i] = a[i] + b[i];\n"
    "      }\n"
    "    }\n"
    "  }\n"
    "}\n"
  );

and it works a treat, with the following device code generated:

__constant__ int var[10];

__device__ void foo() {}

__device__ int bar(int i) {}

extern "C" __global__ __launch_bounds__(256) void _occa_kernel_0(int * __restrict__ arg,
                                                                 const int bar) {
  {
    int o1 = 0 + blockIdx.y;
    {
      int o0 = 0 + blockIdx.x;
      __shared__ int shr[3];
      int excl;
      if (true) {
        {
          int i1 = 10 + (3 * threadIdx.y);
          {
            int i0 = 0 + threadIdx.x;
            for (; ; ) {
              excl = i0;
            }
            for (; ; ) {
              excl = i0;
            }
          }
        }
      }
    }
  }
}

extern "C" __global__ void _occa_kernel_1(int * __restrict__ arg,
                                          const int bar) {
  {
    int o1 = 0 + blockIdx.x;
    {
      int o0 = 0 + blockIdx.y;
      __shared__ int shr[3];
      int excl;
      if (true) {
        {
          int i1 = 10 + (3 * threadIdx.y);
          {
            int i0 = 0 + threadIdx.x;
            for (; ; ) {
              excl = i0;
            }
            for (; ; ) {
              excl = i0;
            }
          }
        }
      }
    }
  }
}

extern "C" __global__ __launch_bounds__(1024) void _occa_kernel_2(int * __restrict__ arg,
                                                                  const int bar) {
  {
    int ib = 0 + (16 * blockIdx.x);
    {
      int it = 0 + threadIdx.x;
      const int i = ib + it;
      if (i < entries) {
        ab[i] = a[i] + b[i];
      }
    }
  }
}

The addVectors kernel in this test also has the launch bounds added automatically.

I'm not certain of all the tests being run and whether they adequately cover this. I'll leave that to @dmed256 to judge.

A small stylistic comment, the kernel name lines with the launch_bounds are getting pretty long. Should we consider adding line breaks after __global__ and __launch_bounds__ just to help with readability?

Stefan and others added 12 commits October 22, 2021 02:37

add kernel launch bounds

db0e7be

Remove debug statement.

9bf0e52

Handle cass with multiple @outer loops, undefined @tile size.

6e55a25

Add kernel launch bounds for the DPC++ backend.

5ec40ae

[WIP][OKL] Add @max_inner_dims attribute.

d5c719a

[WIP][OKL] Make maxInnerDims a for-loop attribute.

8a0424b

Reintroduce automatic addition of launch bounds when known at compile

84c80d3

time.

Defer metal launch bounds attribute for now.

db2a79d

Don't allow for zero tile-size when adding launch bounds.

9faeeae

Add sycl:: prefix to launch bounds attribute for DPC++.

30bd6b5

Replace sycl::ONEAPI with sycl::ext::oneapi since the former is

5bd1d94

depricated.

Add override to virtual functions in parser classes to silence

d1bebdb

warnings from clang.

kris-rowe mentioned this pull request Oct 22, 2021

Add kernel launch bounds #519

Closed

2 tasks

noelchalmers requested review from noelchalmers and dmed256 October 22, 2021 14:35

noelchalmers approved these changes Nov 14, 2021

View reviewed changes

kris-rowe added 3 commits November 15, 2021 12:28

Merge branch 'main' into launchBounds

7878397

Merge branch 'main' into launchBounds

4819fb6

Merge branch 'main' into launchBounds

6e70742

kris-rowe merged commit 5db15e0 into libocca:main Dec 6, 2021

kris-rowe deleted the launchBounds branch December 15, 2021 17:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OKL][Feature] `@max_inner_dims` attribute #531

[OKL][Feature] `@max_inner_dims` attribute #531

kris-rowe commented Oct 22, 2021

codecov bot commented Oct 22, 2021 •

edited

Loading

noelchalmers left a comment

[OKL][Feature] @max_inner_dims attribute #531

[OKL][Feature] @max_inner_dims attribute #531

Conversation

kris-rowe commented Oct 22, 2021

Description

Details

Limitations

codecov bot commented Oct 22, 2021 • edited Loading

Codecov Report

noelchalmers left a comment

Choose a reason for hiding this comment

[OKL][Feature] `@max_inner_dims` attribute #531

[OKL][Feature] `@max_inner_dims` attribute #531

codecov bot commented Oct 22, 2021 •

edited

Loading