Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OKL][Feature] @max_inner_dims attribute #531

Merged
merged 15 commits into from
Dec 6, 2021

Conversation

kris-rowe
Copy link
Member

Description

Closes #307

Introduces a new loop attribute @max_inner_dims(X, Y, Z), where X, Y, and Z are compile-time constants.

Example usage:

@max_inner_dims(16,32)
for(int ib=0; ib < BLOCK_M; ++ib; @outer) {
  for(int jb=0; jb < BLOCK_N; ++jb; @outer) {
    ...
    for(int it=0; it < 32; ++it; @inner) {
      for(int jt=0; jt < 16; ++jt; @inner) {
      ...
      }
    }
  }
}

When the range of the @inner loops within an @outer block can be determined at compile-time (e.g., because they are constant), this attribute is added automatically.

Details

If the outermost @outer-loop of a for-loop block is decorated with this OKL attribute, the following C++ attributes are added to the resulting backend kernel definitions:

  • CUDA/HIP: __launch_bounds(X*Y*Z)__
  • OpenCL: __attribute__((reqd_work_group_size(X, Y, Z)))
  • DPC++: [[sycl::reqd_work_group_size(X,Y,Z)]]

The number of arguments given should be greater than or equal to the number of @inner loops in the block.

In the case where @max_inner_dims is specified and the the range of the @inner loops can be determined, the former takes precedence—overriding the kernel launch bounds that would be added automatically.

Limitations

This attribute only affects the behaviour of the "launched" backends.
Currently CUDA, HIP, OpenCL, and DPC++ are supported: the Metal backend still needs an implementation.

@codecov
Copy link

codecov bot commented Oct 22, 2021

Codecov Report

Merging #531 (6e70742) into main (b07ec0b) will decrease coverage by 0.05%.
The diff coverage is 63.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #531      +/-   ##
==========================================
- Coverage   76.59%   76.54%   -0.06%     
==========================================
  Files         263      264       +1     
  Lines       19472    19550      +78     
==========================================
+ Hits        14915    14964      +49     
- Misses       4557     4586      +29     
Impacted Files Coverage Δ
...internal/lang/builtins/attributes/maxInnerDims.cpp 13.79% <13.79%> (ø)
src/occa/internal/lang/modes/withLauncher.cpp 95.12% <81.48%> (-1.23%) ⬇️
src/occa/internal/lang/expr/dpcppAtomicNode.cpp 77.77% <100.00%> (ø)
src/occa/internal/lang/modes/cuda.cpp 66.12% <100.00%> (+1.12%) ⬆️
src/occa/internal/lang/modes/dpcpp.cpp 80.37% <100.00%> (+0.90%) ⬆️
src/occa/internal/lang/modes/metal.cpp 98.97% <100.00%> (+0.02%) ⬆️
src/occa/internal/lang/modes/okl.cpp 93.08% <100.00%> (+0.03%) ⬆️
src/occa/internal/lang/modes/opencl.cpp 95.58% <100.00%> (+0.17%) ⬆️
src/occa/internal/lang/qualifier.cpp 82.17% <0.00%> (+1.17%) ⬆️

@kris-rowe kris-rowe mentioned this pull request Oct 22, 2021
2 tasks
Copy link
Contributor

@noelchalmers noelchalmers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works wonderfully. I've played around with it in the cuda lang test (tests/src/internal/lang/modes/cuda.cpp) by changing the multi-kernel source test to:

parseAndPrintSource(
    "const int var[10];\n"
    "void foo() {}\n"
    "int bar(int i) {}\n"
    "@kernel void kernel(@restrict int * arg, const int bar) {\n"
    "  @max_inner_dims(256)\n"
    "  for (int o1 = 0; o1 < O1; ++o1; @outer) {\n"
    "    for (int o0 = 0; o0 < O0; ++o0; @outer) {\n"
    "      @shared int shr[3];\n"
    "      @exclusive int excl;\n"
    "      if (true) {\n"
    "        for (int i1 = 10; i1 < (I1 + 4); i1 += 3; @inner) {\n"
    "          for (int i0 = 0; i0 < I0; ++i0; @inner) {\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "          }\n"
    "        }\n"
    "      }\n"
    "    }\n"
    "  }\n"
    "  for (int o1 = 0; o1 < O1; ++o1; @outer(0)) {\n"
    "    for (int o0 = 0; o0 < O0; ++o0; @outer(1)) {\n"
    "      @shared int shr[3];\n"
    "      @exclusive int excl;\n"
    "      if (true) {\n"
    "        for (int i1 = 10; i1 < (I1 + 4); i1 += 3; @inner(1)) {\n"
    "          for (int i0 = 0; i0 < I0; ++i0; @inner(0)) {\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "            for (;;) {\n"
    "               excl = i0;\n"
    "            }\n"
    "          }\n"
    "        }\n"
    "      }\n"
    "    }\n"
    "  }\n"
    "  @max_inner_dims(1024)\n"
    "  for (int ib = 0; ib < entries; ib += 16; @outer) {\n"
    "    for (int it = 0; it < 16; ++it; @inner) {\n"
    "      const int i = ib + it;\n"
    "      if (i < entries) {\n"
    "        ab[i] = a[i] + b[i];\n"
    "      }\n"
    "    }\n"
    "  }\n"
    "}\n"
  );

and it works a treat, with the following device code generated:

__constant__ int var[10];

__device__ void foo() {}

__device__ int bar(int i) {}

extern "C" __global__ __launch_bounds__(256) void _occa_kernel_0(int * __restrict__ arg,
                                                                 const int bar) {
  {
    int o1 = 0 + blockIdx.y;
    {
      int o0 = 0 + blockIdx.x;
      __shared__ int shr[3];
      int excl;
      if (true) {
        {
          int i1 = 10 + (3 * threadIdx.y);
          {
            int i0 = 0 + threadIdx.x;
            for (; ; ) {
              excl = i0;
            }
            for (; ; ) {
              excl = i0;
            }
          }
        }
      }
    }
  }
}

extern "C" __global__ void _occa_kernel_1(int * __restrict__ arg,
                                          const int bar) {
  {
    int o1 = 0 + blockIdx.x;
    {
      int o0 = 0 + blockIdx.y;
      __shared__ int shr[3];
      int excl;
      if (true) {
        {
          int i1 = 10 + (3 * threadIdx.y);
          {
            int i0 = 0 + threadIdx.x;
            for (; ; ) {
              excl = i0;
            }
            for (; ; ) {
              excl = i0;
            }
          }
        }
      }
    }
  }
}

extern "C" __global__ __launch_bounds__(1024) void _occa_kernel_2(int * __restrict__ arg,
                                                                  const int bar) {
  {
    int ib = 0 + (16 * blockIdx.x);
    {
      int it = 0 + threadIdx.x;
      const int i = ib + it;
      if (i < entries) {
        ab[i] = a[i] + b[i];
      }
    }
  }
}

The addVectors kernel in this test also has the launch bounds added automatically.

I'm not certain of all the tests being run and whether they adequately cover this. I'll leave that to @dmed256 to judge.

A small stylistic comment, the kernel name lines with the launch_bounds are getting pretty long. Should we consider adding line breaks after __global__ and __launch_bounds__ just to help with readability?

@kris-rowe kris-rowe merged commit 5db15e0 into libocca:main Dec 6, 2021
@kris-rowe kris-rowe deleted the launchBounds branch December 15, 2021 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New kernel qualifier macro
2 participants