Skip to content

[Clang][CUDA][HIP] lambda capture of constexpr variable inconsistent between host and device #132068

Closed
@mkuron

Description

@mkuron

Consider the following bit of HIP code:

#include <algorithm>

using std::max;

template<typename F>
static void __global__
kernel(F f)
{
  f(1);
}

void test(float const * fl, float const * A, float * Vf)
{
  float constexpr small(1.0e-25);

  auto f = [=] __device__ __host__ (unsigned int n) {
    float const value = max(small, fl[0]);
    Vf[0] = value * A[0];
  };
  static_assert(sizeof(f) == sizeof(fl) + sizeof(A) + sizeof(Vf));
  kernel<<<1,1>>>(f);
}

The static_assert fails in the host-side compilation but succeeds in the device-side compilation. This means that the layout of the struct synthesized from the lambda is inconsistent between host and device, so if you use any of the captured variables on the device side, they will contain the data of some of the other variables. You can also use -Xclang -fdump-record-layouts to see that. Evidently the constexpr variable is part of the captured variables only on the host side, but not on the device side.
With --cuda-host-only:

*** Dumping AST Record Layout
         0 | class (lambda at <source>:23:12)
         0 |   const float * 
         8 |   const float 
        16 |   float * 
        24 |   const float * 
           | [sizeof=32, dsize=32, align=8,
           |  nvsize=32, nvalign=8]

With --cuda-device-only:

*** Dumping AST Record Layout
         0 | class (lambda at <source>:23:12)
         0 |   const float * 
         8 |   float * 
        16 |   const float * 
           | [sizeof=24, dsize=24, align=8,
           |  nvsize=24, nvalign=8]

Godbolt: https://cuda.godbolt.org/z/KE789sevs.

When you compile the exact same code for CUDA, this does not happen. However, if you add

template <typename T, std::enable_if_t<std::is_arithmetic<T>::value, int> = 0>
__host__ T max(const T a, const T b) {
    return std::max(a, b);
}

after line 3 of the code at the top, you get the exact same layout discrepancy as with HIP. See https://cuda.godbolt.org/z/e3Ybr4hK1.

I can replace the [=] with [fl, A, Vf] and if that __host__ T max overload is present, it tells me that variable 'small' cannot be implicitly captured in a lambda with no capture-default specified, but if I leave out that overload it does not show that error message.

The example code uses std::max (from the GNU libstdc++), but I have no doubt that the same issue can easily be demonstrated without making use of any library headers.

nvcc, for comparison, does not exhibit this issue (https://cuda.godbolt.org/z/33MvMc755). It has a consistent capture size between host and device. Judging by the size, the constexpr variable seems to be part of the capture, though in contrast to Clang, a lambda appears to always occupy 8 bytes more than required for its captured variables.

Metadata

Metadata

Assignees

Labels

clang:frontendLanguage frontend issues, e.g. anything involving "Sema"cuda

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions