Description
Summary
I have a CUDA kernel that takes a `union` as one of its parameters. One of the `union`'s member types is a struct (called `Ptrs` in my minimal working example) that contains padding due to the memory alignment requirements of its own members. I initialize the `union` as its other member type (`Data` in the MWE) and pass it to the kernel launch. Inside the kernel, however, some of the contents of `Data` appear to be zeroed out -- notably those corresponding to the padding region of `Ptrs`. It seems that Clang looks at the padding inside one of the `union`'s member types and assumes that all the members have holes in the same places.

When compiling the CUDA kernel, Clang should make no assumption about which member type is currently stored in a `union` that it receives as an argument, but it clearly does, and the assumption is wrong in this case. We initially found this bug in conjunction with Thrust and mpark::variant, but it can be observed with plain CUDA code too, as in the MWE.
Versions
I reproduced this bug with multiple Clang versions between 7.0.0 and 14.0.0-rc1, as well as the current master branch. Nvidia's nvcc compiler does not exhibit this issue.

Starting with d0615a9, my MWE required a minor modification to make sure that the argument copy isn't optimized out. This optimization is not possible in general (e.g. when using Thrust), so we cannot rely on it accidentally hiding the real issue.
Minimal working example
Compile with `clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 mwe.cu -o mwe -lcudart`. The issue can also be observed without `-O3`. Add `-DFIX_IT` to get a struct without padding, in which case the issue does not occur.
```c++
#include <algorithm>
#include <iostream>
#include <vector>

#include <cuda.h>

struct Ptrs
{
    char s1, s2, s3, s4, s5, s6, s7;
#ifdef FIX_IT
    char s8; // fills the hole, so there is no padding before the pointer
#endif
    void const * data;
};

template <class T, unsigned N>
struct Data
{
    T data[N];
};

template <class T, unsigned N>
union DataType
{
    Data<T, N> d;
    Ptrs p;
};

__global__ void transform(DataType<unsigned, 4> umap, unsigned n, unsigned * result)
{
    const unsigned i = threadIdx.x;
    if(n > 0xffff) {
        /* This condition is never true, but the next line ensures that `umap` is copied
         * into writable memory.
         * Clang 13 (https://github.com/llvm/llvm-project/commit/d0615a93bb6d7aedc43323dc8957fe57e86ed8ae)
         * introduced an optimization that will otherwise mask the issue we are trying to
         * demonstrate.
         */
        umap.d.data[i] = 0;
    }
    if(i < n)
    {
        result[i] = umap.d.data[i];
    }
}

bool test()
{
    std::vector<unsigned> map{23U, 0xffffffff, 42U, 13U};
    DataType<unsigned, 4> umap;
    Data<unsigned, 4> d;
    std::copy(map.begin(), map.end(), &d.data[0]);
    umap.d = d;
    const unsigned mod = map.size();
    std::vector<unsigned> h_values(16);
    unsigned * values = nullptr;
    cudaMalloc(&values, h_values.size() * sizeof(unsigned));
    // Pass `mod` (the number of valid elements) rather than h_values.size(),
    // so that no thread reads past the end of umap.d.data.
    transform<<<1, h_values.size()>>>(umap, mod, values);
    cudaMemcpy(h_values.data(), values, h_values.size() * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(values);
    bool good = true;
    for(unsigned i = 0; i < mod; ++i)
    {
        std::cout << h_values[i];
        if (h_values[i] != map[i])
        {
            std::cout << " wrong";
            good = false;
        }
        std::cout << std::endl;
    }
    return good;
}

int main()
{
    bool good = test();
    return (good ? 0 : 1);
}
```
Assembly
To see what is happening, compile the MWE with and without `-DFIX_IT` and compare the NVPTX assembly: https://godbolt.org/z/W7PY9q3sj . You can see that in the former case, it generates `ld.param` and `st.local` instructions that cover the entire size of the `union`, while in the latter case it skips those bytes that are padding.