Skip to content

CUDA: unions in kernel arguments not copied completely if member contains padding #53710

Closed
@mkuron

Description

@mkuron

Summary

I have a CUDA kernel that takes a union as one of its parameters. One of the union's member types is a struct (called Ptrs in my minimal working example) that contains some padding due to the memory alignment requirements of its own members. I initialize the union as its other member type (Data in the MWE) and pass it to the kernel launch. Inside the kernel however, some of the contents of Data appear to be zeroed out -- notably those corresponding to the padding region of Ptrs. It seems like Clang looks at the padding inside one of the union's member types and assumes that all the members have holes in the same places.

When compiling the CUDA kernel, Clang should make no assumption about which one of the member types is currently stored in a union that it gets as an argument, but it clearly does and it's the wrong assumption in this case. We initially found this bug in conjunction with thrust and mpark::variant, but it can be observed with plain CUDA code too as in the MWE.

Versions

I reproduced this bug with multiple Clang versions between 7.0.0 and 14.0.0-rc1, as well as the current master branch. Nvidia's nvcc compiler does not exhibit such an issue.

Starting with d0615a9, my MWE required a minor modification to make sure that the argument copy wasn't optimized out. This optimization is not generally possible (e.g. when using thrust), so we cannot rely on it accidentally fixing the real issue.

Minimal working example

Compile with clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 mwe.cu -o mwe.cu -lcudart. The issue can also be observed without -O3. Add -DFIX_IT to get a struct without padding, in which case the issue does not occur.

#include <algorithm>
#include <iostream>
#include <vector>

#include <cuda.h>

struct Ptrs
{
  char s1, s2, s3, s4, s5, s6, s7;
#ifdef FIX_IT
  char s8;
#endif
  void const * data;
};

template <class T, unsigned N>
struct Data
{
  T data[N];
};

template <class T, unsigned N>
union DataType
{
  Data<T,N> d;
  Ptrs p;
};

__global__ void transform(DataType<unsigned, 4> umap, unsigned n, unsigned * result)
{
  const unsigned i = threadIdx.x;

  if(n > 0xffff) {
    /* This condition is never true, but the next line ensures that `umap` is copied into writable memory.
     * Clang 13 (https://github.com/llvm/llvm-project/commit/d0615a93bb6d7aedc43323dc8957fe57e86ed8ae)
     * introduced an optimization that will otherwise mask the issue we are trying to demonstrate.
     */
    umap.d.data[i] = 0;
  }

  if(i < n)
  {
    result[i] = umap.d.data[i];
  }
}

bool test()
{
  std::vector<unsigned> map{23U, 0xffffffff, 42U, 13U};
  DataType<unsigned, 4> umap;
  Data<unsigned, 4> d;
  std::copy(map.begin(), map.end(), &d.data[0]);
  umap.d = d;

  const unsigned mod = map.size();

  std::vector<unsigned> h_values(16);

  unsigned * values = nullptr;
  cudaMalloc(&values, h_values.size() * sizeof(unsigned));
  transform<<<1, h_values.size()>>>(umap, h_values.size(), values);
  cudaMemcpy(h_values.data(), values, h_values.size() * sizeof(unsigned), cudaMemcpyDeviceToHost);
  cudaFree(values);

  bool good = true;
  for(int i = 0; i < mod; ++i)
  {
    std::cout << h_values[i];
    if (h_values[i] != map[i])
    {
      std::cout << " wrong";
      good = false;
    }
    std::cout << std::endl;
  }

  return good;
}

int main()
{
    bool good = test();
    return (good ? 0 : 1);
}

Assembly

To see what is happening, compile the MWE with and without -DFIX_IT and compare the NVPTX assembly: https://godbolt.org/z/W7PY9q3sj . You can see that in the former case, it generates ld.param and st.local instructions that cover the entire size of the union, while in the latter case it skips those bytes that are padding.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions