Description
Summary
I have a CUDA kernel that takes a `union` as one of its parameters. One of the `union`'s member types is a struct (called `Ptrs` in my minimal working example) that contains padding due to the memory alignment requirements of its own members. I initialize the `union` as its other member type (`Data` in the MWE) and pass it to the kernel launch. Inside the kernel, however, some of the contents of `Data` appear to be zeroed out -- notably those corresponding to the padding region of `Ptrs`. It seems that Clang looks at the padding inside one of the `union`'s member types and assumes that all the members have holes in the same places.

When compiling the CUDA kernel, Clang should make no assumption about which member type is currently stored in a `union` that it receives as an argument, but it clearly does, and the assumption is wrong in this case. We initially found this bug in conjunction with Thrust and mpark::variant, but it can be observed with plain CUDA code too, as in the MWE.
Versions
I reproduced this bug with multiple Clang versions between 7.0.0 and 14.0.0-rc1, as well as the current master branch. Nvidia's nvcc compiler does not exhibit this issue.

Starting with d0615a9, my MWE required a minor modification to make sure that the argument copy isn't optimized out. This optimization is not possible in general (e.g. when using Thrust), so we cannot rely on it accidentally hiding the real issue.
Minimal working example
Compile with `clang++ -O3 -std=c++14 --cuda-gpu-arch=sm_70 mwe.cu -o mwe -lcudart`. The issue can also be observed without `-O3`. Add `-DFIX_IT` to get a struct without padding, in which case the issue does not occur.
```c++
#include <algorithm>
#include <iostream>
#include <vector>

#include <cuda.h>

struct Ptrs
{
    char s1, s2, s3, s4, s5, s6, s7;
#ifdef FIX_IT
    char s8; // fills the hole, so there is no padding before the pointer
#endif
    void const * data;
};

template <class T, unsigned N>
struct Data
{
    T data[N];
};

template <class T, unsigned N>
union DataType
{
    Data<T, N> d;
    Ptrs p;
};

__global__ void transform(DataType<unsigned, 4> umap, unsigned n, unsigned * result)
{
    const unsigned i = threadIdx.x;
    if(n > 0xffff) {
        /* This condition is never true, but the next line ensures that `umap` is copied
         * into writable memory.
         * Clang 13 (https://github.com/llvm/llvm-project/commit/d0615a93bb6d7aedc43323dc8957fe57e86ed8ae)
         * introduced an optimization that will otherwise mask the issue we are trying to
         * demonstrate.
         */
        umap.d.data[i] = 0;
    }
    if(i < n)
    {
        result[i] = umap.d.data[i];
    }
}

bool test()
{
    std::vector<unsigned> map{23U, 0xffffffff, 42U, 13U};
    DataType<unsigned, 4> umap;
    Data<unsigned, 4> d;
    std::copy(map.begin(), map.end(), &d.data[0]);
    umap.d = d;
    const unsigned mod = map.size();
    std::vector<unsigned> h_values(16);
    unsigned * values = nullptr;
    cudaMalloc(&values, h_values.size() * sizeof(unsigned));
    // Pass `mod` (the number of valid elements) rather than h_values.size(),
    // so that no thread reads past the end of umap.d.data.
    transform<<<1, h_values.size()>>>(umap, mod, values);
    cudaMemcpy(h_values.data(), values, h_values.size() * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(values);
    bool good = true;
    for(unsigned i = 0; i < mod; ++i)
    {
        std::cout << h_values[i];
        if (h_values[i] != map[i])
        {
            std::cout << " wrong";
            good = false;
        }
        std::cout << std::endl;
    }
    return good;
}

int main()
{
    bool good = test();
    return (good ? 0 : 1);
}
```
Assembly
To see what is happening, compile the MWE with and without `-DFIX_IT` and compare the NVPTX assembly: https://godbolt.org/z/W7PY9q3sj . You can see that in the former case, it generates `ld.param` and `st.local` instructions that cover the entire size of the `union`, while in the latter case it skips those bytes that are padding.