-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align > 32 bytes #1563
Comments
@Flamefire I can't reproduce any errors with the given example. Could you please try it again? I used the following modules on K80 and k20:
|
I can't get onto the cluster queues so I verified this locally: Result:
|
Could it be an driver issue? |
Ok I was able to test it on k80 now:
CUDA 7.5 shows no error for this case. However I was able to reproduce this behaviour also for 7.5 by using Made an example that tests this for all sizes:
|
It look like it is not allowed to align over the cache line size. Get cache line size in linux
|
I don't think so. Output at my laptop is |
We can create a bug report for NVIDIA or/and post this small example in the NVIDIA forum. I searched for alignment restriction for the stack but only find what I also posted in the other issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44948 |
What I found in the PTX ISA document from NVIDIA in 5.1.1 is
Your laptop device is |
It is possible that the arch version has something to do with this issue. From the document you posted, I don't find any alignment requirement in general, as all of that refers to PTX code which is generated by nvcc, so nvcc should also handle alignment issues. If at all, one could say that large structs that should be moved at once might benefit from (manual) alignment. But I think any benefits there might be outweighted by the additional memory usage (and transfer) incurred by e.g. aligning a 44 byte struct to a 64 byte boundary. |
I found this for gcc 4.3^^ https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html It says that we can ask the compiler for the maximal use full alignment.
And in C++11 we can check the alignment with
If I check it I get always |
My solution for a fix in PMacc is: #include <boost/align/alignment_of.hpp>
namespace pmacc
{
struct MaxAlignTestObject
{
char x[128] __attribute__ ((aligned));
};
typedef boost::alignment::alignment_of<MaxAlignTestObject> max_align_t;
}
#define PMACC_POW2_ALIGNMENT(byte) \
((byte)==1?1: \
((byte)<=2?2: \
((byte)<=4?4: \
((byte)<=8?8: \
((byte)<=16?16: \
((byte)<=32?32: \
((byte)<=64?64:128 \
)))))))
#define __optimal_align__(byte) \
__align__( \
PMACC_POW2_ALIGNMENT(byte) <= pmacc::max_align_t::value ? \
PMACC_POW2_ALIGNMENT(byte) : \
pmacc::max_align_t::value \
)
#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var |
I checked my solution and I get the same alignments on host and device. |
That seems not optimal. This would mean a maximum alignment of 16, but CUDA supports 32Byte vector loads/stores. |
But the problem is that it is not save to give a type which is aligned to >=32byte to a function. |
I think this is not correct. Documentation for the attribute you used states Side note: >= 64 byte is unsafe (currently), not >= 32 byte as of my experiments. |
I agree with you but I can't find how we can get the maximal alignment, defined by the ABI. |
I found here that the stack for x86-64 is aligned to 16 byte. |
I'd say testing: We know >32 byte is not advantageous on the GPU due to the maximum vector size. Yes, stack is 16byte aligned. That does not mean that params passed on the stack need to be 16byte aligned. Pass a 256Byte aligned struct on the stack -> stack is aligned to 16byte. |
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - add type to get a architecture depending useful alignment - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
needs an upstream (nvidia cuda bugtracker) report to get more information, e.g., if 32byte alignment is save or not. |
note from @psychocoderHPC: We should precise our question to ask (and check again with CUDA 8.0):
Update: submitted as bug ID |
I could even see this with 64 bytes on k80 with cuda 7.0 although this seems to be unreliable to reproduce. |
good information, thanks! |
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - add type to get a architecture depending useful alignment - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - add type to get a architecture depending useful alignment - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - add type to get a architecture depending useful alignment - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
let us leave this issue open for now to collect further feedback from nvidia, so we know what kind of "above 16 Byte" tunings we can apply again |
With example from Alex, // expected output:
// value 1 2
// no error
#include <cstdio>
#ifndef ARRAY_SIZE
#define ARRAY_SIZE 65
#endif
struct Bar{char v[ARRAY_SIZE];};
struct Foo{
Bar bar __align__(ALIGN);
};
__global__ void test(int i, Foo foo, int value){
printf("value %i %i\n", i, value);
}
int main(){
int value=2;
Foo foo;
test<<<1,1>>>(1, foo, value);
printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
}
/*
Run on SM 3.X devices: nvcc -DALIGN=32 ok! nvcc -DALIGN=128 wrong! (value 1 2 line is missing)
Run on SM 2.0 devices you already get the unexpected behaviour when using -DALIGN=64 PTX code for both SM is 64:
64 (both systems):
.visible .entry _Z4testi3Fooi(
.param .u32 _Z4testi3Fooi_param_0,
.param .align 64 .b8 _Z4testi3Fooi_param_1[128],
.param .u32 _Z4testi3Fooi_param_2
)
128:
.visible .entry _Z4testi3Fooi(
.param .u32 _Z4testi3Fooi_param_0,
.param .align 128 .b8 _Z4testi3Fooi_param_1[128],
.param .u32 _Z4testi3Fooi_param_2
)
/* the issue was reproduced locally by the support and assigned to a developer in 09/2016. Today I pinged the support again. Response:
|
Proposed answer with @psychocoderHPC: As soon as one uses an array of struct, SIMD access to it should be aligned for optimal access. Our objects are generic (in number and size of struct arguments) and used both on host and device. The same (struct) object that might be stored (aligned) in an array can also be used as a scalar and passed to a kernel. |
we are currently aligning up to nevertheless, it looks from the support answer that actually only up to 16 is guaranteed to work on nvidia GPUs... |
Got news from our ticket (Bug ID 1809741) today: Note: at the time of writing, CUDA 9.1 is the last release. |
There should be no upper limit anymore in CUDA and PTX errors are fixed in CUDA 11.0+. |
reported by @Flamefire here:
Just found an interesting issue while testing. Executing the following code yields in semi-random results:
This is basically a proof-of concept of a bug caused by alignment. I get the same ABI-change warning when compiling this with nvcc (7.0, 7.5) and g++4-8. In my more complex case I don't get that warning although the behaviour is the same (semi-random results, big struct somewhere)
This does not happen, if the maximum alignment is set to 32 bytes. It also does not happen, if there is another 32-byte aligned param before value. If that param is more or less aligned than 32 bytes then the bug does happen again.
The text was updated successfully, but these errors were encountered: