Align > 32 bytes #1563

ax3l · 2016-08-29T09:54:08Z

Just found an interesting issue while testing. Executing the following code yields in semi-random results:

#include <curand_kernel.h>

#define __optimal_align__(byte)   \
        __align__(                \
        ((byte)==1?1:             \
        ((byte)<=2?2:             \
        ((byte)<=4?4:             \
        ((byte)<=8?8:             \
        ((byte)<=16?16:           \
        ((byte)<=32?32:           \
        ((byte)<=64?64:128        \
        ))))))))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var

struct Foo{
PMACC_ALIGN(rng, curandStateXORWOW_t);
};

    template<class T_RNG, class T_Mapper>
    __global__ void
    testRNG(T_RNG rng, T_Mapper mapper)
    {
        if(!threadIdx.x)
            printf("value %i\n", mapper);
    }

int main(){
    int value=1;
    Foo foo;
    testRNG<<<1,1>>>(foo, value);
    cudaDeviceSynchronize();
}

This is basically a proof-of concept of a bug caused by alignment. I get the same ABI-change warning when compiling this with nvcc (7.0, 7.5) and g++4-8. In my more complex case I don't get that warning although the behaviour is the same (semi-random results, big struct somewhere)

This does not happen, if the maximum alignment is set to 32 bytes. It also does not happen, if there is another 32-byte aligned param before value. If that param is more or less aligned than 32 bytes then the bug does happen again.

psychocoderHPC · 2016-09-01T07:40:21Z

@Flamefire I can't reproduce any errors with the given example. Could you please try it again?

I used the following modules on K80 and k20:

  1) mpfr/3.1.2                    5) cmake/3.3.0                   9) hdf5-parallel/1.8.14
  2) mpc/1.0.1                     6) cuda/7.0                     10) libsplash/1.4.0
  3) gmp/5.1.1                     7) openmpi/1.8.4.kepler.cuda70  11) numactl/2.0.7
  4) gcc/4.8.2                     8) boost/1.56.0                 12) valgrind/3.8.1

Flamefire · 2016-09-01T08:23:12Z

I can't get onto the cluster queues so I verified this locally:
Cuda compilation tools, release 7.0, V7.0.27
Cuda compilation tools, release 7.5, V7.5.17
g++ (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5

Result:

nvcc paramOverwrite2.cu 
paramOverwrite2.cu(31): warning: variable "foo" is used before its value is set

paramOverwrite2.cu(31): warning: variable "foo" is used before its value is set

paramOverwrite2.cu: In function ‘void testRNG(T_RNG, T_Mapper) [with T_RNG = Foo; T_Mapper = int]’:
paramOverwrite2.cu:22:1: note: The ABI for passing parameters with 64-byte alignment has changed in GCC 4.6
     testRNG(T_RNG rng, T_Mapper mapper)
 ^

 ./a.out 
value -71320066

psychocoderHPC · 2016-09-01T08:24:26Z

Could it be an driver issue?

Flamefire · 2016-09-01T09:11:11Z

Ok I was able to test it on k80 now:

Cuda compilation tools, release 7.0, V7.0.27
g++-4.8.2 (GCC) 4.8.2
nvcc paramAlign.cu 
paramAlign.cu(31): warning: variable "foo" is used before its value is set

paramAlign.cu(31): warning: variable "foo" is used before its value is set

paramAlign.cu: In Funktion »void testRNG(T_RNG, T_Mapper) [mit T_RNG = Foo; T_Mapper = int]«:
paramAlign.cu:22:1: Anmerkung: Das ABI der Parameterübergabe mit 64-Byte-Ausrichtung hat sich in GCC 4.6 geändert
     testRNG(T_RNG rng, T_Mapper mapper)
 ^
grund59@kepler028-gsi:~$ ./a.out 
/// NO OUTPUT! Check says "Invalid device ordinal"

CUDA 7.5 shows no error for this case. However I was able to reproduce this behaviour also for 7.5 by using struct Bar{char v[65];}; instead of the curandStateXORWOW_t which results in 128 byte alignment.

Made an example that tests this for all sizes:

#include <iostream>
#include <cstdio>

#define __optimal_align__(byte)   \
        __align__(                \
        ((byte)==1?1:             \
        ((byte)<=2?2:             \
        ((byte)<=4?4:             \
        ((byte)<=8?8:             \
        ((byte)<=16?16:           \
        ((byte)<=32?32:           \
        ((byte)<=64?64:128        \
        ))))))))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var

template<size_t N>
struct array{ char v[N];};

template<size_t T_N>
struct Foo{
static const size_t N = T_N;
PMACC_ALIGN(dummy,array<N>);
};

template<class T_RNG, class T_Mapper>
__global__ void
testKernel(T_RNG rng, T_Mapper mapper)
{
    unsigned N = T_RNG::N;
    printf("%u value %i\n", N, mapper);
}

template<size_t N>
struct CallKernel{
    void operator()(int value){
        CallKernel<N-1>()(value);
        testKernel<<<1,1>>>(Foo<N>(), value);
    }
};
template<>
struct CallKernel<0>
{
    void operator()(int){}
};

int main(){
    int value=1;
    CallKernel<129>()(value);
    cudaError_t code = cudaDeviceSynchronize();
    if(code != cudaSuccess)
        std::cout << "ERROR: " << cudaGetErrorString(code) << std::endl;
}

psychocoderHPC · 2016-09-01T13:12:22Z

It look like it is not allowed to align over the cache line size.

Get cache line size in linux

getconf LEVEL1_DCACHE_LINESIZE

Flamefire · 2016-09-01T14:42:21Z

I don't think so. Output at my laptop is 64, but failures happen at 64 byte alignment. It also does not explain why CUDA 7.5 works for 64 but 7.0 does not.

psychocoderHPC · 2016-09-01T15:16:48Z

We can create a bug report for NVIDIA or/and post this small example in the NVIDIA forum.

I searched for alignment restriction for the stack but only find what I also posted in the other issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44948

psychocoderHPC · 2016-09-01T15:29:43Z

@Flamefire

What I found in the PTX ISA document from NVIDIA in 5.1.1 is

Registers differ from the other state spaces in that they are not fully addressable, i.e.,
 it is not possible to refer to the address of a register. When compiling to use the Application Binary 
Interface (ABI), register variables are restricted to function scope and may not be declared at module 
scope. When compiling legacy PTX code (ISA versions prior to 3.0) containing module-scoped .reg 
variables, the compiler silently disables use of the ABI. Registers may have alignment boundaries 
required by multi-word loads and stores. - See more at: http://docs.nvidia.com/cuda/parallel-thread-
execution/index.html#sthash.p4EqxveA.dpuf

Your laptop device is sm_20 maybe therefore you have also problems with 64byte.

Flamefire · 2016-09-01T17:48:14Z

It is possible that the arch version has something to do with this issue. From the document you posted, I don't find any alignment requirement in general, as all of that refers to PTX code which is generated by nvcc, so nvcc should also handle alignment issues. If at all, one could say that large structs that should be moved at once might benefit from (manual) alignment. But I think any benefits there might be outweighted by the additional memory usage (and transfer) incurred by e.g. aligning a 44 byte struct to a 64 byte boundary.
Also maximum vector size for a load operation is 128bit=32 byte. So I think any alignment above 32 bytes is counterproductive.

psychocoderHPC · 2016-09-02T11:51:21Z

I found this for gcc 4.3^^ https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Attributes.html

It says that we can ask the compiler for the maximal use full alignment.

struct Foo
{
   short array[256] __attribute__ ((aligned));
}

And in C++11 we can check the alignment with

//...
std::cout<<alignof(Foo)<<std::endl;

If I check it I get always 16 as result.

psychocoderHPC · 2016-09-02T12:13:21Z

My solution for a fix in PMacc is:

#include <boost/align/alignment_of.hpp>

namespace pmacc
{
    struct MaxAlignTestObject
    {
        char x[128] __attribute__ ((aligned));
    };

    typedef boost::alignment::alignment_of<MaxAlignTestObject> max_align_t;
}

#define PMACC_POW2_ALIGNMENT(byte) \
    ((byte)==1?1:             \
    ((byte)<=2?2:             \
    ((byte)<=4?4:             \
    ((byte)<=8?8:             \
    ((byte)<=16?16:           \
    ((byte)<=32?32:           \
    ((byte)<=64?64:128        \
    )))))))

#define __optimal_align__(byte)                                   \
        __align__(                                                \
        PMACC_POW2_ALIGNMENT(byte) <= pmacc::max_align_t::value ? \
            PMACC_POW2_ALIGNMENT(byte) :                          \
            pmacc::max_align_t::value                             \
        )

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var

psychocoderHPC · 2016-09-02T12:20:09Z

I checked my solution and I get the same alignments on host and device.
The maximal alignment on hypnos k20 (64bit System) is 16 byte.

Flamefire · 2016-09-02T12:22:53Z

That seems not optimal. This would mean a maximum alignment of 16, but CUDA supports 32Byte vector loads/stores.
I'd just reduce the max alignment to 32 bytes and reference the CUDA guide.

psychocoderHPC · 2016-09-02T12:30:11Z

But the problem is that it is not save to give a type which is aligned to >=32byte to a function.
For that I would prefer a "save" alignment for the auto align method and if the user knows that the type is never used as parameter he/she can align is by hand.

Flamefire · 2016-09-02T12:36:49Z

I think this is not correct. Documentation for the attribute you used states align a variable or field to the maximum useful alignment for the target machine you are compiling for. Note the "useful" here.
It does not state, that it is the maximum allowed value. You can still use higher alignments, you just have to make sure that interfaces are ABI compatible, which they are if they are compiled from the same source, compilers and switches.
Our problem here is, that nvcc expects a different alignment strategy than g++ uses for >=64 byte alignment.
Your approach would even fail for this scenario:
Compiling for 512Bit SIMD host (looking at Xeon Phis) will lead to 64 byte alignment (which is the maximum usefull value here) and break when passing that to the gpu which expects a different alignment.

Side note: >= 64 byte is unsafe (currently), not >= 32 byte as of my experiments.

psychocoderHPC · 2016-09-02T12:44:30Z

I agree with you but I can't find how we can get the maximal alignment, defined by the ABI.

psychocoderHPC · 2016-09-02T12:49:41Z

I found here that the stack for x86-64 is aligned to 16 byte.

Flamefire · 2016-09-02T13:04:29Z

I'd say testing: We know >32 byte is not advantageous on the GPU due to the maximum vector size.
So we can use that. If we wanted to use more, we'd need to work-around or avoid issues like "nvcc with g++>=3.4"
If we want to be conservative we could say min(32, pmacc::max_align_t::value) as we are unaware of any problems.

Yes, stack is 16byte aligned. That does not mean that params passed on the stack need to be 16byte aligned. Pass a 256Byte aligned struct on the stack -> stack is aligned to 16byte.

close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - add type to get a architecture depending useful alignment - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`

ax3l · 2016-09-05T12:09:46Z

needs an upstream (nvidia cuda bugtracker) report to get more information, e.g., if 32byte alignment is save or not.

ax3l · 2016-09-05T13:54:12Z

note from @psychocoderHPC: We should precise our question to ask (and check again with CUDA 8.0):

Is a copy-by-value struct parameter to a `__global__` kernel call, aligned larger (>)
then 16 byte save/well-defined?
Are their any ABI restrictions regarding the fact that parameter stacks on
most (host) systems are 16 byte aligned?

We see problems with 128 byte alignment on our cluster (K80) and even with
64 byte alignment on a sm_20 device (mini-example above; icc & gcc).

Update: submitted as bug ID 1809741

Flamefire · 2016-09-05T14:42:43Z

I could even see this with 64 bytes on k80 with cuda 7.0 although this seems to be unreliable to reproduce.

ax3l · 2016-09-06T08:29:56Z

good information, thanks!
but probably only CUDA 7.5+ will be relevant for upstream requests.

close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - add type to get a architecture depending useful alignment - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`

close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`

ax3l · 2016-09-07T14:25:45Z

let us leave this issue open for now to collect further feedback from nvidia, so we know what kind of "above 16 Byte" tunings we can apply again

ax3l · 2017-04-19T13:20:06Z

With example from Alex,

// expected output:
//   value 1 2
//   no error
#include <cstdio>
#ifndef ARRAY_SIZE
#define ARRAY_SIZE 65
#endif

struct Bar{char v[ARRAY_SIZE];};

struct Foo{
    Bar bar __align__(ALIGN);
};

__global__ void test(int i, Foo foo, int value){
    printf("value %i %i\n", i, value);
}

int main(){
    int value=2;
    Foo foo;
    test<<<1,1>>>(1, foo, value);
    printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize()));
}
/*
Run on SM 3.X devices: nvcc -DALIGN=32 ok! nvcc -DALIGN=128 wrong! (value 1 2 line is missing)
Run on SM 2.0 devices you already get the unexpected behaviour when using -DALIGN=64 PTX code for both SM is 64:
64 (both systems):
.visible .entry _Z4testi3Fooi(
        .param .u32 _Z4testi3Fooi_param_0,
        .param .align 64 .b8 _Z4testi3Fooi_param_1[128],
        .param .u32 _Z4testi3Fooi_param_2
)

128:
.visible .entry _Z4testi3Fooi(
        .param .u32 _Z4testi3Fooi_param_0,
        .param .align 128 .b8 _Z4testi3Fooi_param_1[128],
        .param .u32 _Z4testi3Fooi_param_2
)
/*

the issue was reproduced locally by the support and assigned to a developer in 09/2016.

Today I pinged the support again. Response:

[..] By the way, could you please tell us why you need such large alignment (>64, or even >16)
for ByVal struct kernel parameters so that we can understand ?

ax3l · 2017-04-19T14:06:07Z

Proposed answer with @psychocoderHPC:

As soon as one uses an array of struct, SIMD access to it should be aligned for optimal access. Our objects are generic (in number and size of struct arguments) and used both on host and device. The same (struct) object that might be stored (aligned) in an array can also be used as a scalar and passed to a kernel.

ax3l · 2017-05-24T14:04:46Z

we are currently aligning up to 32 as we saw no problems up to that size and think that's conservative to run on all platforms (host & device).

nevertheless, it looks from the support answer that actually only up to 16 is guaranteed to work on nvidia GPUs...

ax3l · 2018-02-07T08:35:27Z

Got news from our ticket (Bug ID 1809741) today:
The problem will be fixed in the next CUDA release, hurray!

Note: at the time of writing, CUDA 9.1 is the last release.

ax3l · 2020-12-01T20:29:17Z

There should be no upper limit anymore in CUDA and PTX errors are fixed in CUDA 11.0+.
Please report if there are still errors/issues with this.

ax3l added the question label Aug 29, 2016

ax3l added this to the 0.2.0: Open Beta milestone Aug 29, 2016

ax3l assigned psychocoderHPC Aug 29, 2016

ax3l added bug a bug in the project's code component: PMacc in PMacc and removed question labels Aug 29, 2016

psychocoderHPC mentioned this issue Sep 5, 2016

Data Alignment for Kernel Parameters: 16 Byte #1566

Merged

3 tasks

Flamefire mentioned this issue Sep 5, 2016

Warning: ABI change for 32-byte alignment #1553

Closed

ax3l mentioned this issue Sep 5, 2016

Collection of CUDA Bugs and Work Arounds #652

Closed

ax3l modified the milestones: Next Stable: 0.3.0 / 1.0.0, 0.2.0: Open Beta Sep 7, 2016

ax3l added component: third party third party libraries that are shipped and/or linked bug a bug in the project's code and removed bug a bug in the project's code labels Sep 7, 2016

ax3l modified the milestones: Future, 0.3.0: Release Candidate May 24, 2017

ax3l added the backend: cuda CUDA backend label Feb 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Align > 32 bytes #1563

Align > 32 bytes #1563

ax3l commented Aug 29, 2016

psychocoderHPC commented Sep 1, 2016 •

edited

Loading

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 1, 2016

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 1, 2016

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 1, 2016 •

edited

Loading

psychocoderHPC commented Sep 1, 2016 •

edited

Loading

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 2, 2016 •

edited

Loading

psychocoderHPC commented Sep 2, 2016

psychocoderHPC commented Sep 2, 2016

Flamefire commented Sep 2, 2016

psychocoderHPC commented Sep 2, 2016

Flamefire commented Sep 2, 2016

psychocoderHPC commented Sep 2, 2016 •

edited

Loading

psychocoderHPC commented Sep 2, 2016

Flamefire commented Sep 2, 2016

ax3l commented Sep 5, 2016

ax3l commented Sep 5, 2016 •

edited

Loading

Flamefire commented Sep 5, 2016

ax3l commented Sep 6, 2016 •

edited

Loading

ax3l commented Sep 7, 2016

ax3l commented Apr 19, 2017 •

edited

Loading

ax3l commented Apr 19, 2017

ax3l commented May 24, 2017 •

edited

Loading

ax3l commented Feb 7, 2018 •

edited

Loading

ax3l commented Dec 1, 2020

Align > 32 bytes #1563

Align > 32 bytes #1563

Comments

ax3l commented Aug 29, 2016

psychocoderHPC commented Sep 1, 2016 • edited Loading

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 1, 2016

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 1, 2016

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 1, 2016 • edited Loading

psychocoderHPC commented Sep 1, 2016 • edited Loading

Flamefire commented Sep 1, 2016

psychocoderHPC commented Sep 2, 2016 • edited Loading

psychocoderHPC commented Sep 2, 2016

psychocoderHPC commented Sep 2, 2016

Flamefire commented Sep 2, 2016

psychocoderHPC commented Sep 2, 2016

Flamefire commented Sep 2, 2016

psychocoderHPC commented Sep 2, 2016 • edited Loading

psychocoderHPC commented Sep 2, 2016

Flamefire commented Sep 2, 2016

ax3l commented Sep 5, 2016

ax3l commented Sep 5, 2016 • edited Loading

Flamefire commented Sep 5, 2016

ax3l commented Sep 6, 2016 • edited Loading

ax3l commented Sep 7, 2016

ax3l commented Apr 19, 2017 • edited Loading

ax3l commented Apr 19, 2017

ax3l commented May 24, 2017 • edited Loading

ax3l commented Feb 7, 2018 • edited Loading

ax3l commented Dec 1, 2020

psychocoderHPC commented Sep 1, 2016 •

edited

Loading

psychocoderHPC commented Sep 1, 2016 •

edited

Loading

psychocoderHPC commented Sep 1, 2016 •

edited

Loading

psychocoderHPC commented Sep 2, 2016 •

edited

Loading

psychocoderHPC commented Sep 2, 2016 •

edited

Loading

ax3l commented Sep 5, 2016 •

edited

Loading

ax3l commented Sep 6, 2016 •

edited

Loading

ax3l commented Apr 19, 2017 •

edited

Loading

ax3l commented May 24, 2017 •

edited

Loading

ax3l commented Feb 7, 2018 •

edited

Loading