Data Alignment for Kernel Parameters: 16 Byte #1566

psychocoderHPC · 2016-09-05T09:56:03Z

close #1563 and close #1553

~~- add type to get a architecture depending useful alignment~~

change __optimal_align__ based on discussion Align > 32 bytes #1563
add pre processor macro PMACC_ROUND_UP_NEXT_POW2

Tests:

remove warning Warning: ABI change for 32-byte alignment #1553
KHI heating test
KHI speedtest (result: no speed decrease)

CC-ing: @Flamefire

Flamefire · 2016-09-05T10:14:48Z

src/libPMacc/include/ppFunctions.hpp

+ * @param value integral number between [1,Inf]
+ * @return next higher pow 2 value
+ */
+#define PMACC_ROUND_UP_NEXT_POW2(value) \


You need to mention, that this is for UNSIGNED numbers only.

thx I will extent the description

Flamefire · 2016-09-05T10:22:18Z

Note: I still don't see why we should limit the alignment to the maximum useful alignment on the host if the purpose of the macro is to align structs for the device.

psychocoderHPC · 2016-09-05T10:47:52Z

The limitation to the useful alignment is added to solve #1553.
Than we are on the save side. 32byte alignment prints the note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6

Flamefire · 2016-09-05T10:56:28Z

Ah ok, I though it was 64 bytes and up.
However I think this is still not the correct approach. We know that 32 byte alignment causes problems at gcc 4.6+. But this solution relies on the fact, that the "optimal" align chosen by the compiler is less then 32 byte or the ABI happens to still be the same. So I'd rather have 16 Bytes as the hard maximum and not introduce another dependency on GCC. Especially as icc is mentioned in https://github.com/ComputationalRadiationPhysics/picongpu/blob/release-0.2.0/INSTALL.md

ax3l · 2016-09-05T11:39:38Z

Why should this introduce a dependency on GCC? It's just ABI incompatibility within binaries created from various versions of GCC itself.

Flamefire · 2016-09-05T11:40:46Z

@ax3l Because __attribute__ ((aligned)) is GCC only(?)

ax3l · 2016-09-05T11:43:42Z

but that has nothing to do with the wish for 16byte hard limit.

the other point (syntax) is addressed above, he will check.

Flamefire · 2016-09-05T11:49:30Z

I meant: This fix uses a GCC-only feature to work around an issue. Moreover it relies on a specific evaluation of that feature (namely that the optimal align is less than what the ABI was changed for).
The same workaround can be done by using a 16byte hard-limit which a) avoids both of this problems: No GCC syntax and no guessing on how that feature is evaluated.
And as mentioned: That "optimal" align is meaningless as we align for the GPU but that align (might) be optimal for CPU

psychocoderHPC · 2016-09-05T12:22:46Z

Tested with with icc 15.2, gcc 4.8.2 and clang 3.4

#include <iostream>
#include <cstdio>
#include <boost/align/alignment_of.hpp>

#define PMACC_MIN(x,y) (((x)<=(y))?x:y)

namespace PMacc
{
    /** object to test for a useful alignment
     *
     * The compiler auto align the member array to a useful architecture
     * depending value.
     */
    struct UsefulAlignTestObject
    {
        char x[512] __attribute__ ((aligned));
    };

    /** type which defines a useful alignment for the architecture */
    typedef boost::alignment::alignment_of<UsefulAlignTestObject> useful_align_t;
}

#define PMACC_ROUND_UP_NEXT_POW2(value) \
        ((value)==1?1:                  \
        ((value)<=2?2:                  \
        ((value)<=4?4:                  \
        ((value)<=8?8:                  \
        ((value)<=16?16:                \
        ((value)<=32?32:                \
        ((value)<=64?64:128             \
        )))))))

#define __optimal_align__(byte)                                                \
    __attribute__((aligned(                                                     \
        PMACC_MIN(                                                             \
            /* 32 byte is the L2 cache line size of NVIDIA GPUs */             \
            PMACC_MIN(PMACC_ROUND_UP_NEXT_POW2(byte),32),                      \
            PMacc::useful_align_t::value                                       \
        )                                                                      \
    )))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var
template<size_t N>
struct array{ char v[N];

array()
{}

array(const array& a)
{
  for(int i=0;i<N;++i)
    v[i]=a.v[i];
}
};

template<size_t T_N>
struct Foo{
static const size_t N = T_N;
PMACC_ALIGN(dummy,array<N> );
Foo()
{
for(int i=0;i<N;++i)
   dummy.v[i]=i;
}

};
template<size_t N>
struct CallKernel{
    void operator()(int value){
        CallKernel<N-1>()(value);
        Foo<N> foo;
        int size = sizeof(Foo<N>);
        printf("host %i %i\n", N,size);
    }
};
template<>
struct CallKernel<0>
{
    void operator()(int){}
};

int main(){
    int value=1;
    CallKernel<128>()(value);
}

Flamefire · 2016-09-05T12:32:04Z

Ok so this is not GCC-only syntax? Then only remaining question is the maximum alignment safely supported for CUDA params. (see #1553) (besides that I still see no use for host-optimal alignment for GPUs)

psychocoderHPC · 2016-09-05T12:37:45Z

Normally we need two alignment options one for fitting a object on stack (as parameter) and one for dynamic objects (e.g. Frames).
If we know that a object is never passed copy by value to a function than we can go to higher alignment up to 128 byte.

Side note: Currently I am testing if 16 byte influenced the speed.

Flamefire · 2016-09-05T12:54:13Z

I don't think it is that easy. First: Alignment introduces overhead. The positions e.g. are not stored as SoA but as AoS, so with the current alignment a bunch of threads will load 4 values/thread but will only use 3 of them. With 4/8 byte alignment the L2 cache should be used much better (using the model: all threads load x -> cacheline is loaded with x,y,z,unused for each element, all load y (from cache), all load z (from cache), but quite some cache is wasted as all those unused values had to be loaded, of course a 16/32 byte vector load might be possible with aligned data)
Even worse for the 44 byte random structs which get aligned to 64 bytes. This wastes bandwith and memory especially with increasing alignment.
Second: Vector loads are 32 byte max. So why align more?

psychocoderHPC · 2016-09-05T13:27:38Z

Why align to more than 32byte: Because 128bytes is read out of the global RAM of a GPU with each memory access.

psychocoderHPC · 2016-09-05T13:28:36Z

I tested the 16byte alignment vs the old and I cant see any speedup or slowdown. It looks like we are save with 16byte.

Flamefire · 2016-09-05T15:09:10Z

Why align to more than 32byte: Because 128bytes is read out of the global RAM of a GPU with each memory access.

Using that for arrays of structs would be pretty bad though. Assume 80 bytes structs in an array and you get a lot more requests than you need.16 byte should be enough as this is the maximum request size according to http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Note: I also found the explanation for the correctness through alignment there: If mallocMC returns an unaligned (or less than 16b) base address and we would use that with non-manually aligned types it would result in errors.

psychocoderHPC · 2016-09-06T11:43:22Z

Heating Electron-Positron KHI:

psychocoderHPC · 2016-09-07T10:14:56Z

offline discussion with @axel:

currently we cut our alignment to 16byte ( this avoid's bug Align > 32 bytes #1563)
this pull request is only a bugfix
- we can tune our alignment for each type depending of the usage in a separate pull request, if needed
- with C++11 we can check each kernel parameter to full fill the alignment at compile time

psychocoderHPC · 2016-09-07T10:17:19Z

I removed the calculation of the useful alignment from this pull request.

close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`

Flamefire · 2016-09-07T13:21:06Z

@ax3l Please comment to latest changes. Can't see you updates ;)
Re "must vs should align" see my comment: #1566 (comment)
If our arrays are uniform (only floats or doubles) at least in size AND mallocMC returns aligned base offsets then it is a "should". For non-uniform it could be a "must" and for non-aligned base it IS a "must"

ax3l · 2016-09-07T13:29:52Z

I just helped him integrate a comment of yours he forgot :)
See here: #1566 (diff)

ax3l · 2016-09-07T14:24:01Z

thank you both for looking into the problem!

psychocoderHPC added bug a bug in the project's code component: PMacc in PMacc labels Sep 5, 2016

psychocoderHPC added this to the 0.2.0: Open Beta milestone Sep 5, 2016

psychocoderHPC assigned ax3l Sep 5, 2016

Flamefire reviewed Sep 5, 2016
View reviewed changes

psychocoderHPC force-pushed the fix-optimalAlign branch from 5a60bf2 to 7f4f4d8 Compare September 6, 2016 10:59

psychocoderHPC force-pushed the fix-optimalAlign branch 2 times, most recently from 94cf783 to b95e863 Compare September 7, 2016 10:09

psychocoderHPC force-pushed the fix-optimalAlign branch from b95e863 to 9e5412e Compare September 7, 2016 10:16

ax3l changed the title ~~fix data alignment~~ Data Alignment for Kernel Parameters: 16 Byte Sep 7, 2016

psychocoderHPC force-pushed the fix-optimalAlign branch from 9e5412e to c4e6d67 Compare September 7, 2016 12:52

fix data alignment

014632b

close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553 - change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563 - add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`

psychocoderHPC force-pushed the fix-optimalAlign branch from c4e6d67 to 014632b Compare September 7, 2016 12:53

ax3l closed this Sep 7, 2016

ax3l reopened this Sep 7, 2016

ax3l merged commit 3477300 into ComputationalRadiationPhysics:dev Sep 7, 2016

psychocoderHPC mentioned this pull request Jan 12, 2017

Warning: ABI change for 32-byte alignment #1553

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Alignment for Kernel Parameters: 16 Byte #1566

Data Alignment for Kernel Parameters: 16 Byte #1566

psychocoderHPC commented Sep 5, 2016 •

edited

Loading

Flamefire Sep 5, 2016

psychocoderHPC Sep 5, 2016

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

Flamefire commented Sep 5, 2016

ax3l commented Sep 5, 2016

Flamefire commented Sep 5, 2016

ax3l commented Sep 5, 2016

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016 •

edited

Loading

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

Flamefire commented Sep 5, 2016 •

edited

Loading

psychocoderHPC commented Sep 6, 2016 •

edited by ax3l

Loading

psychocoderHPC commented Sep 7, 2016

psychocoderHPC commented Sep 7, 2016

Flamefire commented Sep 7, 2016

ax3l commented Sep 7, 2016 •

edited

Loading

ax3l commented Sep 7, 2016

Data Alignment for Kernel Parameters: 16 Byte #1566

Data Alignment for Kernel Parameters: 16 Byte #1566

Conversation

psychocoderHPC commented Sep 5, 2016 • edited Loading

Flamefire Sep 5, 2016

Choose a reason for hiding this comment

psychocoderHPC Sep 5, 2016

Choose a reason for hiding this comment

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

Flamefire commented Sep 5, 2016

ax3l commented Sep 5, 2016

Flamefire commented Sep 5, 2016

ax3l commented Sep 5, 2016

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016 • edited Loading

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

Flamefire commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

psychocoderHPC commented Sep 5, 2016

Flamefire commented Sep 5, 2016 • edited Loading

psychocoderHPC commented Sep 6, 2016 • edited by ax3l Loading

psychocoderHPC commented Sep 7, 2016

psychocoderHPC commented Sep 7, 2016

Flamefire commented Sep 7, 2016

ax3l commented Sep 7, 2016 • edited Loading

ax3l commented Sep 7, 2016

psychocoderHPC commented Sep 5, 2016 •

edited

Loading

psychocoderHPC commented Sep 5, 2016 •

edited

Loading

Flamefire commented Sep 5, 2016 •

edited

Loading

psychocoderHPC commented Sep 6, 2016 •

edited by ax3l

Loading

ax3l commented Sep 7, 2016 •

edited

Loading