Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Alignment for Kernel Parameters: 16 Byte #1566

Merged

Conversation

psychocoderHPC
Copy link
Member

@psychocoderHPC psychocoderHPC commented Sep 5, 2016

close #1563 and close #1553

- add type to get a architecture depending useful alignment

  • change __optimal_align__ based on discussion Align > 32 bytes #1563
  • add pre processor macro PMACC_ROUND_UP_NEXT_POW2

Tests:

CC-ing: @Flamefire

@psychocoderHPC psychocoderHPC added bug a bug in the project's code component: PMacc in PMacc labels Sep 5, 2016
@psychocoderHPC psychocoderHPC added this to the 0.2.0: Open Beta milestone Sep 5, 2016
* @param value integral number between [1,Inf]
* @return next higher pow 2 value
*/
#define PMACC_ROUND_UP_NEXT_POW2(value) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to mention, that this is for UNSIGNED numbers only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx I will extent the description

@Flamefire
Copy link
Contributor

Note: I still don't see why we should limit the alignment to the maximum useful alignment on the host if the purpose of the macro is to align structs for the device.

@psychocoderHPC
Copy link
Member Author

The limitation to the useful alignment is added to solve #1553.
Than we are on the save side. 32byte alignment prints the note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6

@Flamefire
Copy link
Contributor

Ah ok, I though it was 64 bytes and up.
However I think this is still not the correct approach. We know that 32 byte alignment causes problems at gcc 4.6+. But this solution relies on the fact, that the "optimal" align chosen by the compiler is less then 32 byte or the ABI happens to still be the same. So I'd rather have 16 Bytes as the hard maximum and not introduce another dependency on GCC. Especially as icc is mentioned in https://github.com/ComputationalRadiationPhysics/picongpu/blob/release-0.2.0/INSTALL.md

@ax3l
Copy link
Member

ax3l commented Sep 5, 2016

Why should this introduce a dependency on GCC? It's just ABI incompatibility within binaries created from various versions of GCC itself.

@Flamefire
Copy link
Contributor

@ax3l Because __attribute__ ((aligned)) is GCC only(?)

@ax3l
Copy link
Member

ax3l commented Sep 5, 2016

but that has nothing to do with the wish for 16byte hard limit.

the other point (syntax) is addressed above, he will check.

@Flamefire
Copy link
Contributor

I meant: This fix uses a GCC-only feature to work around an issue. Moreover it relies on a specific evaluation of that feature (namely that the optimal align is less than what the ABI was changed for).
The same workaround can be done by using a 16byte hard-limit which a) avoids both of this problems: No GCC syntax and no guessing on how that feature is evaluated.
And as mentioned: That "optimal" align is meaningless as we align for the GPU but that align (might) be optimal for CPU

@psychocoderHPC
Copy link
Member Author

psychocoderHPC commented Sep 5, 2016

Tested with with icc 15.2, gcc 4.8.2 and clang 3.4

#include <iostream>
#include <cstdio>
#include <boost/align/alignment_of.hpp>

#define PMACC_MIN(x,y) (((x)<=(y))?x:y)

namespace PMacc
{
    /** object to test for a useful alignment
     *
     * The compiler auto align the member array to a useful architecture
     * depending value.
     */
    struct UsefulAlignTestObject
    {
        char x[512] __attribute__ ((aligned));
    };

    /** type which defines a useful alignment for the architecture */
    typedef boost::alignment::alignment_of<UsefulAlignTestObject> useful_align_t;
}

#define PMACC_ROUND_UP_NEXT_POW2(value) \
        ((value)==1?1:                  \
        ((value)<=2?2:                  \
        ((value)<=4?4:                  \
        ((value)<=8?8:                  \
        ((value)<=16?16:                \
        ((value)<=32?32:                \
        ((value)<=64?64:128             \
        )))))))

#define __optimal_align__(byte)                                                \
    __attribute__((aligned(                                                     \
        PMACC_MIN(                                                             \
            /* 32 byte is the L2 cache line size of NVIDIA GPUs */             \
            PMACC_MIN(PMACC_ROUND_UP_NEXT_POW2(byte),32),                      \
            PMacc::useful_align_t::value                                       \
        )                                                                      \
    )))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var
template<size_t N>
struct array{ char v[N];

array()
{}

array(const array& a)
{
  for(int i=0;i<N;++i)
    v[i]=a.v[i];
}
};

template<size_t T_N>
struct Foo{
static const size_t N = T_N;
PMACC_ALIGN(dummy,array<N> );
Foo()
{
for(int i=0;i<N;++i)
   dummy.v[i]=i;
}

};
template<size_t N>
struct CallKernel{
    void operator()(int value){
        CallKernel<N-1>()(value);
        Foo<N> foo;
        int size = sizeof(Foo<N>);
        printf("host %i %i\n", N,size);
    }
};
template<>
struct CallKernel<0>
{
    void operator()(int){}
};

int main(){
    int value=1;
    CallKernel<128>()(value);
}

@Flamefire
Copy link
Contributor

Ok so this is not GCC-only syntax? Then only remaining question is the maximum alignment safely supported for CUDA params. (see #1553) (besides that I still see no use for host-optimal alignment for GPUs)

@psychocoderHPC
Copy link
Member Author

Normally we need two alignment options one for fitting a object on stack (as parameter) and one for dynamic objects (e.g. Frames).
If we know that a object is never passed copy by value to a function than we can go to higher alignment up to 128 byte.

Side note: Currently I am testing if 16 byte influenced the speed.

@Flamefire
Copy link
Contributor

I don't think it is that easy. First: Alignment introduces overhead. The positions e.g. are not stored as SoA but as AoS, so with the current alignment a bunch of threads will load 4 values/thread but will only use 3 of them. With 4/8 byte alignment the L2 cache should be used much better (using the model: all threads load x -> cacheline is loaded with x,y,z,unused for each element, all load y (from cache), all load z (from cache), but quite some cache is wasted as all those unused values had to be loaded, of course a 16/32 byte vector load might be possible with aligned data)
Even worse for the 44 byte random structs which get aligned to 64 bytes. This wastes bandwith and memory especially with increasing alignment.
Second: Vector loads are 32 byte max. So why align more?

@psychocoderHPC
Copy link
Member Author

Why align to more than 32byte: Because 128bytes is read out of the global RAM of a GPU with each memory access.

@psychocoderHPC
Copy link
Member Author

I tested the 16byte alignment vs the old and I cant see any speedup or slowdown. It looks like we are save with 16byte.

@Flamefire
Copy link
Contributor

Flamefire commented Sep 5, 2016

Why align to more than 32byte: Because 128bytes is read out of the global RAM of a GPU with each memory access.

Using that for arrays of structs would be pretty bad though. Assume 80 bytes structs in an array and you get a lot more requests than you need.16 byte should be enough as this is the maximum request size according to http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Note: I also found the explanation for the correctness through alignment there: If mallocMC returns an unaligned (or less than 16b) base address and we would use that with non-manually aligned types it would result in errors.

@psychocoderHPC
Copy link
Member Author

psychocoderHPC commented Sep 6, 2016

Heating Electron-Positron KHI:
heat_posel

@psychocoderHPC psychocoderHPC force-pushed the fix-optimalAlign branch 2 times, most recently from 94cf783 to b95e863 Compare September 7, 2016 10:09
@psychocoderHPC
Copy link
Member Author

offline discussion with @axel:

  • currently we cut our alignment to 16byte ( this avoid's bug Align > 32 bytes #1563)
  • this pull request is only a bugfix
    • we can tune our alignment for each type depending of the usage in a separate pull request, if needed
    • with C++11 we can check each kernel parameter to full fill the alignment at compile time

@psychocoderHPC
Copy link
Member Author

I removed the calculation of the useful alignment from this pull request.

@ax3l ax3l changed the title fix data alignment Data Alignment for Kernel Parameters: 16 Byte Sep 7, 2016
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553

- change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563
- add pre processor macro `PMACC_ROUND_UP_NEXT_POW2`
@Flamefire
Copy link
Contributor

@ax3l Please comment to latest changes. Can't see you updates ;)
Re "must vs should align" see my comment: #1566 (comment)
If our arrays are uniform (only floats or doubles) at least in size AND mallocMC returns aligned base offsets then it is a "should". For non-uniform it could be a "must" and for non-aligned base it IS a "must"

@ax3l
Copy link
Member

ax3l commented Sep 7, 2016

I just helped him integrate a comment of yours he forgot :)
See here: #1566 (diff)

@ax3l ax3l closed this Sep 7, 2016
@ax3l ax3l reopened this Sep 7, 2016
@ax3l ax3l merged commit 3477300 into ComputationalRadiationPhysics:dev Sep 7, 2016
@ax3l
Copy link
Member

ax3l commented Sep 7, 2016

thank you both for looking into the problem!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug a bug in the project's code component: PMacc in PMacc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants