Data Alignment for Kernel Parameters: 16 Byte #1566
 * @param value integral number between [1,Inf]
 * @return next higher pow 2 value
 */
#define PMACC_ROUND_UP_NEXT_POW2(value) \
You need to mention that this is for UNSIGNED numbers only.
Thx, I will extend the description.
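To illustrate the point about unsigned input, here is a minimal sketch. The macro body is copied from the test listing in this thread; the wrapper `roundUpNextPow2` is a hypothetical helper (not part of PMacc) that forces the argument to an unsigned type before the comparisons happen.

```cpp
#include <cstddef>

// Macro as quoted in this thread: rounds a value up to the next power of two.
// Everything above 64 yields 128, so the result saturates at 128.
#define PMACC_ROUND_UP_NEXT_POW2(value) \
    ((value)==1?1:                      \
    ((value)<=2?2:                      \
    ((value)<=4?4:                      \
    ((value)<=8?8:                      \
    ((value)<=16?16:                    \
    ((value)<=32?32:                    \
    ((value)<=64?64:128                 \
    )))))))

// Hypothetical helper (assumption, not PMacc API): taking std::size_t
// guarantees that the macro only ever sees unsigned input.
constexpr std::size_t roundUpNextPow2(std::size_t value)
{
    return PMACC_ROUND_UP_NEXT_POW2(value);
}

// Intended use: unsigned sizes in [1,Inf].
static_assert(roundUpNextPow2(1) == 1, "1 is already a power of two");
static_assert(roundUpNextPow2(3) == 4, "3 rounds up to 4");
static_assert(roundUpNextPow2(12) == 16, "12 rounds up to 16");
static_assert(roundUpNextPow2(65) == 128, "65 rounds up to 128");

// Signed pitfall: a negative argument passes the (value)<=2 test
// and silently yields 2 instead of an error.
static_assert(PMACC_ROUND_UP_NEXT_POW2(-5) == 2, "negative input yields 2");
```

The checks run at compile time, so the snippet only needs to be compiled as C++11 or later.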
Note: I still don't see why we should limit the alignment to the maximum useful alignment on the host if the purpose of the macro is to align structs for the device.
The limitation to the useful alignment is added to solve #1553.
Ah ok, I thought it was 64 bytes and up.
Why should this introduce a dependency on GCC? It's just ABI incompatibility within binaries created from various versions of GCC itself.
@ax3l Because
But that has nothing to do with the wish for a 16 byte hard limit. The other point (syntax) is addressed above; he will check.
I meant: This fix uses a GCC-only feature to work around an issue. Moreover, it relies on a specific evaluation of that feature (namely that the optimal alignment is less than what the ABI was changed for).
Tested with the following code:

```cpp
#include <cstddef>
#include <cstdio>
#include <boost/align/alignment_of.hpp>

// arguments are parenthesized so that expressions can be passed safely
#define PMACC_MIN(x,y) (((x)<=(y))?(x):(y))

namespace PMacc
{
    /** object to test for a useful alignment
     *
     * The compiler auto-aligns the member array to a useful,
     * architecture-dependent value.
     */
    struct UsefulAlignTestObject
    {
        char x[512] __attribute__ ((aligned));
    };

    /** type which defines a useful alignment for the architecture */
    typedef boost::alignment::alignment_of<UsefulAlignTestObject> useful_align_t;
}

#define PMACC_ROUND_UP_NEXT_POW2(value) \
    ((value)==1?1:                      \
    ((value)<=2?2:                      \
    ((value)<=4?4:                      \
    ((value)<=8?8:                      \
    ((value)<=16?16:                    \
    ((value)<=32?32:                    \
    ((value)<=64?64:128                 \
    )))))))

#define __optimal_align__(byte)                                        \
    __attribute__((aligned(                                            \
        PMACC_MIN(                                                     \
            /* 32 byte is the L2 cache line size of NVIDIA GPUs */     \
            PMACC_MIN(PMACC_ROUND_UP_NEXT_POW2(byte),32),              \
            PMacc::useful_align_t::value                               \
        )                                                              \
    )))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var

/** byte array with a user-defined copy constructor */
template<size_t N>
struct array
{
    char v[N];

    array()
    {}

    array(const array& a)
    {
        for(size_t i = 0; i < N; ++i)
            v[i] = a.v[i];
    }
};

/** object with one aligned member of N byte */
template<size_t T_N>
struct Foo
{
    static const size_t N = T_N;
    PMACC_ALIGN(dummy, array<N>);

    Foo()
    {
        for(size_t i = 0; i < N; ++i)
            dummy.v[i] = static_cast<char>(i);
    }
};

/** print sizeof(Foo<n>) for n = 1..N, smallest first */
template<size_t N>
struct CallKernel
{
    void operator()(int value)
    {
        CallKernel<N-1>()(value);
        Foo<N> foo;
        (void)foo; // only instantiated, never read
        // %zu is the matching format specifier for size_t
        printf("host %zu %zu\n", N, sizeof(Foo<N>));
    }
};

/** end of the recursion */
template<>
struct CallKernel<0>
{
    void operator()(int){}
};

int main()
{
    int value = 1;
    CallKernel<128>()(value);
}
```
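The arithmetic behind `__optimal_align__` in the listing above can be sketched as plain `constexpr` functions. This is only an illustration: `minOf`, `roundUpNextPow2` and `optimalAlign` are hypothetical names, and the useful host alignment is passed in as a parameter instead of being queried from the compiler (16 is assumed here as a typical value for x86-64 with GCC, which may differ on other targets).

```cpp
#include <cstddef>

// Hypothetical helpers mirroring PMACC_MIN and PMACC_ROUND_UP_NEXT_POW2.
constexpr std::size_t minOf(std::size_t a, std::size_t b)
{
    return a <= b ? a : b;
}

constexpr std::size_t roundUpNextPow2(std::size_t v)
{
    return v == 1 ? 1 : v <= 2 ? 2 : v <= 4 ? 4 : v <= 8 ? 8
         : v <= 16 ? 16 : v <= 32 ? 32 : v <= 64 ? 64 : 128;
}

// Same evaluation order as the macro: round the object size up to a power
// of two, cap it at 32 byte, then cap it at the useful host alignment.
constexpr std::size_t optimalAlign(std::size_t bytes, std::size_t usefulAlign)
{
    return minOf(minOf(roundUpNextPow2(bytes), 32), usefulAlign);
}

// Assumed useful alignment of 16 byte:
static_assert(optimalAlign(3, 16) == 4, "3 byte object -> 4 byte alignment");
static_assert(optimalAlign(12, 16) == 16, "12 byte object -> 16 byte alignment");
static_assert(optimalAlign(100, 16) == 16, "large objects are capped by the host value");
// With a larger assumed useful alignment the 32 byte cap takes over:
static_assert(optimalAlign(100, 64) == 32, "capped at 32 byte");
```

With a useful alignment of 16 the 32 byte cap never becomes active, which is why the host value matters for the discussion here.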
Ok, so this is not GCC-only syntax? Then the only remaining question is the maximum alignment safely supported for CUDA params (see #1553). Besides that, I still see no use for host-optimal alignment for GPUs.
Normally we need two alignment options: one for fitting an object on the stack (as a parameter) and one for dynamic objects (e.g. Frames). Side note: I am currently testing whether 16 byte influences the speed.
I don't think it is that easy. First: alignment introduces overhead. The positions, e.g., are not stored as SoA but as AoS, so with the current alignment a bunch of threads will load 4 values per thread but only use 3 of them. With 4/8 byte alignment the L2 cache should be used much better. Using the model: all threads load x -> the cache line is loaded with x, y, z, unused for each element; all load y (from cache); all load z (from cache). Quite some cache is wasted, as all those unused values had to be loaded. Of course, a 16/32 byte vector load might be possible with aligned data.
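The "load 4 values, use 3" overhead can be made concrete with a small sketch. The two position types are hypothetical (not PMacc types), `alignas` stands in for the alignment macro, and `float` is assumed to be 4 byte, as on common platforms.

```cpp
#include <cstddef>

// Hypothetical 3D position with natural (4 byte) alignment.
struct PositionPacked
{
    float x, y, z;
};

// Same payload, but forced to 16 byte alignment: the compiler
// appends 4 byte of padding so that array elements stay aligned.
struct alignas(16) PositionAligned16
{
    float x, y, z;
};

// Fraction of every array element that is padding.
constexpr double paddingFraction(std::size_t payload, std::size_t stride)
{
    return static_cast<double>(stride - payload) / static_cast<double>(stride);
}

static_assert(sizeof(PositionPacked) == 12, "3 floats, no padding");
static_assert(sizeof(PositionAligned16) == 16, "padded to 4 floats");
static_assert(paddingFraction(sizeof(PositionPacked), sizeof(PositionAligned16)) == 0.25,
              "a quarter of each loaded element is unused");
```

In an AoS layout this means one in four loaded values is padding, which is the cache waste described above; the aligned variant in exchange permits a single 16 byte vector load per element.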
Why align to more than 32 byte: because 128 bytes are read out of the global RAM of a GPU with each memory access.
I tested the 16 byte alignment vs. the old one and I can't see any speedup or slowdown. It looks like we are safe with 16 byte.
Using that for arrays of structs would be pretty bad though. Assume 80 byte structs in an array and you get a lot more requests than you need. 16 byte should be enough, as this is the maximum request size according to http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses Note: I also found the explanation for the correctness through alignment there: if mallocMC returns an unaligned (or less than 16 byte aligned) base address and we use that with non-manually aligned types, it would result in errors.
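The 80 byte example can be checked with a short sketch. `paddedSize` and the two struct types are hypothetical names introduced for illustration, and `alignas` is used in place of the alignment macro.

```cpp
#include <cstddef>

// Size of one array element after padding to a multiple of `align`.
constexpr std::size_t paddedSize(std::size_t size, std::size_t align)
{
    return ((size + align - 1) / align) * align;
}

// Hypothetical 80 byte payload with two different alignments.
struct alignas(16) Payload80Align16
{
    char data[80];
};

struct alignas(128) Payload80Align128
{
    char data[80];
};

// 80 is already a multiple of 16, so no padding is added.
static_assert(paddedSize(80, 16) == 80, "no padding with 16 byte alignment");
static_assert(sizeof(Payload80Align16) == 80, "compiler agrees");

// With 128 byte alignment every element grows to 128 byte,
// i.e. 48 byte of padding per element in an array.
static_assert(paddedSize(80, 128) == 128, "48 byte padding with 128 byte alignment");
static_assert(sizeof(Payload80Align128) == 128, "compiler agrees");
```

Under these assumptions an array of such structs occupies 60 % more memory with 128 byte alignment than with 16 byte alignment, and all of the extra bytes are padding.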
5a60bf2 to 7f4f4d8
94cf783 to b95e863
offline discussion with @axel:
b95e863 to 9e5412e
I removed the calculation of the useful alignment from this pull request.
9e5412e to c4e6d67
close ComputationalRadiationPhysics#1563 and close ComputationalRadiationPhysics#1553
- change `__optimal_align__` based on discussion ComputationalRadiationPhysics#1563
- add preprocessor macro `PMACC_ROUND_UP_NEXT_POW2`
c4e6d67 to 014632b
@ax3l Please comment on the latest changes. I can't see your updates ;)
I just helped him integrate a comment of yours he forgot :)
Thank you both for looking into the problem!
close #1563 and close #1553
- add type to get an architecture-dependent useful alignment
- change `__optimal_align__` based on discussion "Align > 32 bytes" #1563
- add preprocessor macro `PMACC_ROUND_UP_NEXT_POW2`
Tests:
CC-ing: @Flamefire