ParallelFor for Reduction #1658
Conversation
@maxpkatz This is partially working. See https://github.com/AMReX-Codes/amrex/pull/1658/files#diff-efaba8e2495aecd6b06358ff8079d4336c8ef5f8c66b26c7b66e0e2cc91bac92R303 for examples. The remaining issue is that GpuFuse has to be turned off; I still need to figure that out. Another issue is that this is not the best way to do OpenMP, but the user could choose to write separate code for OpenMP if its performance is an issue.
Once we get fuse to work and have verified it with various versions of CUDA, I will test it with HIP. Then I will implement it in DPCPP.
Thanks, this design is what I was looking for.
Yes, agreed. In my experimentation with Castro, as long as you're reducing over O(1) values, just using atomics was not a performance blocker. (Of course, that's not necessarily true for every code.) We had to do more sophisticated OpenMP-specific reductions for arrays, but there we already have a special case anyway, so this doesn't really add more complexity.
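[Editor's note: a minimal hypothetical sketch of the scalar-atomics pattern described above, reducing into an O(1) number of scalar destinations with plain OpenMP atomics. The function name and data layout are illustrative, not code from this PR.]

```cpp
#include <omp.h>

// Hypothetical sketch: each thread does an atomic update into a single
// scalar destination. With O(1) destinations this is typically cheap.
double sum_values (double const* data, int n)
{
    double sum = 0.0;
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double v = data[i];   // per-element work would go here
#pragma omp atomic
        sum += v;             // atomic update of the single destination
    }
    return sum;
}
```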
@maxpkatz If we require the lambda function passed to ParallelFor for reduction to have an extra argument (Gpu::Handler), we can make the reduction safe and potentially avoid the box contains test.
I agree with that from the perspective of safety. From the perspective of avoiding the box contains test, I assume what you're thinking is that if the user writes something like the sketch below, then we can use information about the box extent to avoid accessing the invalid indices of the array, and then it's back on them to do the deviceReduceSum correctly.
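[Editor's note: the code spans in the comment above did not survive extraction. A hypothetical reconstruction of the kind of user code being discussed might look like this; bx, fab, and dst are illustrative names, and the lambda signature is taken from the compile error quoted later in this thread.]

```cpp
// Hypothetical reconstruction; bx, fab, and dst are illustrative names.
// Because the lambda takes a Gpu::Handler, ParallelFor knows this is a
// reduction and can use the box extent to keep (i,j,k) inside valid
// indices, instead of the user writing an explicit bx.contains(i,j,k) test.
amrex::ParallelFor(amrex::Gpu::KernelInfo{}.setReduction(true), bx,
[=] AMREX_GPU_DEVICE (int i, int j, int k, amrex::Gpu::Handler const& handler) noexcept
{
    amrex::Gpu::deviceReduceSum(dst, fab(i,j,k), handler);
});
```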
It's always safe to call deviceReduceSum.
OK. I think I get what you're saying, and it's consistent with my original sketch of the code.
If with the handler we have enough information to figure this out, and we can then do partial reductions on the partial blocks and full reductions with syncthreads on the full blocks, then it's worth considering. Indeed, we're still relying on the user to actually call deviceReduceSum the same number of times in every thread.
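[Editor's note: a hypothetical sketch of the partial-block vs. full-block strategy discussed above, in CUDA style. This is not AMReX's actual implementation; nactive would come from the handler, and the block size and name are assumptions.]

```cpp
// Hypothetical sketch; not AMReX's actual implementation. Assumes at most
// 256 threads per block, a power-of-two block size, and sm_60+ for
// double-precision atomicAdd.
__device__ void device_reduce_sum (double* dest, double value, int nactive)
{
    if (nactive == int(blockDim.x)) {
        // Full block: every thread reaches the barrier, so __syncthreads
        // is safe and a shared-memory tree reduction can be used.
        __shared__ double sdata[256];
        sdata[threadIdx.x] = value;
        __syncthreads();
        for (unsigned int s = blockDim.x/2; s > 0; s /= 2) {
            if (threadIdx.x < s) { sdata[threadIdx.x] += sdata[threadIdx.x + s]; }
            __syncthreads();
        }
        if (threadIdx.x == 0) { atomicAdd(dest, sdata[0]); }
    } else {
        // Partial block: some threads are inactive, so barriers would
        // deadlock; fall back to one atomic per active thread.
        atomicAdd(dest, value);
    }
}
```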
I get a compilation error when I try to build Tutorials/GPU/ParallelReduce with USE_CUDA=FALSE:
In file included from ../../../Src/Base/AMReX_GpuLaunch.H:166,
from ../../../Src/Base/AMReX_Gpu.H:19,
from ../../../Src/Base/AMReX_Reduce.H:5,
from ../../../Src/Base/AMReX_BaseFab.H:32,
from ../../../Src/Base/AMReX_FArrayBox.H:7,
from ../../../Src/Base/AMReX_MultiFab.H:10,
from main.cpp:5:
../../../Src/Base/AMReX_GpuLaunchFunctsC.H: In instantiation of ‘void amrex::ParallelFor(const amrex::Box&, L&&) [with L = main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>]’:
../../../Src/Base/AMReX_GpuLaunchFunctsC.H:76:16: required from ‘void amrex::ParallelFor(const amrex::Gpu::KernelInfo&, const amrex::Box&, L&&) [with L = main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>]’
main.cpp:83:14: required from here
../../../Src/Base/AMReX_GpuLaunchFunctsC.H:69:10: error: no match for call to ‘(main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>) (int&, int&, int&)’
f(i,j,k);
~^~~~~~~
main.cpp:78:85: note: candidate: ‘main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>’
[=] AMREX_GPU_DEVICE (int i, int j, int k, Gpu::Handler const& handler) noexcept
^~~~~~~~
main.cpp:78:85: note: candidate expects 4 arguments, 3 provided
Is that expected?
Oh, I forgot about the CPU version.
OK. The CPU version is fixed now.
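[Editor's note: for context, the error above comes from the CPU path unconditionally calling f(i,j,k) on a four-argument lambda. One way a CPU fallback could dispatch on the lambda's arity is sketched below; this is illustrative and not necessarily how AMReX_GpuLaunchFunctsC.H implements the fix.]

```cpp
#include <type_traits>

// Minimal illustrative types; real code would use amrex::Box and
// amrex::Gpu::Handler.
struct Handler {};

template <typename L>
void parallel_for_cpu (int nx, int ny, int nz, L&& f)
{
    for (int k = 0; k < nz; ++k) {
    for (int j = 0; j < ny; ++j) {
    for (int i = 0; i < nx; ++i) {
        if constexpr (std::is_invocable_v<L&, int, int, int>) {
            f(i,j,k);                // plain 3-argument kernel
        } else {
            f(i,j,k, Handler{});     // reduction kernel expecting a handler
        }
    }}}
}
```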
Summary
Add the capability for ParallelFor to safely do a reduction using deviceReduceSum, deviceReduceMin, etc. The user passes Gpu::KernelInfo{}.setReduction(true) to notify ParallelFor that this is a parallel reduction, and gives ParallelFor a callable that takes a Gpu::Handler. A Gpu::Handler is needed to call deviceReduceSum.
Also add a Gpu::Buffer class, whose data pointer can be used as a device destination for deviceReduceSum. It also has a copyToHost method to copy the device result back to the host.
See Tutorials/GPU/ParallelReduce for examples of how to use ParallelFor for reduction; a usage sketch also follows below.
Also note that the reduction function is safe with OpenMP CPU threads, so the same code can run with OpenMP when it is not built for GPU.
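[Editor's note: based on the description above, usage might look like the following sketch. The real tutorial code lives in Tutorials/GPU/ParallelReduce; the function and variable names here are illustrative, and buf.data() is assumed to be the accessor for the device data pointer mentioned above.]

```cpp
#include <AMReX_Gpu.H>
#include <AMReX_MultiFab.H>

using namespace amrex;

// Illustrative sketch based on the summary above; see
// Tutorials/GPU/ParallelReduce for the actual tutorial code.
Real sum_multifab (MultiFab const& mf)
{
    Gpu::Buffer<Real> buf({Real(0.0)});   // device destination, initialized to 0
    Real* dp = buf.data();
    for (MFIter mfi(mf); mfi.isValid(); ++mfi) {
        Box const& bx = mfi.validbox();
        auto const& fab = mf.const_array(mfi);
        ParallelFor(Gpu::KernelInfo{}.setReduction(true), bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k, Gpu::Handler const& handler) noexcept
        {
            Gpu::deviceReduceSum(dp, fab(i,j,k), handler);
        });
    }
    Real const* hp = buf.copyToHost();    // copy the device result back to the host
    return hp[0];
}
```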
Checklist
The proposed changes: