
ParallelFor for Reduction #1658

Merged: 3 commits merged into AMReX-Codes:development on Dec 21, 2020

Conversation

WeiqunZhang
Member

@WeiqunZhang WeiqunZhang commented Dec 18, 2020

Summary

Add capability for ParallelFor to safely do reduction using deviceReduceSum,
Min, etc. The user passes Gpu::KernelInfo{}.setReduction(true) to notify
ParallelFor that this is a parallel reduction, and gives ParallelFor a callable
that takes Gpu::Handler. A Gpu::Handler is needed to call
deviceReduceSum.

Also add Gpu::Buffer class, whose data pointer can be used as a device
destination for deviceReduceSum. It also has a copyToHost method to
copy the device result back to the host.

See Tutorials/GPU/ParallelReduce for examples of how to use ParallelFor
for reduction.

Also note that the reduction function is safe with OpenMP CPU threads. Thus the
same code can run with OpenMP when it is not built for GPU.
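
For orientation, a minimal usage sketch based on the summary above (the Array4 named fab and the Gpu::Buffer initializer-list constructor are assumptions here; see Tutorials/GPU/ParallelReduce for the actual examples):

    // Assumes box is a Box and fab is an Array4<Real const> for the data to sum.
    Gpu::Buffer<Real> sum_buf({Real(0.0)});   // device destination, initialized to zero
    Real* dp = sum_buf.data();

    amrex::ParallelFor(Gpu::KernelInfo{}.setReduction(true), box,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, Gpu::Handler const& handler) noexcept
    {
        Gpu::deviceReduceSum(dp, fab(i,j,k), handler);
    });

    Real sum = *(sum_buf.copyToHost());       // copy the reduced result back to the host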

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • are described in the proposed changes to the AMReX documentation, if appropriate

@WeiqunZhang
Member Author

@maxpkatz This is partially working. See https://github.com/AMReX-Codes/amrex/pull/1658/files#diff-efaba8e2495aecd6b06358ff8079d4336c8ef5f8c66b26c7b66e0e2cc91bac92R303 for examples.

The remaining issue is that GpuFuse has to be turned off; I still need to figure that out.

Another issue is that this is not the best way to handle OpenMP. But the user could choose to write separate code for OpenMP if performance is a concern there.

@WeiqunZhang
Member Author

Once we get fuse to work and it is verified with various versions of CUDA, I will test it with HIP, and then implement it for DPC++.

@maximumcats
Member

> @maxpkatz This is partially working. See https://github.com/AMReX-Codes/amrex/pull/1658/files#diff-efaba8e2495aecd6b06358ff8079d4336c8ef5f8c66b26c7b66e0e2cc91bac92R303 for examples.

Thanks, this design is what I was looking for.

> Another issue is that this is not the best way to handle OpenMP. But the user could choose to write separate code for OpenMP if performance is a concern there.

Yes, agreed. In my experimentation with Castro, as long as you're reducing over O(1) values, just using atomics was not a performance blocker. (Of course, that's not necessarily true for every code.) We had to do more sophisticated OpenMP-specific reductions for arrays, but there we already have a special case anyway, so this doesn't really add more complexity.

@WeiqunZhang
Member Author

@maxpkatz If we require the lambda function passed to ParallelFor for reduction to have an extra argument (Gpu::Handler), we can get rid of the if (box.contains(i,j,k)) test the user has to do. We can store information in Handler, and deviceReduce* will then be able to know whether it's safe to do a block-level sync. If not, it can still do a warp-level reduction first for each full warp; for the rest, we do atomics onto global memory. (For CUDA, even a partial warp can do a reduction with a mask. I don't know whether it's worth it, and I have seen performance issues with the masked versions of shfl before.) We can also have an assertion in debug mode if KernelInfo says there is a reduction but the lambda does not take Gpu::Handler.

@maximumcats
Member

maximumcats commented Dec 19, 2020

I agree with that from the perspective of the safety of deviceReduceSum, and despite our previous conversation, after thinking about it more I don't object strongly to the four-argument lambda version and would be fine using it.

From the perspective of avoiding the box contains test, I assume what you're thinking is that if the user writes something like

    deviceReduceSum(dst, array, handler);

then we can use information about the box extent to avoid accessing the invalid indices of array. In more complicated cases the user might do something like

    Real tmp = 2.0 * array(i,j,k);
    deviceReduceSum(dst, tmp, handler);

and then it's back on them to do the contains test. But insofar as your proposal makes things strictly more robust from the AMReX side, I'm on board.

@WeiqunZhang
Member Author

It's always safe to call deviceReduceSum(dst, tmp, handler) in user code, unless the user calls it a different number of times for different valid cells. We will not call the user's lambda with out-of-bounds i,j,k. The handler stores how many threads in this block called the user-provided lambda.
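
A hedged sketch of that idea (illustrative only, not the code in this PR; the member name numActiveThreads is made up here to stand for whatever the handler actually records):

    template <typename T>
    AMREX_GPU_DEVICE AMREX_FORCE_INLINE
    void deviceReduceSum (T* dest, T source, Gpu::Handler const& handler) noexcept
    {
        if (handler.numActiveThreads >= (int)blockDim.x) {
            // Every thread in this block entered the user's lambda, so a
            // block-level reduction with a full block sync is safe.
            source = Gpu::blockReduce<Gpu::Device::warp_size>
                (source, Gpu::warpReduce<Gpu::Device::warp_size,T,amrex::Plus<T> >(), (T)0);
            if (threadIdx.x == 0) { Gpu::Atomic::AddNoRet(dest, source); }
        } else {
            // Partial block: fall back to atomics onto global memory so that
            // threads that never entered the lambda are not required to participate.
            Gpu::Atomic::AddNoRet(dest, source);
        }
    }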

@maximumcats
Member

OK. I think I get what you're saying, and it's consistent with my original code sketch, where deviceReduceSum explicitly had information about the box extent and the current index (which I was not very careful with).

template <typename T>
AMREX_GPU_DEVICE AMREX_FORCE_INLINE
void deviceReduceSum (T * dest, T source, const int& i, const Box& bx) noexcept
{
    // Only do a block reduction if there are at least blockDim.x threads to sum over
    // Otherwise, do the remainder with atomics
    if (bx.length(0) % blockDim.x == 0 || i < bx.bigEnd(0) - blockDim.x) {
        source = Gpu::blockReduce<Gpu::Device::warp_size>
            (source, Gpu::warpReduce<Gpu::Device::warp_size,T,amrex::Plus<T> >(), (T)0);
        if (threadIdx.x == 0) Gpu::Atomic::AddNoRet(dest, source);
    }
    else {
        Gpu::Atomic::AddNoRet(dest, source);
    }
}

If the handler gives us enough information to figure this out, so we can do partial reductions on the partial blocks and full reductions with syncthreads on the full blocks, then it's worth considering. Indeed, we're still relying on the user to call the same number of deviceReduceSum instances in the same order on every thread.

@WeiqunZhang WeiqunZhang changed the title [WIP] ParallelFor for Reduction ParallelFor for Reduction Dec 21, 2020
@WeiqunZhang WeiqunZhang marked this pull request as ready for review December 21, 2020 00:15
Member

@atmyers atmyers left a comment

I get a compilation error when I try to build Tutorials/GPU/ParallelReduce with USE_CUDA=FALSE:

In file included from ../../../Src/Base/AMReX_GpuLaunch.H:166,
                 from ../../../Src/Base/AMReX_Gpu.H:19,
                 from ../../../Src/Base/AMReX_Reduce.H:5,
                 from ../../../Src/Base/AMReX_BaseFab.H:32,
                 from ../../../Src/Base/AMReX_FArrayBox.H:7,
                 from ../../../Src/Base/AMReX_MultiFab.H:10,
                 from main.cpp:5:
../../../Src/Base/AMReX_GpuLaunchFunctsC.H: In instantiation of ‘void amrex::ParallelFor(const amrex::Box&, L&&) [with L = main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>]’:
../../../Src/Base/AMReX_GpuLaunchFunctsC.H:76:16:   required from ‘void amrex::ParallelFor(const amrex::Gpu::KernelInfo&, const amrex::Box&, L&&) [with L = main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>]’
main.cpp:83:14:   required from here
../../../Src/Base/AMReX_GpuLaunchFunctsC.H:69:10: error: no match for call to ‘(main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>) (int&, int&, int&)’
         f(i,j,k);
         ~^~~~~~~
main.cpp:78:85: note: candidate: ‘main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>’
             [=] AMREX_GPU_DEVICE (int i, int j, int k, Gpu::Handler const& handler) noexcept
                                                                                     ^~~~~~~~
main.cpp:78:85: note:   candidate expects 4 arguments, 3 provided

Is that expected?

Review comment on Tutorials/GPU/ParallelReduce/main.cpp (outdated, resolved)
Co-authored-by: Andrew Myers <atmyers2@gmail.com>
@WeiqunZhang
Member Author

Oh, I forgot about the CPU version.

@WeiqunZhang
Member Author

OK, the CPU version is fixed now.
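
For reference, a rough sketch of one way the CPU-only ParallelFor path could dispatch on the lambda's signature (an illustration under C++17 assumptions, not necessarily the actual fix in this PR; it also assumes Gpu::Handler is default-constructible on the host):

    #include <type_traits>

    template <typename L>
    void ParallelFor (Gpu::KernelInfo const&, Box const& box, L&& f) noexcept
    {
        const auto lo = amrex::lbound(box);
        const auto hi = amrex::ubound(box);
        for         (int k = lo.z; k <= hi.z; ++k) {
            for     (int j = lo.y; j <= hi.y; ++j) {
                for (int i = lo.x; i <= hi.x; ++i) {
                    if constexpr (std::is_invocable_v<L, int, int, int, Gpu::Handler const&>) {
                        f(i, j, k, Gpu::Handler{});   // reduction-style four-argument lambda
                    } else {
                        f(i, j, k);                   // plain three-argument lambda
                    }
                }
            }
        }
    }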

@WeiqunZhang WeiqunZhang merged commit 10ed0e0 into AMReX-Codes:development Dec 21, 2020
@WeiqunZhang WeiqunZhang deleted the parallel_for_reduce branch December 21, 2020 21:42