
ParallelFor for Reduction #1658

Merged: 3 commits merged into AMReX-Codes:development on Dec 21, 2020

Conversation

WeiqunZhang
Member

@WeiqunZhang WeiqunZhang commented Dec 18, 2020

Summary

Add capability for ParallelFor to safely do reduction using deviceReduceSum,
Min, etc. The user passes Gpu::KernelInfo{}.setReduction(true) to notify
ParallelFor that this is a parallel reduction, and gives ParallelFor a callable
that takes Gpu::Handler. A Gpu::Handler is needed to call
deviceReduceSum.

Also add Gpu::Buffer class, whose data pointer can be used as a device
destination for deviceReduceSum. It also has a copyToHost method to
copy the device result back to the host.

See Tutorials/GPU/ParallelReduce for examples of how to use ParallelFor
for reduction.

Also note that the reduction function is safe with OpenMP CPU threads. Thus the
same code can run with OpenMP when it is not built for GPU.
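
For orientation, a minimal usage sketch based on the summary above (the Array4 named fab and the Gpu::Buffer initializer-list constructor are assumptions here; see Tutorials/GPU/ParallelReduce for the actual examples):

    // Assumes box is a Box and fab is an Array4<Real const> for the data to sum.
    Gpu::Buffer<Real> sum_buf({Real(0.0)});   // device destination, initialized to zero
    Real* dp = sum_buf.data();

    amrex::ParallelFor(Gpu::KernelInfo{}.setReduction(true), box,
    [=] AMREX_GPU_DEVICE (int i, int j, int k, Gpu::Handler const& handler) noexcept
    {
        Gpu::deviceReduceSum(dp, fab(i,j,k), handler);
    });

    Real sum = *(sum_buf.copyToHost());       // copy the reduced result back to the host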

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • are described in the proposed changes to the AMReX documentation, if appropriate

@WeiqunZhang
Member Author

@maxpkatz This is partially working. See https://github.com/AMReX-Codes/amrex/pull/1658/files#diff-efaba8e2495aecd6b06358ff8079d4336c8ef5f8c66b26c7b66e0e2cc91bac92R303 for examples.

The remaining issue is that GpuFuse has to be turned off; I still need to figure that out.

Another issue is that this is not the best way to handle OpenMP. But the user could choose to write separate code for OpenMP if performance is a concern there.

@WeiqunZhang
Member Author

Once we get fuse to work and it is verified with various versions of CUDA, I will test it with HIP, and then implement it for DPC++.

@maximumcats
Member

> @maxpkatz This is partially working. See https://github.com/AMReX-Codes/amrex/pull/1658/files#diff-efaba8e2495aecd6b06358ff8079d4336c8ef5f8c66b26c7b66e0e2cc91bac92R303 for examples.

Thanks, this design is what I was looking for.

> Another issue is that this is not the best way to handle OpenMP. But the user could choose to write separate code for OpenMP if performance is a concern there.

Yes, agreed. In my experimentation with Castro, as long as you're reducing over O(1) values, just using atomics was not a performance blocker. (Of course, that's not necessarily true for every code.) We had to do more sophisticated OpenMP-specific reductions for arrays, but there we already have a special case anyway, so this doesn't really add more complexity.

@WeiqunZhang
Member Author

@maxpkatz If we require the lambda function passed to ParallelFor for reduction to have an extra argument (Gpu::Handler), we can get rid of the if (box.contains(i,j,k)) test the user has to do. We can store information in Handler, and deviceReduce* will then be able to know whether it's safe to do a block-level sync. If not, it can still do a warp-level reduction first for each full warp; for the rest, we do atomics onto global memory. (For CUDA, even a partial warp can do a reduction with a mask. I don't know whether it's worth it, and I have seen performance issues with the masked versions of shfl before.) We can also have an assertion in debug mode if KernelInfo says there is a reduction but the lambda does not take Gpu::Handler.

@maximumcats
Member

maximumcats commented Dec 19, 2020

I agree with that from the perspective of the safety of deviceReduceSum, and despite our previous conversation, after thinking about it more I don't object strongly to the four-argument lambda version and would be fine using it.

From the perspective of avoiding the box contains test, I assume what you're thinking is that if the user writes something like

    deviceReduceSum(dst, array, handler);

then we can use information about the box extent to avoid accessing the invalid indices of array. In more complicated cases the user might do something like

    Real tmp = 2.0 * array(i,j,k);
    deviceReduceSum(dst, tmp, handler);

and then it's back on them to do the contains test. But insofar as your proposal makes things strictly more robust from the AMReX side, I'm on board.

@WeiqunZhang
Member Author

It's always safe to call deviceReduceSum(dst, tmp, handler) in user code, unless the user calls it a different number of times for different valid cells. We will not call the user's lambda with out-of-bounds i,j,k. The handler stores how many threads in this block called the user-provided lambda.
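
A hedged sketch of that idea (illustrative only, not the code in this PR; the member name numActiveThreads is made up here to stand for whatever the handler actually records):

    template <typename T>
    AMREX_GPU_DEVICE AMREX_FORCE_INLINE
    void deviceReduceSum (T* dest, T source, Gpu::Handler const& handler) noexcept
    {
        if (handler.numActiveThreads >= (int)blockDim.x) {
            // Every thread in this block entered the user's lambda, so a
            // block-level reduction with a full block sync is safe.
            source = Gpu::blockReduce<Gpu::Device::warp_size>
                (source, Gpu::warpReduce<Gpu::Device::warp_size,T,amrex::Plus<T> >(), (T)0);
            if (threadIdx.x == 0) { Gpu::Atomic::AddNoRet(dest, source); }
        } else {
            // Partial block: fall back to atomics onto global memory so that
            // threads that never entered the lambda are not required to participate.
            Gpu::Atomic::AddNoRet(dest, source);
        }
    }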

@maximumcats
Member

OK. I think I get what you're saying, and it's consistent with my original code sketch, where deviceReduceSum explicitly had information about the box extent and the current index (which I was not very careful with).

template <typename T>
AMREX_GPU_DEVICE AMREX_FORCE_INLINE
void deviceReduceSum (T * dest, T source, const int& i, const Box& bx) noexcept
{
    // Only do a block reduction if there are at least blockDim.x threads to sum over
    // Otherwise, do the remainder with atomics
    if (bx.length(0) % blockDim.x == 0 || i < bx.bigEnd(0) - blockDim.x) {
        source = Gpu::blockReduce<Gpu::Device::warp_size>
            (source, Gpu::warpReduce<Gpu::Device::warp_size,T,amrex::Plus<T> >(), (T)0);
        if (threadIdx.x == 0) Gpu::Atomic::AddNoRet(dest, source);
    }
    else {
        Gpu::Atomic::AddNoRet(dest, source);
    }
}

If the handler gives us enough information to figure this out, so we can do partial reductions on the partial blocks and full reductions with syncthreads on the full blocks, then it's worth considering. Indeed, we're still relying on the user to call the same number of deviceReduceSum instances in the same order on every thread.

@WeiqunZhang WeiqunZhang changed the title [WIP] ParallelFor for Reduction ParallelFor for Reduction Dec 21, 2020
@WeiqunZhang WeiqunZhang marked this pull request as ready for review December 21, 2020 00:15
Member

@atmyers atmyers left a comment

I get a compilation error when I try to build Tutorials/GPU/ParallelReduce with USE_CUDA=FALSE:

In file included from ../../../Src/Base/AMReX_GpuLaunch.H:166,
                 from ../../../Src/Base/AMReX_Gpu.H:19,
                 from ../../../Src/Base/AMReX_Reduce.H:5,
                 from ../../../Src/Base/AMReX_BaseFab.H:32,
                 from ../../../Src/Base/AMReX_FArrayBox.H:7,
                 from ../../../Src/Base/AMReX_MultiFab.H:10,
                 from main.cpp:5:
../../../Src/Base/AMReX_GpuLaunchFunctsC.H: In instantiation of ‘void amrex::ParallelFor(const amrex::Box&, L&&) [with L = main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>]’:
../../../Src/Base/AMReX_GpuLaunchFunctsC.H:76:16:   required from ‘void amrex::ParallelFor(const amrex::Gpu::KernelInfo&, const amrex::Box&, L&&) [with L = main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>]’
main.cpp:83:14:   required from here
../../../Src/Base/AMReX_GpuLaunchFunctsC.H:69:10: error: no match for call to ‘(main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>) (int&, int&, int&)’
         f(i,j,k);
         ~^~~~~~~
main.cpp:78:85: note: candidate: ‘main_main()::<lambda(int, int, int, const amrex::Gpu::Handler&)>’
             [=] AMREX_GPU_DEVICE (int i, int j, int k, Gpu::Handler const& handler) noexcept
                                                                                     ^~~~~~~~
main.cpp:78:85: note:   candidate expects 4 arguments, 3 provided

Is that expected?

Review comment on Tutorials/GPU/ParallelReduce/main.cpp (outdated, resolved)
Co-authored-by: Andrew Myers <atmyers2@gmail.com>
@WeiqunZhang
Member Author

Oh, I forgot about the CPU version.

@WeiqunZhang
Member Author

OK, the CPU version is fixed now.
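
For reference, a rough sketch of one way the CPU-only ParallelFor path could dispatch on the lambda's signature (an illustration under C++17 assumptions, not necessarily the actual fix in this PR; it also assumes Gpu::Handler is default-constructible on the host):

    #include <type_traits>

    template <typename L>
    void ParallelFor (Gpu::KernelInfo const&, Box const& box, L&& f) noexcept
    {
        const auto lo = amrex::lbound(box);
        const auto hi = amrex::ubound(box);
        for         (int k = lo.z; k <= hi.z; ++k) {
            for     (int j = lo.y; j <= hi.y; ++j) {
                for (int i = lo.x; i <= hi.x; ++i) {
                    if constexpr (std::is_invocable_v<L, int, int, int, Gpu::Handler const&>) {
                        f(i, j, k, Gpu::Handler{});   // reduction-style four-argument lambda
                    } else {
                        f(i, j, k);                   // plain three-argument lambda
                    }
                }
            }
        }
    }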

@WeiqunZhang WeiqunZhang merged commit 10ed0e0 into AMReX-Codes:development Dec 21, 2020
@WeiqunZhang WeiqunZhang deleted the parallel_for_reduce branch December 21, 2020 21:42