
int overflow in ReduceOps.eval(Box, ...) #3453

Open
AlexanderSinn opened this issue Jul 27, 2023 · 4 comments

Comments

@AlexanderSinn
Member

If the number of cells in a Box exceeds the maximum int, eval won't work correctly; in my case it did not loop over all elements and gave no error. A 64-bit index type should be used instead of int.

```cpp
int ncells = box.numPts();
```

and

```cpp
int ncells = box.numPts();
```

@WeiqunZhang
Member

We have the same issue in ParallelFor. It wasn't really an issue before, because GPUs did not have that much memory, but now it has become one. Presumably switching to long will use more registers.

@AlexanderSinn
Member Author

AlexanderSinn commented Jul 27, 2023

In my case there isn't an allocation the size of the box; I need to count how many particles need to be initialized. If the additional registers are a problem, one could split the box inside .eval() so that every chunk has fewer than INT_MAX cells. However, in my testing the register count usually has only a small impact compared to optimizing the memory access pattern.
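Splitting could work roughly like this. A host-side sketch (the helper name and the (offset, count) shape are ours, not AMReX API) that cuts a 64-bit cell count into chunks small enough for the existing int-based indexing:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper: split n cells into chunks of at most INT_MAX cells.
// Each chunk is (64-bit start offset, 32-bit count); the kernel would add
// the offset to its 32-bit local index to recover the global cell index.
std::vector<std::pair<std::int64_t,int>> split_cells (std::int64_t n)
{
    constexpr std::int64_t chunk = std::numeric_limits<int>::max();
    std::vector<std::pair<std::int64_t,int>> out;
    for (std::int64_t off = 0; off < n; off += chunk) {
        out.emplace_back(off, static_cast<int>(std::min(chunk, n - off)));
    }
    return out;
}
```

.eval() could then launch once per chunk and combine the partial reduction results, at the cost of a few extra kernel launches for very large boxes.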

For now, I can use

```cpp
reduce_op.eval(
    domain_box.numPts(), reduce_data,
    [=] AMREX_GPU_DEVICE (amrex::Long idx) -> ReduceTuple
    {
        auto [i, j, k] = domain_box.atOffset3d(idx).arr;
        ...
    });
```

@WeiqunZhang
Member

We could add some assertions first before we decide what to do.
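Such a check could be as small as this. A plain-C++ sketch (the function name is ours; inside AMReX it would presumably sit behind the existing assertion macros):

```cpp
#include <cstdint>
#include <limits>

// Hypothetical guard before an int-indexed launch: fail loudly instead of
// silently skipping cells once the count no longer fits a 32-bit index.
inline bool fits_int (std::int64_t npts)
{
    return npts <= std::numeric_limits<int>::max();
}
```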

@AlexanderSinn
Member Author

I looked at a few things in Compiler Explorer, and it seems the main reason a 64-bit icell would be slower and use more registers is the 64-bit integer division by lenxy and lenx. The A100 has no division instructions, so division has to be emulated. For 64 bit this is so expensive that the generated assembly even checks whether it can fall back to a 32-bit division (also emulated). Since the divisor is the same for all threads, the CPU could help by precomputing some values so that the division on the GPU is replaced with a multiplication. I found a library that does that: https://github.com/NVIDIA/cutlass/blob/2a9fa23e06b1fc0b7fab7a3e29ff1b17e325da7f/include/cutlass/fast_math.h#L404-L488

Maybe we could do something similar. I haven't tested the performance, but I could see this being faster than the current version using 32-bit division. It needs uint128_t for the precomputation, so it's not super easy to copy.
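For reference, the core of the trick can be sketched in a few lines. This is the 32-bit-dividend variant (precompute c = ceil(2^64 / d) once on the host; each division becomes one high-half multiply on the device, i.e. __umul64hi instead of emulated division). The 64-bit-dividend version the box indexing needs is the same idea but, as noted, requires uint128_t in the precomputation. This is our illustration, not cutlass code; it assumes a compiler with unsigned __int128 (GCC/Clang) and a divisor d >= 2 (for d == 1 the multiplier wraps to 0):

```cpp
#include <cstdint>

// Precomputed reciprocal for a fixed 32-bit divisor d:
//   c = ceil(2^64 / d), computed once on the host.
struct FastDiv {
    std::uint64_t c;
    std::uint32_t d;
};

FastDiv make_fast_div (std::uint32_t d)
{
    // UINT64_MAX / d + 1 == ceil(2^64 / d) for any d >= 2.
    return { UINT64_MAX / d + 1, d };
}

// n / d becomes a 64x64 -> 128-bit multiply plus a shift; the result is
// exact for every 32-bit n (and every divisor d >= 2).
std::uint32_t fast_div (std::uint32_t n, FastDiv f)
{
    return std::uint32_t(((unsigned __int128) f.c * n) >> 64);
}
```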
