
int overflow in ReduceOps.eval(Box, ...) #3453

Open
AlexanderSinn opened this issue Jul 27, 2023 · 4 comments

Comments

@AlexanderSinn
Member

If the number of cells in a Box exceeds the maximum int, eval won't work correctly; in my case it did not loop over all elements and gave no error. A 64-bit index type should be used instead of int.

```cpp
int ncells = box.numPts();
```

and

```cpp
int ncells = box.numPts();
```

@WeiqunZhang
Member

We have the same issue in ParallelFor. It wasn't really an issue before, because GPUs did not have that much memory, but now it has become one. Presumably switching to long will use more registers.

@AlexanderSinn
Member Author

AlexanderSinn commented Jul 27, 2023

In my case there isn't an allocation the size of the box; I need to count how many particles need to be initialized. If the additional registers are a problem, one could split the box inside .eval() so that every chunk has fewer than INT_MAX cells. However, in my testing the register count usually has only a small impact compared to optimizing the memory access pattern.
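Splitting could work roughly like this. A host-side sketch (the helper name and the (offset, count) shape are ours, not AMReX API) that cuts a 64-bit cell count into chunks small enough for the existing int-based indexing:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper: split n cells into chunks of at most INT_MAX cells.
// Each chunk is (64-bit start offset, 32-bit count); the kernel would add
// the offset to its 32-bit local index to recover the global cell index.
std::vector<std::pair<std::int64_t,int>> split_cells (std::int64_t n)
{
    constexpr std::int64_t chunk = std::numeric_limits<int>::max();
    std::vector<std::pair<std::int64_t,int>> out;
    for (std::int64_t off = 0; off < n; off += chunk) {
        out.emplace_back(off, static_cast<int>(std::min(chunk, n - off)));
    }
    return out;
}
```

.eval() could then launch once per chunk and combine the partial reduction results, at the cost of a few extra kernel launches for very large boxes.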

For now, I can use

```cpp
reduce_op.eval(
    domain_box.numPts(), reduce_data,
    [=] AMREX_GPU_DEVICE (amrex::Long idx) -> ReduceTuple
    {
        auto [i, j, k] = domain_box.atOffset3d(idx).arr;
        ...
    });
```

@WeiqunZhang
Member

We could add some assertions first before we decide what to do.
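Such a check could be as small as this. A plain-C++ sketch (the function name is ours; inside AMReX it would presumably sit behind the existing assertion macros):

```cpp
#include <cstdint>
#include <limits>

// Hypothetical guard before an int-indexed launch: fail loudly instead of
// silently skipping cells once the count no longer fits a 32-bit index.
inline bool fits_int (std::int64_t npts)
{
    return npts <= std::numeric_limits<int>::max();
}
```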

@AlexanderSinn
Member Author

I looked at a few things in Compiler Explorer, and it seems the main reason a 64-bit icell would be slower and use more registers is the 64-bit integer division by lenxy and lenx. The A100 has no division instructions, so division has to be emulated. For 64 bit this is so expensive that the generated assembly even checks whether it can fall back to a 32-bit division (also emulated). Since the divisor is the same for all threads, the CPU could help by precomputing some values so that the division on the GPU is replaced with a multiplication. I found a library that does that: https://github.com/NVIDIA/cutlass/blob/2a9fa23e06b1fc0b7fab7a3e29ff1b17e325da7f/include/cutlass/fast_math.h#L404-L488

Maybe we could do something similar. I haven't tested the performance, but I could see this being faster than the current version using 32-bit division. It needs uint128_t for the precomputation, so it's not super easy to copy.
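For reference, the core of the trick can be sketched in a few lines. This is the 32-bit-dividend variant (precompute c = ceil(2^64 / d) once on the host; each division becomes one high-half multiply on the device, i.e. __umul64hi instead of emulated division). The 64-bit-dividend version the box indexing needs is the same idea but, as noted, requires uint128_t in the precomputation. This is our illustration, not cutlass code; it assumes a compiler with unsigned __int128 (GCC/Clang) and a divisor d >= 2 (for d == 1 the multiplier wraps to 0):

```cpp
#include <cstdint>

// Precomputed reciprocal for a fixed 32-bit divisor d:
//   c = ceil(2^64 / d), computed once on the host.
struct FastDiv {
    std::uint64_t c;
    std::uint32_t d;
};

FastDiv make_fast_div (std::uint32_t d)
{
    // UINT64_MAX / d + 1 == ceil(2^64 / d) for any d >= 2.
    return { UINT64_MAX / d + 1, d };
}

// n / d becomes a 64x64 -> 128-bit multiply plus a shift; the result is
// exact for every 32-bit n (and every divisor d >= 2).
std::uint32_t fast_div (std::uint32_t n, FastDiv f)
{
    return std::uint32_t(((unsigned __int128) f.c * n) >> 64);
}
```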
