Piz Daint: CUDA memory error with random number #2357
Comments
Checking the hdf5 output of the hdf5 plugin, I found no particles.
The macro particle counter also reports zero particles.
With all debug output enabled, one sees that the error occurs after initialization, during the particle distribution according to the density profile. The error occurs in
The last verbose output message in
This issue comes from the random number generator used during random position initialization. Using quiet start solves the issue. Thus, @BeyondEspresso and @steindev, this will not be an issue with TWTS, since all random distributions will be generated on the CPU beforehand. @HighIander and @n01r, even when using quiet start, you will most likely encounter the same issue when using a probability-based ionization scheme. @ax3l or @psychocoderHPC: is setting
okay?
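To illustrate the difference between the two start modes discussed above, here is a minimal one-dimensional sketch. It is not PIConGPU code: the function names `random_start` and `quiet_start` are hypothetical, and the even spacing is only an assumed, simplified picture of a quiet start. The point is that the quiet variant needs no random number generator at all, so it avoids any RNG state or RNG initialization.

```python
import random


def random_start(particles_per_cell, rng):
    """In-cell positions drawn uniformly from [0, 1); requires an RNG."""
    return [rng.random() for _ in range(particles_per_cell)]


def quiet_start(particles_per_cell):
    """Evenly spaced in-cell positions; no random numbers are needed."""
    return [(i + 0.5) / particles_per_cell for i in range(particles_per_cell)]


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
print("random start:", random_start(4, rng))
print("quiet start: ", quiet_start(4))
```

Any step that is inherently probabilistic, such as the probability-based ionization mentioned above, would still need the RNG even if the initial positions come from a quiet start.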
Please increase the free memory in the file memory.param. sm_60 is correct for the P100.
@psychocoderHPC Thanks - setting
We should really integrate the "legacy" RNG for startups into our new state-aware RNG implementation to reduce the extra memory in
This will not help. The RNG initialization still needs local memory (lmem).
I am currently thinking about compiling for all architectures, checking the lmem usage by hand, and then keeping as much memory free as the worst-case architecture needs, multiplied by the number of SMX times the maximum number of parallel blocks per SMX.
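The estimate proposed above can be sketched as a small calculation. This is only an illustration under stated assumptions: the architecture names and all numbers in `ARCHITECTURES` are placeholders, not measured lmem usage, and "what one block needs" is assumed to be lmem per thread times threads per block.

```python
# All entries are illustrative placeholders, NOT measured values.
# name: (lmem per thread [bytes], threads per block,
#        number of multiprocessors, max. resident blocks per multiprocessor)
ARCHITECTURES = {
    "arch_a": (1024, 256, 13, 16),
    "arch_b": (2048, 256, 56, 8),
}


def lmem_bytes(lmem_per_thread, threads_per_block, num_sm, blocks_per_sm):
    """Local memory needed if every multiprocessor is fully occupied."""
    return lmem_per_thread * threads_per_block * num_sm * blocks_per_sm


def worst_case_reserve(architectures):
    """Return the name and byte count of the most demanding architecture."""
    name = max(architectures, key=lambda n: lmem_bytes(*architectures[n]))
    return name, lmem_bytes(*architectures[name])


name, reserve = worst_case_reserve(ARCHITECTURES)
print(f"{name}: keep {reserve / 1024**2:.0f} MiB free")
```

With these placeholder numbers the second entry dominates, so its value would be the amount to keep free for all builds.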
I am reopening this issue until a more generic solution is found.
Self-answer to my post #2357 (comment)
When running the default LWFA example using version 0.3.1 of PIConGPU, the simulation fails during initialization when writing checkpoints. I could reproduce this both with libSplash compiled against ADIOS and HDF5 (parallel) and with libSplash compiled against HDF5 (parallel) only.
However, writing hdf5 output via the plugin works just fine, as long as checkpoints are not active.
I use the following modules on Piz Daint:
I build the additional libraries using the script of @ax3l here (great tool 👍).
(For hdf5 only, I removed the ADIOS library and rebuilt libSplash.)
Configuring worked just fine. Compiling produced a massive amount of boost warnings.
The stderr when --checkpoints 5000 is not active:
However, the simulation runs fine (see stdout):
The stderr when --checkpoints 5000 is active:
Here, the simulation dies during initialization (see stdout):
In addition to the default command line arguments of the 32 GPU example, I just used --hdf5.period 1000 and --checkpoints 5000.
All hdf5 files from the hdf5 plugin were written correctly.
I am not sure whether the memory errors are actually causing the failure, because they also occur without checkpoints (just a bit differently).
Any idea how to solve this issue?
I think neither @HighIander's project nor the TWTS project (cc @BeyondEspresso and @steindev) will work without checkpoints.