
Piz Daint: CUDA memory error with random number #2357

Open
PrometheusPi opened this issue Nov 4, 2017 · 10 comments
Labels: backend: cuda (CUDA backend), bug (a bug in the project's code)

@PrometheusPi
Member

When running the default LWFA example with version 0.3.1 of PIConGPU, the simulation fails during initialization when checkpoint writing is enabled.

I could reproduce this both with libSplash compiled against ADIOS and with libSplash compiled against HDF5 (parallel) only.
However, writing hdf5 output via the plugin works just fine as long as checkpoints are not active.

I use the following modules on Piz Daint:

  1) modules/3.2.10.6
  2) eproxy/2.0.16-6.0.4.1_3.1__g001b199.ari
  3) gcc/5.3.0
  4) craype-haswell
  5) craype-network-aries
  6) craype/2.5.12
  7) cray-mpich/7.6.0
  8) slurm/17.02.7-1
  9) xalt/daint-2016.11
 10) cray-libsci/17.06.1
 11) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari
 12) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari
 13) pmi/5.0.12
 14) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari
 15) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari
 16) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari
 17) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari
 18) dvs/2.7_2.2.32-6.0.4.1_7.1__ged1923a
 19) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari
 20) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari
 21) atp/2.1.1
 22) perftools-base/6.5.1
 23) PrgEnv-gnu/6.0.4
 24) CMake/3.8.1
 25) cudatoolkit/8.0.61_2.4.3-6.0.4.0_3.1__gb475d12
 26) cray-hdf5-parallel/1.10.0.3

I built the additional libraries using the script from @ax3l here (great tool 👍).
(For the hdf5-only case, I removed the ADIOS library and rebuilt libSplash.)

Configuring worked just fine. Compiling produced a massive amount of Boost warnings.

The stderr when --checkpoints 5000 is not active:

[CUDA] Error: </.../PIConGPU/picongpu/src/libPMacc/include/simulationControl/SimulationHelper.hpp>:142
what():  [CUDA] Error: out of memory
terminate called after throwing an instance of 'std::runtime_error'

However, the simulation runs fine (see stdout):

Running program...
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 1048576
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time: 28sec 699msec = 28 sec
  0 % =        0 | time elapsed:             1sec 663msec | avg time per step:   0msec
  5 % =      500 | time elapsed:             7sec 502msec | avg time per step:  11msec
 10 % =     1000 | time elapsed:            13sec 565msec | avg time per step:  12msec
 15 % =     1500 | time elapsed:            21sec 462msec | avg time per step:  12msec
 20 % =     2000 | time elapsed:            27sec 534msec | avg time per step:  12msec
 25 % =     2500 | time elapsed:            35sec 354msec | avg time per step:  12msec
 30 % =     3000 | time elapsed:            41sec 435msec | avg time per step:  12msec
 35 % =     3500 | time elapsed:            49sec 218msec | avg time per step:  11msec
 40 % =     4000 | time elapsed:            55sec 218msec | avg time per step:  11msec
 45 % =     4500 | time elapsed:       1min  2sec 711msec | avg time per step:  11msec
 50 % =     5000 | time elapsed:       1min  8sec 778msec | avg time per step:  12msec
 55 % =     5500 | time elapsed:       1min 16sec 412msec | avg time per step:  11msec
 60 % =     6000 | time elapsed:       1min 22sec 415msec | avg time per step:  11msec
 65 % =     6500 | time elapsed:       1min 30sec  39msec | avg time per step:  11msec
 70 % =     7000 | time elapsed:       1min 36sec 119msec | avg time per step:  12msec
 75 % =     7500 | time elapsed:       1min 43sec 995msec | avg time per step:  11msec
 80 % =     8000 | time elapsed:       1min 50sec 135msec | avg time per step:  12msec
 85 % =     8500 | time elapsed:       1min 57sec 521msec | avg time per step:  11msec
 90 % =     9000 | time elapsed:       2min  3sec 471msec | avg time per step:  11msec
 95 % =     9500 | time elapsed:       2min 11sec  54msec | avg time per step:  11msec
100 % =    10000 | time elapsed:       2min 17sec  13msec | avg time per step:  11msec
calculation  simulation time:  2min 18sec 795msec = 138 sec
full simulation time:  2min 47sec 717msec = 167 sec

The stderr when --checkpoints 5000 is active:

[CUDA] Error: </.../PIConGPU/picongpu/src/libPMacc/include/eventSystem/Manager.tpp>:41
what():  [CUDA] Error: out of memory
terminate called after throwing an instance of 'std::runtime_error'

Here, the simulation dies during initialization (see stdout):

Running program...
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 1048576
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
initialization time: 28sec 670msec = 28 sec

In addition to the default command-line arguments of the 32-GPU example, I only used --hdf5.period 1000 and --checkpoints 5000.
All hdf5 files from the hdf5 plugin were written correctly.

I am not sure whether the memory errors are actually causing the failure, because they also occur without checkpoints (just in a slightly different place).
Any idea how to solve this issue?
I think neither @HighIander's project nor the TWTS project (cc @BeyondEspresso and @steindev) will work without checkpoints.

@PrometheusPi added the backend: cuda and bug labels, and added and removed the warning label, on Nov 4, 2017
@PrometheusPi
Member Author

Checking the hdf5 output of the hdf5 plugin, I found no particles.
However, the default LWFA setup should contain particles.

# output of h5ls -r simData_0.h5
/                        Group
/data                    Group
/data/0                  Group
/data/0/fields           Group
/data/0/fields/B         Group
/data/0/fields/B/x       Dataset {128, 896, 128}
/data/0/fields/B/y       Dataset {128, 896, 128}
/data/0/fields/B/z       Dataset {128, 896, 128}
/data/0/fields/E         Group
/data/0/fields/E/x       Dataset {128, 896, 128}
/data/0/fields/E/y       Dataset {128, 896, 128}
/data/0/fields/E/z       Dataset {128, 896, 128}
/data/0/fields/e_chargeDensity Dataset {128, 896, 128}
/data/0/fields/e_energyDensity Dataset {128, 896, 128}
/data/0/fields/e_particleMomentumComponent Dataset {128, 896, 128}
/data/0/particles        Group
/data/0/particles/e      Group
/data/0/particles/e/charge Group
/data/0/particles/e/mass Group
/data/0/particles/e/momentum Group
/data/0/particles/e/momentum/x Dataset {NULL}
/data/0/particles/e/momentum/y Dataset {NULL}
/data/0/particles/e/momentum/z Dataset {NULL}
/data/0/particles/e/particlePatches Group
/data/0/particles/e/particlePatches/extent Group
/data/0/particles/e/particlePatches/extent/x Dataset {32}
/data/0/particles/e/particlePatches/extent/y Dataset {32}
/data/0/particles/e/particlePatches/extent/z Dataset {32}
/data/0/particles/e/particlePatches/numParticles Dataset {32}
/data/0/particles/e/particlePatches/numParticlesOffset Dataset {32}
/data/0/particles/e/particlePatches/offset Group
/data/0/particles/e/particlePatches/offset/x Dataset {32}
/data/0/particles/e/particlePatches/offset/y Dataset {32}
/data/0/particles/e/particlePatches/offset/z Dataset {32}
/data/0/particles/e/position Group
/data/0/particles/e/position/x Dataset {NULL}
/data/0/particles/e/position/y Dataset {NULL}
/data/0/particles/e/position/z Dataset {NULL}
/data/0/particles/e/positionOffset Group
/data/0/particles/e/positionOffset/x Dataset {NULL}
/data/0/particles/e/positionOffset/y Dataset {NULL}
/data/0/particles/e/positionOffset/z Dataset {NULL}
/data/0/particles/e/weighting Dataset {NULL}
/data/0/picongpu         Group
/data/0/picongpu/idProvider Group
/data/0/picongpu/idProvider/nextId Dataset {2, 8, 2}
/data/0/picongpu/idProvider/startId Dataset {2, 8, 2}
/header                  Group
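
For reference, here is a minimal sketch (plain HDF5 C API called from C++; the file name and dataset path are taken from the listing above) that confirms programmatically that one of the particle datasets is empty:

#include <hdf5.h>
#include <cstdio>

int main()
{
    // open the file written by the hdf5 plugin and one of the empty-looking datasets
    hid_t file  = H5Fopen("simData_0.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset  = H5Dopen(file, "/data/0/particles/e/weighting", H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);

    // a {NULL} dataspace reports zero elements
    hssize_t numElements = H5Sget_simple_extent_npoints(space);
    std::printf("e/weighting holds %lld elements\n", (long long)numElements);

    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}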

@PrometheusPi
Member Author

The macro particle counter also results in zero particles.

@PrometheusPi
Member Author

With all debug output enabled, one sees that the error occurs after initialization, during the distribution of particles according to the density profile.

The error occurs in picongpu/src/picongpu/include/particles/Particles.tpp line 278.

PMACC_KERNEL( KernelFillGridWithParticles< Particles >{} )
    ( mapper.getGridDim(), block )
    ( densityFunctor, positionFunctor, totalGpuCellOffset,
      this->particlesBuffer->getDeviceParticleBox( ), mapper );

The last verbose output message in stdout is

...
PIConGPUVerbose SIMULATION_STATE(16) | Starting simulation from timestep 0
PIConGPUVerbose SIMULATION_STATE(16) | Loading from default values finished
PMaccVerbose MEMORY(1) | DataConnector: sharing access to 'e' (1 uses)
PIConGPUVerbose SIMULATION_STATE(16) | initialize density profile for species e

@PrometheusPi
Member Author

PrometheusPi commented Nov 4, 2017

This issue comes from the random number generator used during random position initialization. Using quiet start solves the issue.

Thus @BeyondEspresso and @steindev, this will not be an issue with TWTS, since all random distributions will be done on the CPU beforehand.

@HighIander and @n01r, even when using quiet start, you will most likely encounter the same issue when using a probability-based ionization scheme.

@ax3l or @psychocoderHPC Is setting CUDA_ARCH to 60 correct for the Tesla P100?

I am a bit confused: the CSCS web site says they use NVIDIA® Tesla® P100 16GB, but on this web page there is no such thing as a Tesla P100 - there is only a Pascal P100 (SM_60) and a Tesla V100 (SM_70, CUDA 9 only).

Okay, the Tesla P100-PCIE-16GB is the same card, see here.

@psychocoderHPC
Member

Please increase the reserved memory (reservedGpuMemorySize) in the file memory.param.
This should solve your issue. The reason is that the P100 is much more parallel than all previous GPUs (more SMX), so there is not enough memory left for the local memory (lmem) used during the RNG initialization.

sm_60 is correct for the P100.

@PrometheusPi
Member Author

@psychocoderHPC Thanks - setting reservedGpuMemorySize to twice its original value (now 350 *1024*1024 * 2) solved the issue.
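
For anyone hitting the same problem, this is roughly what the change in memory.param looks like (a sketch only; the exact declaration and surrounding comments may differ between PIConGPU versions):

#include <cstddef> // size_t (already available in the real param file)

namespace picongpu
{
    /* bytes per GPU that PIConGPU leaves untouched for other consumers
     * (driver, kernel lmem, ...); doubled from the default 350 MiB so the
     * RNG initialization on the P100 has enough room */
    constexpr std::size_t reservedGpuMemorySize = 350 * 1024 * 1024 * 2;
}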

@PrometheusPi changed the title from "Piz Daint: writing checkpoints fail with CUDA memory error" to "Piz Daint: CUDA memory error with random number" on Nov 4, 2017
@ax3l
Member

ax3l commented Nov 5, 2017

We should really integrate the "legacy" RNG used for startup into our new state-aware RNG implementation, to dramatically reduce the extra memory required in memory.param.

@psychocoderHPC
Member

psychocoderHPC commented Nov 5, 2017 via email

@ax3l added this to the 0.4.0 / 1.0.0: Next Stable milestone on Nov 21, 2017
@ax3l
Member

ax3l commented Nov 21, 2017

I am reopening this issue until a more generic solution is found.

@ax3l reopened this on Nov 21, 2017
@psychocoderHPC
Member

psychocoderHPC commented Nov 21, 2017

Self-answer to my post #2357 (comment):
It is not possible to check the lmem usage for all kernels and then multiply it by the maximum number of hardware threads per GPU. The reason is that a P100 can handle 2048 threads per multiprocessor and contains 56 SMX.
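
Just to illustrate the scale of that multiplication (a back-of-the-envelope sketch; the lmem-per-thread number below is a made-up placeholder, only the thread and SMX counts come from this comment):

#include <cstdio>

int main()
{
    // numbers from the comment above
    const long long smxPerGpu     = 56;    // multiprocessors on a P100
    const long long threadsPerSmx = 2048;  // max resident threads per multiprocessor
    // hypothetical placeholder, not a measured value for any PIConGPU kernel
    const long long lmemPerThread = 1024;  // bytes of local memory per thread

    const long long residentThreads = smxPerGpu * threadsPerSmx;       // 114688
    const long long lmemBytes       = residentThreads * lmemPerThread;

    std::printf("maximum resident threads: %lld\n", residentThreads);
    std::printf("lmem reserved for them:   %.1f MiB\n",
                lmemBytes / (1024.0 * 1024.0));
    return 0;
}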
