DMC LocalECP incorrect in GPU code on Titan using CUDA 9.1 #1440
Comments
The local ECP kernel is one that is known to not be reproducible between runs, i.e. it is buggy. It has something to do with the walker count and the GPU thread/block count. Previously the differences have been small enough to be ignorable; this problem indicates it must be fixed. There are a couple of issues on this. You don't state it explicitly, but is the non-local ECP term correct?
The non-local ECP term appears to be correct.
To save time debugging this, for the next 3 weeks the necessary pwscf file is at:
I did some VMC experimentation. On a single Kepler GPU with a fixed seed and either 1 or 320 walkers, I was able to reproduce the previously noticed non-determinism with just a few moves, i.e. multiple runs of the executable generate slightly different results. From this short run and my current inputs we can't say whether the energies are "bad", but the local electron-ion and electron-electron terms are not repeatable. The much harder to compute kinetic energy and non-local electron-ion terms are repeatable (?!).
VMC runs with 320 walkers are essentially the same, i.e. no 0.3 Ha shift. All inputs and outputs from the test, including the wavefunction: https://ftp.ornl.gov/filedownload?ftp=e;dir=ICE
@jtkrogel Where and how were you able to produce the CPU-GPU energy shift? Machine, QMCPACK version, software versions, node/MPI/thread counts, etc. In my DMC tests so far I have not found such a sizable shift.
The results are from runs performed by Andrea Zen (@zenandrea) on Titan with QMCPACK 3.6.0 on 4 nodes, 1 MPI task per node, 1 thread per MPI task (see files job_qmcpack_gpu-titan, input_dmcgpu.xml, and out_dmcgpu in TEST_DMC.zip). The build details, as far as I know, follow our build_olcf_titan.sh script, but with changes to the boost and fftw libraries as follows: boost/1.62.0 and fftw/3.3.4.11. Presumably with the real AoS code. @zenandrea, please check if I have missed something.
Dear @jtkrogel and @prckent, In particular, this is my compilation script:
Thanks. Nothing unreasonable in the above. It should work without problems. FFTW would not cause the failures. If FFTW were wrong - and I don't recall a single case ever where it has been - the kinetic energy and Monte Carlo walk in general would also be wrong.
I have reproduced this problem using the current develop version and with builds that pass the unit tests and the diamond and LiH integration tests. I used the updated build script from #1472, i.e. nothing out of the ordinary. Using 1 MPI task, 16 OMP threads, and 0/1 GPU I get a 0.6 Hartree (!) difference in the DMC energies (series 2 & 3 below), while the VMC energies agree. The difference is in the local part of the pseudopotential. The analysis below is not done carefully, but it is interesting that the kinetic energy and acceptance ratio appear to match between CPU and GPU. A 4-node run shows a slightly smaller disagreement between the codes.
Also worth noting that the DMC energy is above the VMC one...
Attempting to bracket the problem:
Still puzzling is why our existing carbon diamond or LiH tests don't trigger this bug.
By varying the number of walkers I was able to break VMC (good suggestion by @jtkrogel). The bug is back to looking like a bad kernel.
The linked VMC test gives incorrect results on Titan. Puzzlingly, these same files give correct results on oxygen (Intel Xeon + Kepler + clang 6 + CUDA 10.0 currently). A naively incorrect kernel would give reproducible errors.
@prckent I can reproduce your numbers on Titan.
@prckent When I go back to CUDA 7.5 (using GCC 4.9.3 and an older version of QMCPACK) I get the correct results (qmc_gpu series 1). So this could be an issue with the CUDA installation on Titan...
@atillack Interesting. If you are using a standalone workstation with CUDA 7.5 (!), the question is whether you can break VMC by e.g. varying the number of walkers, or if running Andrea's original DMC case still breaks.
@atillack Is there a specific build config + QMCPACK version you can recommend that does not display the problem on Titan? This may represent a practical way @zenandrea can get correct production runs sooner.
@jtkrogel QMCPACK 3.5.0. Here are the modules I have loaded (for gcc/4.9.3: "module unload gcc; module load gcc/4.9.3" after "module swap PrgEnv-pgi PrgEnv-gnu" works): Currently Loaded Modulefiles:
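(The full loaded-module list is not reproduced here.) As a minimal sketch, the compiler setup described inline above amounts to:

```sh
# Titan environment setup quoted above: switch to the GNU programming
# environment and pin gcc/4.9.3.
module swap PrgEnv-pgi PrgEnv-gnu
module unload gcc
module load gcc/4.9.3
```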
@prckent @jtkrogel I just looked into the CUDA 9 changelog and found this wonderful snippet:
This at least may explain what is going on. I am not sure how to pass this parameter down to ptxas though... Edit: Testing now.
@prckent @jtkrogel CUDA 7.5 is still the temporary solution. The ptxas flag (-Xptxas --new-sm3x-opt=false, which can be put in CUDA_NVCC_FLAGS) only gets the results halfway to the correct number with CUDA 9.1 on Titan (qmc_gpu series 1).
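For concreteness, here is a minimal sketch of passing that flag at configure time, assuming the legacy QMC_CUDA GPU build where CUDA_NVCC_FLAGS is honored (the source path is a placeholder):

```sh
# Sketch: add the ptxas workaround flag via CUDA_NVCC_FLAGS.
# CMake list entries are semicolon-separated; adjust the source path.
cmake -DQMC_CUDA=1 \
      -DCUDA_NVCC_FLAGS="-Xptxas;--new-sm3x-opt=false" \
      /path/to/qmcpack
```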
@prckent @jtkrogel After talking with our Nvidia representatives: there is a code generation regression in 9.1 which is fixed in 9.2. So on Titan, it seems the only workaround is to use 7.5 for the time being. If a newer version than QMCPACK 3.5.0 is needed, some (minor) code changes are needed in order to compile with CUDA 7.5:
@prckent @jtkrogel On Summit, using CUDA 9.2, the correct results are also obtained (qmc_gpu series 1).
@zenandrea Please ask - I am not sure that 9.2 will be installed given that Titan has only a few more months of accessibility, but other packages are certainly at risk. Are you able to move to Summit or is your time only on Titan? This is a scary problem and I am not keen on recommending use of older software.
@prckent I have half the resources on Titan and half on Summit.
@prckent @zenandrea As CUDA 9.1's behavior was seen as mostly a performance regression, the Nvidia folks are looking at our kernel giving bad numbers under 9.1 to see if there's a possible workaround. @zenandrea It's a good idea to ask, but like Paul I am uncertain whether this will happen in time to be useful. In the interim, with small code changes (see the post above) it is possible to compile a current version of QMCPACK on Titan with CUDA 7.5, but this only works with GCC 4.9.3 as otherwise modules are missing.
I am still open to the idea that we have illegal/buggy code, and that different CUDA versions, GPUs, etc. expose the problem in different ways. However, "bad generated code" is the best explanation given the established facts. What is still so strange is that all the difficult and costly parts of the calculation involving the wavefunction are correct.
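One way to probe the illegal-code hypothesis, as a hedged sketch (the binary and input paths are placeholders, not files from this issue), is to run a short reproducer under cuda-memcheck, which reports out-of-bounds and misaligned device accesses that different CUDA versions might otherwise mask:

```sh
# Sketch only: look for illegal device memory accesses in a short run.
cuda-memcheck ./bin/qmcpack input_dmcgpu.xml
```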
I have a solution to use 7.5 with the current QMCPACK. Will PR soon. |
@ye-luo Thanks! |
I failed to find a clean solution through the source alone because I needed to hack CMake.
I'll note that an initialization bug similar to #1518 could explain these problems. |
I checked. Unfortunately #1518 is not related to this bug.
@prckent The problem seems contained to Titan. CUDA 9.1 on Summit also gives the correct results (qmc_gpu series 1).
To work around the bug in CUDA 9.1, which gives wrong results, I have put both v3.6 and v3.7 binaries at:
I modified the title for posterity to record the actual determined problem. |
Dear @prckent, the new binaries seem to work well on the cases I have tested so far.
Disagreement between CPU and GPU DMC total energies was observed for a water molecule in periodic boundary conditions (8 Å cubic cell, CASINO pseudopotentials, Titan at OLCF, QMCPACK 3.6.0). Issue originally reported by Andrea Zen. Original inputs and outputs: TEST_DMC.zip
From the attached outputs, the VMC energies agree, while the DMC energies differ by about 0.3 Ha:
The difference is entirely attributable to the local part of the ECP:
Note: the DMC error bars are not statistically meaningful here (10 blocks), but the difference is large enough to support this conclusion.
The oddity here is that the error is only seen in DMC and it is limited to a single potential energy term. This may indicate a bug in LocalECP that surfaces with increased walker count on the GPU (1 walker/GPU in VMC, 320 walkers/GPU in DMC). A series of VMC runs with an increasing number of walkers will likely show this; a sketch of such a sweep follows.
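As a rough illustration only (the template file, WALKERS placeholder, and binary names are assumptions, not files from this issue):

```sh
# Hypothetical walker-count sweep: run the same VMC input with increasing
# walker counts on CPU and GPU builds, then compare the LocalECP term.
for w in 1 8 32 128 320; do
  sed "s/WALKERS/${w}/" vmc_template.xml > vmc_w${w}.xml
  ./qmcpack_cpu vmc_w${w}.xml > out_cpu_w${w}.log
  ./qmcpack_gpu vmc_w${w}.xml > out_gpu_w${w}.log
done
```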