Description
Bug summary
I am training a polar model with the PyTorch backend. Every RMSE value in lcurve.out is nan from the first step onward, and after roughly 900 steps the training aborts with a CUDA initialization error raised in a DataLoader worker.
DeePMD-kit Version
DeePMD-kit v3.0.1.dev89+gc9baf668
Backend and its version
PyTorch 2.6.0+cu126
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
I am running the training task in examples and using the following input file:
The lcurve.out file looks like this:
# step rmse_global_polar_val rmse_global_polar_trn lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
1 nan nan 1.0e-03
100 nan nan 1.0e-03
200 nan nan 1.0e-03
300 nan nan 1.0e-03
400 nan nan 1.0e-03
500 nan nan 1.0e-03
600 nan nan 1.0e-03
700 nan nan 1.0e-03
800 nan nan 1.0e-03
900 nan nan 1.0e-03
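For anyone triaging a similar run, the check above can be automated. The following is a minimal sketch, assuming the column layout shown in this lcurve.out (a `# step ...` header line followed by whitespace-separated rows). `find_nan_columns` is a hypothetical helper written for this report and is not part of DeePMD-kit.

```python
import math


def find_nan_columns(lcurve_text):
    """Map each column name to the first step at which its value is nan."""
    header = None
    first_nan = {}
    for line in lcurve_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # Only the comment line that begins with "step" names the columns.
            if header is None and fields and fields[0] == "step":
                header = fields
            continue
        if header is None:
            raise ValueError("no '# step ...' header line found")
        values = line.split()
        step = int(values[0])
        for name, raw in zip(header[1:], values[1:]):
            if math.isnan(float(raw)) and name not in first_nan:
                first_nan[name] = step
    return first_nan


# First rows of the lcurve.out quoted in this report.
sample = """\
# step rmse_global_polar_val rmse_global_polar_trn lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
1 nan nan 1.0e-03
100 nan nan 1.0e-03
"""

print(find_nan_columns(sample))
```

A nan at the very first step, as here, suggests looking at the reference data or the initial model output rather than at a later divergence, although the sketch itself cannot tell those causes apart.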
The error message is:
[2025-02-07 10:45:35,058] DEEPMD INFO batch 900: trn: rmse_global_polar = nan, lr = 1.00e-03
[2025-02-07 10:45:35,059] DEEPMD INFO batch 900: val: rmse_global_polar = nan
[2025-02-07 10:45:35,059] DEEPMD INFO batch 900: total wall time = 1.09 s
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x151658205788 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x1516581aeea8 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x1516582de3d2 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::ExchangeDevice(signed char) + 0x9a (0x1516582de78a in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0xf660e3 (0x15165929f0e3 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf62cfb (0x15165929bcfb in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf69814 (0x1516592a2814 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x4f2092 (0x1516a15f4092 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x1516581dffe9 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x7a6478 (0x1516a18a8478 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2bc (0x1516a18a879c in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x201adb (0x564a3b233adb in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #12: <unknown function> + 0x23337f (0x564a3b26537f in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #13: <unknown function> + 0x233344 (0x564a3b265344 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #14: <unknown function> + 0x23204a (0x564a3b26404a in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #15: <unknown function> + 0x28e340 (0x564a3b2c0340 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #16: <unknown function> + 0x2114d9 (0x564a3b2434d9 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #17: <unknown function> + 0x2bb08b (0x564a3b2ed08b in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #18: <unknown function> + 0x2bb017 (0x564a3b2ed017 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #19: <unknown function> + 0x113146 (0x564a3b145146 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #20: <unknown function> + 0x26f3c4 (0x564a3b2a13c4 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #21: <unknown function> + 0x287eca (0x564a3b2b9eca in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #22: PyDict_MergeFromSeq2 + 0x4d (0x564a3b2ae08d in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #23: <unknown function> + 0x2fe2d5 (0x564a3b3302d5 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #24: <unknown function> + 0x115570 (0x564a3b147570 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #25: <unknown function> + 0x27118f (0x564a3b2a318f in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #26: <unknown function> + 0x32a3bc (0x564a3b35c3bc in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #27: <unknown function> + 0x27091a (0x564a3b2a291a in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #28: <unknown function> + 0x113768 (0x564a3b145768 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #29: _PyObject_FastCallDictTstate + 0x1ee (0x564a3b2392fe in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #30: <unknown function> + 0x23229c (0x564a3b26429c in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #31: _PyObject_MakeTpCall + 0x274 (0x564a3b236714 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #32: <unknown function> + 0x1126a1 (0x564a3b1446a1 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #33: _PyObject_FastCallDictTstate + 0x1ee (0x564a3b2392fe in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #34: <unknown function> + 0x23229c (0x564a3b26429c in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #35: _PyObject_MakeTpCall + 0x274 (0x564a3b236714 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #36: <unknown function> + 0x1126a1 (0x564a3b1446a1 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #37: PyObject_CallOneArg + 0x54 (0x564a3b261c14 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #38: <unknown function> + 0x2fea5a (0x564a3b330a5a in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #39: PyObject_GetIter + 0x13 (0x564a3b232193 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #40: <unknown function> + 0x113768 (0x564a3b145768 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #41: PyEval_EvalCode + 0xa1 (0x564a3b2ec741 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #42: <unknown function> + 0x2def1a (0x564a3b310f1a in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #43: <unknown function> + 0x2d9d35 (0x564a3b30bd35 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #44: <unknown function> + 0x2f2780 (0x564a3b324780 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #45: _PyRun_SimpleFileObject + 0x1ce (0x564a3b323dfe in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #46: _PyRun_AnyFileObject + 0x44 (0x564a3b323ac4 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #47: Py_RunMain + 0x2fe (0x564a3b31cdfe in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #48: Py_BytesMain + 0x37 (0x564a3b2d70c7 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
frame #49: <unknown function> + 0x295d0 (0x1516aa4295d0 in /lib64/libc.so.6)
frame #50: __libc_start_main + 0x80 (0x1516aa429680 in /lib64/libc.so.6)
frame #51: <unknown function> + 0x2a4f71 (0x564a3b2d6f71 in /scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/python3.12)
Fatal Python error: Aborted
Thread 0x00001512743b9640 (most recent call first):
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/threading.py", line 355 in wait
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/queues.py", line 251 in _feed
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/threading.py", line 1012 in run
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/threading.py", line 1032 in _bootstrap
Current thread 0x00001516aa64a440 (most recent call first):
Garbage-collecting
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/ast.py", line 90 in _convert
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/ast.py", line 101 in _convert
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/ast.py", line 112 in literal_eval
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/numpy/lib/format.py", line 644 in _read_array_header
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/numpy/lib/format.py", line 811 in read_array
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/numpy/lib/_npyio_impl.py", line 488 in load
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/utils/path.py", line 187 in load_numpy
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/utils/data.py", line 629 in _load_data
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/utils/data.py", line 521 in _load_set
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/utils/data.py", line 246 in get_item_torch
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/utils/dataset.py", line 39 in __getitem__
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52 in fetch
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 764 in _next_data
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 708 in __next__
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/utils/dataloader.py", line 179 in __getitem__
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 54 in fetch
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349 in _worker_loop
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/process.py", line 108 in run
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/popen_fork.py", line 71 in _launch
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/popen_fork.py", line 19 in __init__
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/context.py", line 282 in _Popen
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/context.py", line 224 in _Popen
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/multiprocessing/process.py", line 121 in start
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1146 in __init__
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 422 in _get_iterator
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 491 in __iter__
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/train/training.py", line 1069 in get_data
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/train/training.py", line 687 in step
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/train/training.py", line 954 in run
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/entrypoints/main.py", line 360 in train
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/entrypoints/main.py", line 527 in main
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/main.py", line 928 in main
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/dp", line 8 in <module>
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, yaml._yaml, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ellip_harm_2, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, 
scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl (total: 111)
Traceback (most recent call last):
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/bin/dp", line 8, in <module>
sys.exit(main())
^^^^^^
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/main.py", line 928, in main
deepmd_main(args)
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/entrypoints/main.py", line 527, in main
train(
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/entrypoints/main.py", line 360, in train
trainer.run()
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/train/training.py", line 954, in run
step(step_id)
File "/scratch/gpfs/CAR/yifanl/Software/deepmd-kit/deepmd/pt/train/training.py", line 703, in step
loss.backward()
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/CAR/yifanl/miniforge3/envs/dp-pt/lib/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1139677) is killed by signal: Aborted.
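A note on reading the dump above: the aborted process is a DataLoader worker started through `multiprocessing/popen_fork.py`, and the C++ frames show a tensor being destroyed (`c10::TensorImpl::~TensorImpl`) while the worker was garbage-collecting. One possible reading, which is an assumption and not a confirmed diagnosis, is that a CUDA tensor inherited from the parent was freed inside the forked child, where CUDA cannot be initialized. The stdlib-only sketch below reports which start method Python would use and which non-fork alternatives exist on the machine. `describe_start_method` is a hypothetical helper, not a DeePMD-kit or PyTorch function.

```python
import multiprocessing as mp

# Start methods that launch a fresh interpreter instead of copying the parent.
NON_FORK_METHODS = ("spawn", "forkserver")


def describe_start_method():
    """Return (current_method, uses_fork, non_fork_alternatives)."""
    # allow_none=True avoids fixing the global context as a side effect.
    current = mp.get_start_method(allow_none=True)
    if current is None:
        # The first entry of get_all_start_methods() is the platform default.
        current = mp.get_all_start_methods()[0]
    alternatives = [m for m in mp.get_all_start_methods() if m in NON_FORK_METHODS]
    return current, current == "fork", alternatives


current, uses_fork, alternatives = describe_start_method()
print(f"start method: {current}, uses fork: {uses_fork}, alternatives: {alternatives}")
```

PyTorch's `torch.utils.data.DataLoader` accepts a `multiprocessing_context` argument (for example `"spawn"`). Whether DeePMD-kit exposes a setting for it is not stated in this report, so treat this only as a direction to investigate.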
Steps to Reproduce
dp --pt train polar_input.json
Further Information, Files, and Links
No response