../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [75,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [91,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [43,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [59,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [127,0,0], thread: [107,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [127,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [107,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [11,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [27,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
[rank0]:[E806 10:11:47.197701816 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe2d5e9af86 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe2d5e49d10 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fe2d5f75f08 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fe287f683e6 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fe287f6d600 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fe287f742ba in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe287f766fc in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fe2d56c7bf4 in /root/anaconda3/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fe2d6bd9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fe2d6c6b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
E0806 10:11:47.450000 139709132887872 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 100) of binary: /root/anaconda3/bin/python
Traceback (most recent call last):
File "/root/anaconda3/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-08-06_10:11:47
host : Lab-PC
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 100)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 100
Environment
Ubuntu 22.04, GeForce RTX 3090, driver 550.67, CUDA 12.4, Python 3.12.4, torch 2.4.0, torchvision 0.19.0
Execution Command
CUDA_VISIBLE_DEVICES=0 torchrun --master_port=9909 tools/train.py -c configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml --seed=0
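As the log itself suggests, rerunning the same command with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the device-side assert should surface at the actual failing call instead of at a later API call, e.g.:
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 torchrun --master_port=9909 tools/train.py -c configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml --seed=0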
I am trying to train on a custom COCO-format dataset, so I only modified rtdetrv2_pytorch/configs/dataset/coco_detection.yml, setting the corresponding "num_classes" and "remap_mscoco_category: False". I am not sure whether any other config needs to be modified, or whether missing such a change is what causes the crash above. Could anyone give a hint about why this happens and how to solve it? That would be very helpful, thanks!
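For reference, below is a quick sanity check over the custom annotation file, since with "remap_mscoco_category: False" the raw category_id values are, as far as I understand, used directly as class labels, and an id outside [0, num_classes) could trigger exactly this kind of device-side "index out of bounds" assert. The annotation path and num_classes value are placeholders to be adjusted:

import json

# Placeholder path to the custom COCO-format annotation file -- adjust to the real one.
ann_file = "path/to/instances_train.json"
# The num_classes value set in coco_detection.yml (placeholder).
num_classes = 80

with open(ann_file) as f:
    coco = json.load(f)

# Collect the category ids declared in the file and the ids actually used by annotations.
declared_ids = sorted({c["id"] for c in coco["categories"]})
used_ids = sorted({a["category_id"] for a in coco["annotations"]})
print("category ids declared:", declared_ids)
print("category ids used by annotations:", used_ids)

# If the raw category_id is taken as the class index, every id must lie in
# [0, num_classes); anything outside that range can cause out-of-bounds indexing.
out_of_range = [i for i in used_ids if not 0 <= i < num_classes]
print("out-of-range ids:", out_of_range if out_of_range else "none")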