RTDETRv2 training error #406

Open
philp123 opened this issue Aug 6, 2024 · 1 comment

philp123 commented Aug 6, 2024

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [75,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [91,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [43,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [59,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [127,0,0], thread: [107,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [127,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [107,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [4,0,0], thread: [123,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [11,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [86,0,0], thread: [27,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

[rank0]:[E806 10:11:47.197701816 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered [0/940]
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe2d5e9af86 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe2d5e49d10 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fe2d5f75f08 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fe287f683e6 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fe287f6d600 in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fe287f742ba in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fe287f766fc in /root/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fe2d56c7bf4 in /root/anaconda3/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fe2d6bd9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fe2d6c6b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

E0806 10:11:47.450000 139709132887872 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 100) of binary: /root/anaconda3/bin/python
Traceback (most recent call last):
File "/root/anaconda3/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-06_10:11:47
host : Lab-PC
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 100)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 100

Environment
Ubuntu 22.04, GeForce RTX 3090, driver 550.67, CUDA 12.4, Python 3.12.4, torch 2.4.0, torchvision 0.19.0

Execution Command
CUDA_VISIBLE_DEVICES=0 torchrun --master_port=9909 tools/train.py -c configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml --seed=0

I'm trying to train on a custom COCO-format dataset, so I only modified rtdetrv2_pytorch/configs/dataset/coco_detection.yml with the corresponding "num_classes" and "remap_mscoco_category: False". I'm not sure whether there are any other configs that need to be modified, or whether that is what causes the crash above. Could anyone give a hint about why this happens and how to solve it? That would be very helpful, thanks!
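
As a quick sanity check (a minimal sketch, not part of the repo; the annotation path below is hypothetical), you can compare the category ids in the custom annotation file against num_classes, since out-of-range class labels are a common cause of this kind of device-side index assert:

```python
# Minimal sketch: inspect the category ids in a custom COCO annotation file.
# The annotation path is hypothetical; adjust it to your dataset.
import json

with open("path/to/annotations/instances_train.json") as f:
    coco = json.load(f)

cat_ids = sorted(c["id"] for c in coco["categories"])
used_ids = sorted({a["category_id"] for a in coco["annotations"]})
print("declared category ids:", cat_ids)
print("category ids used by annotations:", used_ids)

# With remap_mscoco_category: False the raw category ids are used as class labels,
# so every id above must be strictly less than num_classes in the dataset config.
print("minimum safe num_classes:", max(used_ids) + 1)
```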

@VimukthiRandika1997

num_classes should be the actual number of classes + 1 when you set remap_mscoco_category: False.
That's why you are getting the index error!
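
For reference, a minimal sketch of the relevant keys in configs/dataset/coco_detection.yml for a hypothetical 10-class custom dataset (the surrounding layout of the file may differ):

```yaml
# Hypothetical excerpt for a custom dataset with 10 classes.
# With remap_mscoco_category: False the raw category ids are used as labels,
# so num_classes is set to the actual number of classes + 1.
num_classes: 11
remap_mscoco_category: False
```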
