Description
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
We have received quite a lot of reports about "Watchdog caught collective operation timeout" errors. The failure is flaky and difficult to reproduce, and it typically shows up only after the program has been running for some time.
To analyze the error we need enough of the stack trace. If you encounter a similar problem, please paste a sufficiently complete stack trace here so we can debug it.
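When you rerun to collect the trace, it also helps to turn on more verbose error reporting. The snippet below is only a sketch of environment variables that usually make the first error line more informative; whether `VLLM_TRACE_FUNCTION` is available depends on your vLLM version, so treat that one as an assumption:

```python
# Sketch: settings that tend to make the underlying CUDA/NCCL error easier to
# spot. Set them before torch / vLLM are imported (or export them in the shell
# that launches the server).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report the failing kernel synchronously (slows things down)
os.environ["NCCL_DEBUG"] = "TRACE"        # verbose NCCL logging
os.environ["VLLM_TRACE_FUNCTION"] = "1"   # assumption: vLLM function-level tracing, if your version supports it
```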
Example: https://buildkite.com/vllm/ci-aws/builds/3548#01906e81-54c6-4713-beb7-d08f3c873200 caught one such error.
Please include the first line of the error, together with the Python-level stack trace.
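If a worker is simply stuck and never prints a Python traceback, one way to grab it is with the standard library's `faulthandler`; the sketch below only shows the idea (where you hook it into your own worker code is up to you). Once registered, sending `SIGUSR1` to the hung process dumps the Python stack of every thread to stderr:

```python
# Sketch: dump Python-level stack traces from a (possibly hung) worker process.
import faulthandler
import signal
import sys

# `kill -USR1 <pid>` on the worker will now print every thread's Python stack.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or dump automatically after 10 minutes (roughly the default NCCL watchdog
# timeout) without killing the process.
faulthandler.dump_traceback_later(600, repeat=False, file=sys.stderr, exit=False)
```

`py-spy dump --pid <pid>` is another common way to get the same information without modifying any code.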
In the following example, it seems that one process hit an illegal memory access and died, while the remaining processes were still inside an allreduce waiting for it, which caused the timeout. From the Python-level stack trace, the failure happened during the profile run and appears to be related to the MoE layer. (A minimal sketch of this failure mode is included after the log.)
```
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd5c7e42b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd5c7f6a718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fd57bc4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd57bc4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7fd57bc545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd57bc5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd5c7e92897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7fd57b8d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7fd5c76b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7fd5c90d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7fd5c916b850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR 07-01 13:54:43 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1159 died, exit code: -6
INFO 07-01 13:54:43 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9e8b64f1b2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9e8b653fd0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9e8b65531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ed797a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f9e8b2d7e33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f9ed70b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f9f821e3ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f9f82275850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f96975fe640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/usr/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f96acbfd640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f9f8214e480 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854 in __call__
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 357 in topk_softmax
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34 in wrapper
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 357 in fused_topk
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 549 in fused_moe
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 273 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 426 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 470 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 540 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1135 in execute_model
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 814 in profile_run
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 175 in determine_num_available_blocks
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 122 in _run_workers
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 344 in _initialize_kv_caches
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251 in __init__
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 405 in from_engine_args
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144 in __init__
File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/vllm_causallms.py", line 97 in __init__
File "/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py", line 133 in create_from_arg_string
File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 164 in simple_evaluate
File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 288 in _wrapper
File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 29 in launch_lm_eval
File "/vllm-workspace/.buildkite/lm-eval-harness/test_lm_eval_correctness.py", line 45 in test_lm_eval_correctness
File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
File "/usr/local/lib/python3.10/dist-packages/_pytest/python.py", line 1632 in runtest
File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 173 in pytest_runtest_call
File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 241 in <lambda>
File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 341 in from_call
File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 240 in call_and_report
File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 135 in runtestprotocol
File "/usr/local/lib/python3.10/dist-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 364 in pytest_runtestloop
File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 339 in _main
File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 285 in wrap_session
File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 332 in pytest_cmdline_main
File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103 in _multicall
File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120 in _hookexec
File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513 in __call__
File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 178 in main
File "/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py", line 206 in console_main
File "/usr/local/bin/pytest", line 8 in <module>
```