Closed
Labels
bug (Something isn't working)
Description
What happened?
When launching a batch job, it crashes with the following error.
What are the steps to reproduce the bug?
Launch a batch job.
Version
Platform (OS and architecture)
leonardo, juwels, hpc2020
Relevant log output
1: [rank1]:[E702 12:22:50.103110475 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52, OpType=ALLREDUCE, NumelIn=11, NumelOut=11, Timeout(ms)=240000) ran for 240000 milliseconds before timing out.
2: [rank2]:[E702 12:22:50.160528660 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52, OpType=ALLREDUCE, NumelIn=11, NumelOut=11, Timeout(ms)=240000) ran for 240057 milliseconds before timing out.
1: [rank1]:[E702 12:22:50.186106643 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 52 PG status: last enqueued work: 52, last completed work: 51
2: [rank2]:[E702 12:22:50.186148043 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 52 PG status: last enqueued work: 52, last completed work: 51
1: [rank1]:[E702 12:22:50.186169433 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
1: [rank1]:[E702 12:22:50.186183053 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2: [rank2]:[E702 12:22:50.186189443 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
2: [rank2]:[E702 12:22:50.186205244 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
1: [rank1]:[E702 12:22:50.186190253 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
2: [rank2]:[E702 12:22:50.186212574 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
Accompanying data
No response
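To get a stack trace of the failed collective on the next run, the log itself suggests enabling the FlightRecorder via TORCH_NCCL_TRACE_BUFFER_SIZE. A sketch of the environment setup for the batch script — the values are illustrative, and TORCH_NCCL_DUMP_ON_TIMEOUT assumes a reasonably recent PyTorch:

```shell
# Record the last N collective operations per rank (non-zero enables FlightRecorder,
# as suggested in the log above); 2000 is an illustrative value.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

# Dump the recorder buffer to disk when the watchdog detects a timeout
# (assumed available in recent PyTorch builds).
export TORCH_NCCL_DUMP_ON_TIMEOUT=1

# Verbose NCCL-level logging to see which transport/ring the hang occurs on.
export NCCL_DEBUG=INFO
```

Setting these in the batch script before the `srun`/launcher line should make the next timeout produce enough context to tell a real deadlock apart from a slow collective.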
Organisation
No response
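If the all-reduce at SeqNum=52 is legitimately slow on these machines (e.g. a cold interconnect or filesystem stall on one rank) rather than a true deadlock, one mitigation is raising the process-group timeout above the 240000 ms seen in the watchdog message. A minimal sketch, assuming a standard `torch.distributed` setup; the 30-minute value is illustrative:

```python
from datetime import timedelta

# 240000 ms is the Timeout(ms) value reported by the NCCL watchdog in the log.
reported_timeout = timedelta(milliseconds=240000)

# Hypothetical larger timeout for the batch job; tune to the workload.
new_timeout = timedelta(minutes=30)

# In the job's setup code (only meaningful inside a distributed launch):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=new_timeout)

print(int(reported_timeout.total_seconds()), int(new_timeout.total_seconds()))
```

This only papers over the symptom if one rank is genuinely hung; the FlightRecorder dump is the way to confirm which case this is.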
Status
Done