NCCL timeout issue #433

@tjhunter

Description

log.txt

What happened?

When a batch job is launched, it crashes with the NCCL collective timeout shown in the log output below (full log attached as log.txt).

What are the steps to reproduce the bug?

Launch a batch job.

Version

2515950

Platform (OS and architecture)

leonardo, juwels, hpc2020

Relevant log output

1: [rank1]:[E702 12:22:50.103110475 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52, OpType=ALLREDUCE, NumelIn=11, NumelOut=11, Timeout(ms)=240000) ran for 240000 milliseconds before timing out.
2: [rank2]:[E702 12:22:50.160528660 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52, OpType=ALLREDUCE, NumelIn=11, NumelOut=11, Timeout(ms)=240000) ran for 240057 milliseconds before timing out.
1: [rank1]:[E702 12:22:50.186106643 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 52 PG status: last enqueued work: 52, last completed work: 51
2: [rank2]:[E702 12:22:50.186148043 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 52 PG status: last enqueued work: 52, last completed work: 51
1: [rank1]:[E702 12:22:50.186169433 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
1: [rank1]:[E702 12:22:50.186183053 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2: [rank2]:[E702 12:22:50.186189443 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
2: [rank2]:[E702 12:22:50.186205244 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
1: [rank1]:[E702 12:22:50.186190253 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
2: [rank2]:[E702 12:22:50.186212574 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
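For whoever picks this up: the watchdog message above suggests enabling the NCCL flight recorder via TORCH_NCCL_TRACE_BUFFER_SIZE so the stack trace of the failed collective is captured on the next run. Below is a minimal sketch, assuming a standard torch.distributed setup launched via torchrun/srun; the buffer size and the 10-minute timeout are only illustrative values, not recommendations.

```python
# Sketch only: raise the NCCL watchdog timeout and enable the flight recorder.
# Env var name comes from the log above; values here are arbitrary examples.
import os
from datetime import timedelta

import torch.distributed as dist

# Keep a trace buffer so the stack trace of a failed collective is recorded
# (the log notes it is disabled when TORCH_NCCL_TRACE_BUFFER_SIZE is zero).
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")

# Raise the watchdog timeout above the 240000 ms seen in the log.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```

Since the variable is read when the process group is created, exporting TORCH_NCCL_TRACE_BUFFER_SIZE in the batch script before the ranks start is probably the safer option than setting it in Python.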

Accompanying data

No response

Organisation

No response
