NCCL timeout issue #433

@tjhunter

Description

log.txt

What happened?

When a batch job is launched, it crashes with the NCCL collective timeout shown in the log output below (full log attached as log.txt).

What are the steps to reproduce the bug?

Launch a batch job.

Version

2515950

Platform (OS and architecture)

leonardo, juwels, hpc2020

Relevant log output

1: [rank1]:[E702 12:22:50.103110475 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52, OpType=ALLREDUCE, NumelIn=11, NumelOut=11, Timeout(ms)=240000) ran for 240000 milliseconds before timing out.
2: [rank2]:[E702 12:22:50.160528660 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=52, OpType=ALLREDUCE, NumelIn=11, NumelOut=11, Timeout(ms)=240000) ran for 240057 milliseconds before timing out.
1: [rank1]:[E702 12:22:50.186106643 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 52 PG status: last enqueued work: 52, last completed work: 51
2: [rank2]:[E702 12:22:50.186148043 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 52 PG status: last enqueued work: 52, last completed work: 51
1: [rank1]:[E702 12:22:50.186169433 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
1: [rank1]:[E702 12:22:50.186183053 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2: [rank2]:[E702 12:22:50.186189443 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
2: [rank2]:[E702 12:22:50.186205244 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
1: [rank1]:[E702 12:22:50.186190253 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
2: [rank2]:[E702 12:22:50.186212574 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
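For whoever picks this up: the watchdog message above suggests enabling the NCCL flight recorder via TORCH_NCCL_TRACE_BUFFER_SIZE so the stack trace of the failed collective is captured on the next run. Below is a minimal sketch, assuming a standard torch.distributed setup launched via torchrun/srun; the buffer size and the 10-minute timeout are only illustrative values, not recommendations.

```python
# Sketch only: raise the NCCL watchdog timeout and enable the flight recorder.
# Env var name comes from the log above; values here are arbitrary examples.
import os
from datetime import timedelta

import torch.distributed as dist

# Keep a trace buffer so the stack trace of a failed collective is recorded
# (the log notes it is disabled when TORCH_NCCL_TRACE_BUFFER_SIZE is zero).
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")

# Raise the watchdog timeout above the 240000 ms seen in the log.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```

Since the variable is read when the process group is created, exporting TORCH_NCCL_TRACE_BUFFER_SIZE in the batch script before the ranks start is probably the safer option than setting it in Python.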

Accompanying data

No response

Organisation

No response
