StreamingDataset causes NCCL timeout when using multiple nodes #340

Open
@hubenjm

Description

🐛 Bug

I'm running a training job on 2 nodes in SageMaker, launched with torchrun. The training dataset is a CombinedStreamingDataset with train_weight_factors = [0.8, 0.07, 0.07, 0.07]. Training stops printing log messages after a fixed number of batches. Based on my experiments, the point where it stops is deterministic when the seed is fixed, and I assume it depends on the seed. After 30 minutes, the NCCL timeout raises an exception. The same training code works fine on a single node.
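One way to read the watchdog output in the log below is to compare the per-rank counters on the "Timeout at NCCL work" lines. In this excerpt, ranks 2 and 3 report `last enqueued NCCL work: 70427` while ranks 0, 1, 4, 5 and 6 report 70426, which may suggest that the ranks did not all issue the same sequence of collectives. The sketch below groups ranks by that counter. The function name `ranks_by_last_enqueued` is hypothetical, and the sample contains four lines copied from the log.

```python
import re
from collections import defaultdict

# Matches the watchdog summary line that ProcessGroupNCCL prints per rank.
SUMMARY = re.compile(
    r"\[PG \d+ Rank (?P<rank>\d+)\] Timeout at NCCL work: (?P<timeout>\d+), "
    r"last enqueued NCCL work: (?P<enqueued>\d+), "
    r"last completed NCCL work: (?P<completed>\d+)"
)


def ranks_by_last_enqueued(log_text):
    """Group rank ids by the 'last enqueued NCCL work' counter they report."""
    groups = defaultdict(list)
    for match in SUMMARY.finditer(log_text):
        groups[int(match["enqueued"])].append(int(match["rank"]))
    return {seq: sorted(ranks) for seq, ranks in groups.items()}


# Four summary lines copied from the log in this issue.
sample = """
[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
"""

for seq, ranks in sorted(ranks_by_last_enqueued(sample).items()):
    print(f"last enqueued {seq}: ranks {ranks}")
```

Running the same function over the full log from every node would show whether the split follows node boundaries or something else.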

To Reproduce

Use a CombinedStreamingDataset as the training dataset, with train_weight_factors not None and iterate_over_all = False. Launch training with torchrun and num_nodes > 1.
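A minimal sketch of the setup described above, not my actual training code. It assumes that `StreamingDataset`, `CombinedStreamingDataset` and `StreamingDataLoader` come from the `litdata` package and that `train_weight_factors` is passed as the `weights` argument. The input directories, batch size, worker count and the name `build_train_loader` are hypothetical.

```python
# Weights from the report, one per dataset.
TRAIN_WEIGHT_FACTORS = [0.8, 0.07, 0.07, 0.07]

# Hypothetical dataset locations; replace with real optimized datasets.
INPUT_DIRS = [
    "s3://my-bucket/train_a",
    "s3://my-bucket/train_b",
    "s3://my-bucket/train_c",
    "s3://my-bucket/train_d",
]


def build_train_loader(batch_size=32, num_workers=4, seed=42):
    """Build a weighted combined loader with iterate_over_all disabled."""
    # Imported here so the constants above can be inspected without litdata.
    from litdata import (
        CombinedStreamingDataset,
        StreamingDataLoader,
        StreamingDataset,
    )

    datasets = [StreamingDataset(input_dir=path) for path in INPUT_DIRS]
    combined = CombinedStreamingDataset(
        datasets=datasets,
        seed=seed,
        weights=TRAIN_WEIGHT_FACTORS,
        iterate_over_all=False,
    )
    return StreamingDataLoader(
        combined, batch_size=batch_size, num_workers=num_workers
    )
```

The training script would call `build_train_loader()` on every rank and be launched with torchrun on more than one node. The log in this report shows 8 ranks across the 2 nodes.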


2024-08-23T21:52:39.473Z | pytorch_lightning - INFO - RANK 0 - on_train_batch_end - EPOCH 1 BATCH 70/inf(0.00%): train_loss=0.08913; train_unweighted_loss=0.05355; train_rec=0.77778; train_neg_rec=1.00000; train_prec=1.00000; lr=0.0000499696
-- | --
  | 2024-08-23T21:52:48.476Z | pytorch_lightning - INFO - RANK 0 - on_train_batch_end - EPOCH 1 BATCH 80/inf(0.00%): train_loss=0.07478; train_unweighted_loss=0.09251; train_rec=0.50000; train_neg_rec=0.98684; train_prec=0.80000; lr=0.0000499693
  | 2024-08-23T22:22:50.870Z | [rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800038 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800062 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800068 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800083 milliseconds before timing out.
  | 2024-08-23T22:22:50.870Z | [rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
  | 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800038 milliseconds before timing out.
  | 2024-08-23T22:22:51.871Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.871Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d9b80a897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.871Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4d9cb03e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4d9cb08c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4d9cb09f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #4: <unknown function> + 0xd3e95 (0x7f4deb021e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.871Z | frame #5: <unknown function> + 0x8609 (0x7f4df4be2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.871Z | frame #6: clone + 0x43 (0x7f4df49ab353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.871Z | terminate called after throwing an instance of 'c10::DistBackendError' what():
  | 2024-08-23T22:22:51.871Z | [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800038 milliseconds before timing out.
  | 2024-08-23T22:22:51.871Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.871Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d9b80a897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.871Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4d9cb03e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4d9cb08c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4d9cb09f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #4: <unknown function> + 0xd3e95 (0x7f4deb021e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.871Z | frame #5: <unknown function> + 0x8609 (0x7f4df4be2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.871Z | frame #6: clone + 0x43 (0x7f4df49ab353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.871Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.871Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d9b80a897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.871Z | frame #1: <unknown function> + 0xe56473 (0x7f4d9c78b473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.871Z | frame #2: <unknown function> + 0xd3e95 (0x7f4deb021e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.871Z | frame #3: <unknown function> + 0x8609 (0x7f4df4be2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.871Z | frame #4: clone + 0x43 (0x7f4df49ab353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
  | 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb96b255897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb96c54ee12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb96c553c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb96c554f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #4: <unknown function> + 0xd3e95 (0x7fb9baa6ce95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.872Z | frame #5: <unknown function> + 0x8609 (0x7fb9c462d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.872Z | frame #6: clone + 0x43 (0x7fb9c43f6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.872Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
  | 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb96b255897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb96c54ee12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb96c553c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb96c554f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #4: <unknown function> + 0xd3e95 (0x7fb9baa6ce95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.872Z | frame #5: <unknown function> + 0x8609 (0x7fb9c462d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.872Z | frame #6: clone + 0x43 (0x7fb9c43f6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.872Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb96b255897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.872Z | frame #1: <unknown function> + 0xe56473 (0x7fb96c1d6473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #2: <unknown function> + 0xd3e95 (0x7fb9baa6ce95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.872Z | frame #3: <unknown function> + 0x8609 (0x7fb9c462d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.872Z | frame #4: clone + 0x43 (0x7fb9c43f6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 5] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800062 milliseconds before timing out.
  | 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01c362897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fd01d65be12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd01d660c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd01d661f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.872Z | frame #4: <unknown function> + 0xd3e95 (0x7fd06bb79e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.872Z | frame #5: <unknown function> + 0x8609 (0x7fd07573a609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.872Z | frame #6: clone + 0x43 (0x7fd075503353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.872Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800062 milliseconds before timing out.
  | 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01c362897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fd01d65be12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd01d660c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd01d661f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #4: <unknown function> + 0xd3e95 (0x7fd06bb79e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.873Z | frame #5: <unknown function> + 0x8609 (0x7fd07573a609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.873Z | frame #6: clone + 0x43 (0x7fd075503353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.873Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01c362897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.873Z | frame #1: <unknown function> + 0xe56473 (0x7fd01d2e3473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #2: <unknown function> + 0xd3e95 (0x7fd06bb79e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.873Z | frame #3: <unknown function> + 0x8609 (0x7fd07573a609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.873Z | frame #4: clone + 0x43 (0x7fd075503353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 4] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
  | 2024-08-23T22:22:51.873Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c40ea897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.873Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f74c53e3e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f74c53e8c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f74c53e9f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #4: <unknown function> + 0xd3e95 (0x7f7513901e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.873Z | frame #5: <unknown function> + 0x8609 (0x7f751d4c2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.873Z | frame #6: clone + 0x43 (0x7f751d28b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.873Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
  | 2024-08-23T22:22:51.873Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c40ea897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.873Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f74c53e3e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f74c53e8c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f74c53e9f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #4: <unknown function> + 0xd3e95 (0x7f7513901e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.873Z | frame #5: <unknown function> + 0x8609 (0x7f751d4c2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.873Z | frame #6: clone + 0x43 (0x7f751d28b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.873Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c40ea897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.873Z | frame #1: <unknown function> + 0xe56473 (0x7f74c506b473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.873Z | frame #2: <unknown function> + 0xd3e95 (0x7f7513901e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.873Z | frame #3: <unknown function> + 0x8609 (0x7f751d4c2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.873Z | frame #4: clone + 0x43 (0x7f751d28b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800068 milliseconds before timing out.
  | 2024-08-23T22:22:51.873Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f483567d897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.873Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4836976e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f483697bc30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f483697cf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f4884e94e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f488ea55609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f488e81e353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.874Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800068 milliseconds before timing out.
  | 2024-08-23T22:22:51.874Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f483567d897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.874Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4836976e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f483697bc30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f483697cf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f4884e94e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f488ea55609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f488e81e353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.874Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f483567d897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.874Z | frame #1: <unknown function> + 0xe56473 (0x7f48365fe473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #2: <unknown function> + 0xd3e95 (0x7f4884e94e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.874Z | frame #3: <unknown function> + 0x8609 (0x7f488ea55609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.874Z | frame #4: clone + 0x43 (0x7f488e81e353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
  | 2024-08-23T22:22:51.874Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f787e8c4897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.874Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f787fbbde12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f787fbc2c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f787fbc3f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f78ce0dbe95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f78d7c9c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f78d7a65353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.874Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
  | 2024-08-23T22:22:51.874Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f787e8c4897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.874Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f787fbbde12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f787fbc2c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f787fbc3f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f78ce0dbe95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f78d7c9c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f78d7a65353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.874Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f787e8c4897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.874Z | frame #1: <unknown function> + 0xe56473 (0x7f787f845473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.874Z | frame #2: <unknown function> + 0xd3e95 (0x7f78ce0dbe95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.875Z | frame #3: <unknown function> + 0x8609 (0x7f78d7c9c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.875Z | frame #4: clone + 0x43 (0x7f78d7a65353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
  | 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800083 milliseconds before timing out.
  | 2024-08-23T22:22:51.875Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.875Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5afbae7897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.875Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5afcde0e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5afcde5c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5afcde6f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #4: <unknown function> + 0xd3e95 (0x7f5b4b2fee95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.875Z | frame #5: <unknown function> + 0x8609 (0x7f5b54ebf609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.875Z | frame #6: clone + 0x43 (0x7f5b54c88353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.875Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800083 milliseconds before timing out.
  | 2024-08-23T22:22:51.875Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:51.875Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5afbae7897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.875Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5afcde0e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5afcde5c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5afcde6f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #4: <unknown function> + 0xd3e95 (0x7f5b4b2fee95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.875Z | frame #5: <unknown function> + 0x8609 (0x7f5b54ebf609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.875Z | frame #6: clone + 0x43 (0x7f5b54c88353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:51.875Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:51.875Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5afbae7897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:51.875Z | frame #1: <unknown function> + 0xe56473 (0x7f5afca68473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:51.875Z | frame #2: <unknown function> + 0xd3e95 (0x7f5b4b2fee95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:51.875Z | frame #3: <unknown function> + 0x8609 (0x7f5b54ebf609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:51.875Z | frame #4: clone + 0x43 (0x7f5b54c88353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 7] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
  | 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  | 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
  | 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
  | 2024-08-23T22:22:52.876Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:52.876Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bbadb897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:52.876Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76bcdd4e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.876Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76bcdd9c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.876Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76bcddaf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.876Z | frame #4: <unknown function> + 0xd3e95 (0x7f770b2f2e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:52.876Z | frame #5: <unknown function> + 0x8609 (0x7f7714eb3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:52.876Z | frame #6: clone + 0x43 (0x7f7714c7c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:52.876Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
  | 2024-08-23T22:22:52.876Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
  | 2024-08-23T22:22:52.876Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bbadb897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:52.876Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76bcdd4e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.876Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76bcdd9c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.876Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76bcddaf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.876Z | frame #4: <unknown function> + 0xd3e95 (0x7f770b2f2e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:52.876Z | frame #5: <unknown function> + 0x8609 (0x7f7714eb3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:52.876Z | frame #6: clone + 0x43 (0x7f7714c7c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:52.877Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
  | 2024-08-23T22:22:52.877Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bbadb897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
  | 2024-08-23T22:22:52.877Z | frame #1: <unknown function> + 0xe56473 (0x7f76bca5c473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
  | 2024-08-23T22:22:52.877Z | frame #2: <unknown function> + 0xd3e95 (0x7f770b2f2e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
  | 2024-08-23T22:22:52.877Z | frame #3: <unknown function> + 0x8609 (0x7f7714eb3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
  | 2024-08-23T22:22:52.877Z | frame #4: clone + 0x43 (0x7f7714c7c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
  | 2024-08-23T22:22:58.878Z | E0823 22:22:58.472000 140029070137152 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 187) of binary: /opt/conda/bin/python
  | 2024-08-23T22:22:58.878Z | Traceback (most recent call last):
  | 2024-08-23T22:22:58.878Z | File "/opt/conda/bin/torchrun", line 33, in <module>
  | 2024-08-23T22:22:58.879Z | sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
  | 2024-08-23T22:22:58.879Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
  | 2024-08-23T22:22:58.879Z | return f(*args, **kwargs)
  | 2024-08-23T22:22:58.879Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
  | 2024-08-23T22:22:58.879Z | run(args)
  | 2024-08-23T22:22:58.879Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
  | 2024-08-23T22:22:58.879Z | elastic_launch(
  | 2024-08-23T22:22:58.879Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
  | 2024-08-23T22:22:58.879Z | return launch_agent(self._config, self._entrypoint, list(args))
  | 2024-08-23T22:22:58.879Z | File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
  | 2024-08-23T22:22:58.879Z | raise ChildFailedError(
  | 2024-08-23T22:22:58.879Z | torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
  | 2024-08-23T22:22:58.879Z | ====================================================
  | 2024-08-23T22:22:58.879Z | train_model.py FAILED
  | 2024-08-23T22:22:58.879Z | ----------------------------------------------------
  | 2024-08-23T22:22:58.879Z | Failures:
  | 2024-08-23T22:22:58.879Z | [1]: time : 2024-08-23_22:22:58 host : algo-2 rank : 1 (local_rank: 1) exitcode : -6 (pid: 188) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 188
  | 2024-08-23T22:22:58.879Z | [2]: time : 2024-08-23_22:22:58 host : algo-2 rank : 2 (local_rank: 2) exitcode : -6 (pid: 189) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 189
  | 2024-08-23T22:22:58.879Z | [3]: time : 2024-08-23_22:22:58 host : algo-2 rank : 3 (local_rank: 3) exitcode : -6 (pid: 190) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 190
  | 2024-08-23T22:22:58.879Z | [4]: time : 2024-08-23_22:22:58 host : algo-2 rank : 4 (local_rank: 4) exitcode : -6 (pid: 191) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 191
  | 2024-08-23T22:22:58.880Z | [5]: time : 2024-08-23_22:22:58 host : algo-2 rank : 5 (local_rank: 5) exitcode : -6 (pid: 192) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 192
  | 2024-08-23T22:22:58.880Z | [6]: time : 2024-08-23_22:22:58 host : algo-2 rank : 6 (local_rank: 6) exitcode : -6 (pid: 193) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 193
  | 2024-08-23T22:22:58.880Z | [7]: time : 2024-08-23_22:22:58 host : algo-2 rank : 7 (local_rank: 7) exitcode : -6 (pid: 194) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 194
  | 2024-08-23T22:22:58.880Z | ----------------------------------------------------
  | 2024-08-23T22:22:58.880Z | Root Cause (first observed failure):
  | 2024-08-23T22:22:58.880Z | [0]: time : 2024-08-23_22:22:58 host : algo-2 rank : 0 (local_rank: 0) exitcode : -6 (pid: 187) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 187
  | 2024-08-23T22:22:58.880Z | ====================================================
  | 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,373 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
  | 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,373 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
  | 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,374 sagemaker-training-toolkit ERROR Reporting training FAILURE
  | 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,374 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
  | 2024-08-23T22:23:01.881Z | ExitCode 1
  | 2024-08-23T22:23:01.892Z | Command "torchrun --nnodes 2 --nproc_per_node 8 --master_addr algo-2 --master_port 7777 --node_rank 0 train_model.py --accelerator auto --accumulate_grad_batches 1 --attribute-names x y z --batch-size 32 --devices auto --drop-path-rate 0.1 --enable_progress_bar False --fusion-activation leaky_relu --fusion-hidden-features 512 --fusion-type simplefc --global-pool max --gradient_clip_algorithm norm --gradient_clip_val 5.0 --image-encoder convnextv2_tiny --image-size 384 --image-transform-backend albumentations --layer-decay 0.9 --log-dir /opt/ml/model/logs/ --log_every_n_steps 10 --lr 0.00005 --max_steps 123046 --min-lr 0.00000001 --min-precision 0.99 --momentum 0.9 --mu 0.3 --num-classes 3 --num-workers 4 --num_nodes 2 --opt adamw --output-dir /opt/ml/model/ --pos-weight 0.3 --precision bf16-mixed --reg-mode partial --save-top-k 3 --sched cosine_decay --sched-on-updates 1 --strategy ddp_find_unused_parameters_true --sync_batchnorm True --task multilabel --tensorboard-dir /opt/ml/output/tensorboard/ --test-input .../optimized-data/test --text-encoder google-bert/bert-base-multilingual-uncased --train-inputs negatives,crop,pack,prop --train-inputs-s3-prefix .../optimized-data/ --train-weight-factors 0.8,0.07,0.07,0.07 --val-input .../optimized-data/val --val_check_interval 12304 --warmup-steps 1000 --weight-decay 0.005"
  | 2024-08-23T22:23:01.892Z | 2024-08-23 22:23:01,374 sagemaker-training-toolkit ERROR Encountered exit_code 1
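The watchdog lines in the log above report, per rank, which NCCL work item timed out and which was last enqueued and completed. A minimal sketch for pulling those numbers out of a saved log follows, in case it helps spot a rank that is out of step with the others. The function names `summarize_timeouts` and `ranks_disagree` are hypothetical helpers, and the regular expression only assumes the line format shown in this log.

```python
import re

# Matches watchdog lines of the form seen in the log above, e.g.
# "[PG 0 Rank 2] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427,
#  last completed NCCL work: 70425."
TIMEOUT_RE = re.compile(
    r"\[PG \d+ Rank (?P<rank>\d+)\] Timeout at NCCL work: (?P<timed_out>\d+), "
    r"last enqueued NCCL work: (?P<enqueued>\d+), "
    r"last completed NCCL work: (?P<completed>\d+)"
)


def summarize_timeouts(log_text):
    """Return {rank: {"timed_out": n, "enqueued": n, "completed": n}}."""
    summary = {}
    for match in TIMEOUT_RE.finditer(log_text):
        summary[int(match["rank"])] = {
            key: int(match[key]) for key in ("timed_out", "enqueued", "completed")
        }
    return summary


def ranks_disagree(summary):
    """True if the reporting ranks do not all show the same sequence numbers."""
    distinct = {tuple(sorted(counts.items())) for counts in summary.values()}
    return len(distinct) > 1


if __name__ == "__main__":
    # Two lines copied from the log above (ranks 2 and 7).
    sample_log = (
        "[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 70426, "
        "last enqueued NCCL work: 70427, last completed NCCL work: 70425.\n"
        "[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 7] Timeout at NCCL work: 70426, "
        "last enqueued NCCL work: 70427, last completed NCCL work: 70425.\n"
    )
    summary = summarize_timeouts(sample_log)
    for rank in sorted(summary):
        print(rank, summary[rank])
    print("ranks disagree:", ranks_disagree(summary))
```

Only ranks that reached the collective and timed out emit these lines, so a rank that is missing from the summary may be the one that never joined the all-reduce.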



Code sample

Expected behavior

Training should not softlock (hang) in the middle of an epoch.

Environment

  • PyTorch Version (e.g., 1.0): 2.3.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): SageMaker prebuilt deep learning container (763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker, see https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version: 12.1
  • GPU models and configuration: A10G (g5.48xlarge instance type in AWS)
  • Any other relevant information:

Additional context

If you have any suggestions about why multi-node training with CombinedStreamingDataset would hang like this, any help is appreciated.
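One hypothesis that would fit these symptoms, though nothing in this report confirms it, is that the ranks end up with different numbers of batches: a rank that runs out early stops calling the collective while the others wait in ALLREDUCE until the watchdog fires. The toy model below is only an illustration of that idea in plain Python and is not litdata's implementation. `samples_until_exhausted` and the shard sizes are made up for the example; it draws from weighted sources and stops as soon as the chosen source is empty.

```python
import random


def samples_until_exhausted(shard_sizes, weights, seed):
    """Count weighted draws made before a draw hits an empty source.

    Toy model only: shard_sizes[i] is how many samples source i holds on
    this rank, and weights[i] is its sampling weight.
    """
    rng = random.Random(seed)
    remaining = list(shard_sizes)
    drawn = 0
    while True:
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if remaining[idx] == 0:
            return drawn  # stop at the first exhausted source
        remaining[idx] -= 1
        drawn += 1


if __name__ == "__main__":
    weights = [0.8, 0.07, 0.07, 0.07]  # same factors as in this report
    # Hypothetical per-rank shard sizes; rank 1 holds slightly less data.
    per_rank_shards = {
        0: [1000, 90, 90, 90],
        1: [1000, 80, 85, 90],
    }
    for rank, shards in per_rank_shards.items():
        print(f"rank {rank}: {samples_until_exhausted(shards, weights, seed=42)} samples")
```

With the same seed every rank sees the same sequence of source choices, so the stopping point is reproducible from run to run, yet a rank holding a smaller shard can never outlast one holding a larger shard. If the counts differ between ranks in a real job, the shorter rank leaves the training loop first and the remaining ranks block in the next collective.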

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), help wanted (Extra attention is needed)
