Description
🐛 Bug
I'm running a training job on 2 nodes in SageMaker, launched with torchrun. The training dataset is a `CombinedStreamingDataset` with `train_weight_factors = [0.8, 0.07, 0.07, 0.07]`. After some number of batches the training stops printing log messages; based on my experiments, the batch at which it stalls is deterministic for a fixed random seed. The NCCL timeout then raises an exception after 30 minutes (log below). The same training code works fine on a single node.
To Reproduce
Use `CombinedStreamingDataset` for the training dataset with `train_weight_factors` not `None` and `iterate_over_all = False`, then launch training with `torchrun` with `num_nodes > 1`. The full torchrun command of the failing job is at the end of the log below, and a minimal sketch of the dataset setup is under "Code sample".
2024-08-23T21:52:39.473Z | pytorch_lightning - INFO - RANK 0 - on_train_batch_end - EPOCH 1 BATCH 70/inf(0.00%): train_loss=0.08913; train_unweighted_loss=0.05355; train_rec=0.77778; train_neg_rec=1.00000; train_prec=1.00000; lr=0.0000499696
| 2024-08-23T21:52:48.476Z | pytorch_lightning - INFO - RANK 0 - on_train_batch_end - EPOCH 1 BATCH 80/inf(0.00%): train_loss=0.07478; train_unweighted_loss=0.09251; train_rec=0.50000; train_neg_rec=0.98684; train_prec=0.80000; lr=0.0000499693
| 2024-08-23T22:22:50.870Z | [rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800038 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800062 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800068 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800083 milliseconds before timing out.
| 2024-08-23T22:22:50.870Z | [rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
| 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.870Z | [rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800038 milliseconds before timing out.
| 2024-08-23T22:22:51.871Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.871Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d9b80a897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.871Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4d9cb03e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4d9cb08c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4d9cb09f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #4: <unknown function> + 0xd3e95 (0x7f4deb021e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.871Z | frame #5: <unknown function> + 0x8609 (0x7f4df4be2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.871Z | frame #6: clone + 0x43 (0x7f4df49ab353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.871Z | terminate called after throwing an instance of 'c10::DistBackendError' what():
| 2024-08-23T22:22:51.871Z | [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800038 milliseconds before timing out.
| 2024-08-23T22:22:51.871Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.871Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d9b80a897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.871Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4d9cb03e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4d9cb08c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4d9cb09f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #4: <unknown function> + 0xd3e95 (0x7f4deb021e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.871Z | frame #5: <unknown function> + 0x8609 (0x7f4df4be2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.871Z | frame #6: clone + 0x43 (0x7f4df49ab353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.871Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.871Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d9b80a897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.871Z | frame #1: <unknown function> + 0xe56473 (0x7f4d9c78b473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.871Z | frame #2: <unknown function> + 0xd3e95 (0x7f4deb021e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.871Z | frame #3: <unknown function> + 0x8609 (0x7f4df4be2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.871Z | frame #4: clone + 0x43 (0x7f4df49ab353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.872Z | [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
| 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb96b255897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb96c54ee12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb96c553c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb96c554f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #4: <unknown function> + 0xd3e95 (0x7fb9baa6ce95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.872Z | frame #5: <unknown function> + 0x8609 (0x7fb9c462d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.872Z | frame #6: clone + 0x43 (0x7fb9c43f6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.872Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800064 milliseconds before timing out.
| 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb96b255897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb96c54ee12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb96c553c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb96c554f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #4: <unknown function> + 0xd3e95 (0x7fb9baa6ce95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.872Z | frame #5: <unknown function> + 0x8609 (0x7fb9c462d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.872Z | frame #6: clone + 0x43 (0x7fb9c43f6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.872Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb96b255897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.872Z | frame #1: <unknown function> + 0xe56473 (0x7fb96c1d6473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #2: <unknown function> + 0xd3e95 (0x7fb9baa6ce95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.872Z | frame #3: <unknown function> + 0x8609 (0x7fb9c462d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.872Z | frame #4: clone + 0x43 (0x7fb9c43f6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 5] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.872Z | [rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800062 milliseconds before timing out.
| 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01c362897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fd01d65be12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd01d660c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd01d661f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.872Z | frame #4: <unknown function> + 0xd3e95 (0x7fd06bb79e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.872Z | frame #5: <unknown function> + 0x8609 (0x7fd07573a609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.872Z | frame #6: clone + 0x43 (0x7fd075503353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.872Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800062 milliseconds before timing out.
| 2024-08-23T22:22:51.872Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.872Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01c362897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.872Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fd01d65be12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fd01d660c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fd01d661f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #4: <unknown function> + 0xd3e95 (0x7fd06bb79e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.873Z | frame #5: <unknown function> + 0x8609 (0x7fd07573a609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.873Z | frame #6: clone + 0x43 (0x7fd075503353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.873Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd01c362897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.873Z | frame #1: <unknown function> + 0xe56473 (0x7fd01d2e3473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #2: <unknown function> + 0xd3e95 (0x7fd06bb79e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.873Z | frame #3: <unknown function> + 0x8609 (0x7fd07573a609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.873Z | frame #4: clone + 0x43 (0x7fd075503353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 4] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.873Z | [rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
| 2024-08-23T22:22:51.873Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c40ea897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.873Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f74c53e3e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f74c53e8c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f74c53e9f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #4: <unknown function> + 0xd3e95 (0x7f7513901e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.873Z | frame #5: <unknown function> + 0x8609 (0x7f751d4c2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.873Z | frame #6: clone + 0x43 (0x7f751d28b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.873Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
| 2024-08-23T22:22:51.873Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c40ea897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.873Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f74c53e3e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f74c53e8c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f74c53e9f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #4: <unknown function> + 0xd3e95 (0x7f7513901e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.873Z | frame #5: <unknown function> + 0x8609 (0x7f751d4c2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.873Z | frame #6: clone + 0x43 (0x7f751d28b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.873Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f74c40ea897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.873Z | frame #1: <unknown function> + 0xe56473 (0x7f74c506b473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.873Z | frame #2: <unknown function> + 0xd3e95 (0x7f7513901e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.873Z | frame #3: <unknown function> + 0x8609 (0x7f751d4c2609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.873Z | frame #4: clone + 0x43 (0x7f751d28b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 70426, last enqueued NCCL work: 70426, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.873Z | [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800068 milliseconds before timing out.
| 2024-08-23T22:22:51.873Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.873Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f483567d897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.873Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4836976e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f483697bc30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f483697cf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f4884e94e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f488ea55609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f488e81e353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.874Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800068 milliseconds before timing out.
| 2024-08-23T22:22:51.874Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f483567d897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.874Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4836976e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f483697bc30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f483697cf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f4884e94e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f488ea55609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f488e81e353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.874Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f483567d897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.874Z | frame #1: <unknown function> + 0xe56473 (0x7f48365fe473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #2: <unknown function> + 0xd3e95 (0x7f4884e94e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.874Z | frame #3: <unknown function> + 0x8609 (0x7f488ea55609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.874Z | frame #4: clone + 0x43 (0x7f488e81e353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.874Z | [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
| 2024-08-23T22:22:51.874Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f787e8c4897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.874Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f787fbbde12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f787fbc2c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f787fbc3f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f78ce0dbe95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f78d7c9c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f78d7a65353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.874Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
| 2024-08-23T22:22:51.874Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f787e8c4897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.874Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f787fbbde12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f787fbc2c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f787fbc3f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #4: <unknown function> + 0xd3e95 (0x7f78ce0dbe95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.874Z | frame #5: <unknown function> + 0x8609 (0x7f78d7c9c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.874Z | frame #6: clone + 0x43 (0x7f78d7a65353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.874Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.874Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f787e8c4897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.874Z | frame #1: <unknown function> + 0xe56473 (0x7f787f845473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.874Z | frame #2: <unknown function> + 0xd3e95 (0x7f78ce0dbe95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.875Z | frame #3: <unknown function> + 0x8609 (0x7f78d7c9c609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.875Z | frame #4: clone + 0x43 (0x7f78d7a65353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
| 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:51.875Z | [rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800083 milliseconds before timing out.
| 2024-08-23T22:22:51.875Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.875Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5afbae7897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.875Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5afcde0e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5afcde5c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5afcde6f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #4: <unknown function> + 0xd3e95 (0x7f5b4b2fee95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.875Z | frame #5: <unknown function> + 0x8609 (0x7f5b54ebf609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.875Z | frame #6: clone + 0x43 (0x7f5b54c88353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.875Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800083 milliseconds before timing out.
| 2024-08-23T22:22:51.875Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:51.875Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5afbae7897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.875Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5afcde0e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5afcde5c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5afcde6f7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #4: <unknown function> + 0xd3e95 (0x7f5b4b2fee95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.875Z | frame #5: <unknown function> + 0x8609 (0x7f5b54ebf609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.875Z | frame #6: clone + 0x43 (0x7f5b54c88353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:51.875Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:51.875Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5afbae7897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:51.875Z | frame #1: <unknown function> + 0xe56473 (0x7f5afca68473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:51.875Z | frame #2: <unknown function> + 0xd3e95 (0x7f5b4b2fee95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:51.875Z | frame #3: <unknown function> + 0x8609 (0x7f5b54ebf609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:51.875Z | frame #4: clone + 0x43 (0x7f5b54c88353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 7] Timeout at NCCL work: 70426, last enqueued NCCL work: 70427, last completed NCCL work: 70425.
| 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
| 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
| 2024-08-23T22:22:52.876Z | [rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
| 2024-08-23T22:22:52.876Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:52.876Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bbadb897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:52.876Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76bcdd4e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.876Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76bcdd9c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.876Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76bcddaf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.876Z | frame #4: <unknown function> + 0xd3e95 (0x7f770b2f2e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:52.876Z | frame #5: <unknown function> + 0x8609 (0x7f7714eb3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:52.876Z | frame #6: clone + 0x43 (0x7f7714c7c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:52.876Z | terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70426, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
| 2024-08-23T22:22:52.876Z | Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
| 2024-08-23T22:22:52.876Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bbadb897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:52.876Z | frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76bcdd4e12 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.876Z | frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76bcdd9c30 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.876Z | frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76bcddaf7c in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.876Z | frame #4: <unknown function> + 0xd3e95 (0x7f770b2f2e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:52.876Z | frame #5: <unknown function> + 0x8609 (0x7f7714eb3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:52.876Z | frame #6: clone + 0x43 (0x7f7714c7c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:52.877Z | Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
| 2024-08-23T22:22:52.877Z | frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76bbadb897 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
| 2024-08-23T22:22:52.877Z | frame #1: <unknown function> + 0xe56473 (0x7f76bca5c473 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
| 2024-08-23T22:22:52.877Z | frame #2: <unknown function> + 0xd3e95 (0x7f770b2f2e95 in /opt/conda/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
| 2024-08-23T22:22:52.877Z | frame #3: <unknown function> + 0x8609 (0x7f7714eb3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
| 2024-08-23T22:22:52.877Z | frame #4: clone + 0x43 (0x7f7714c7c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
| 2024-08-23T22:22:58.878Z | E0823 22:22:58.472000 140029070137152 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 187) of binary: /opt/conda/bin/python
| 2024-08-23T22:22:58.878Z | Traceback (most recent call last):
| 2024-08-23T22:22:58.878Z |   File "/opt/conda/bin/torchrun", line 33, in <module>
| 2024-08-23T22:22:58.879Z |     sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
| 2024-08-23T22:22:58.879Z |   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
| 2024-08-23T22:22:58.879Z |     return f(*args, **kwargs)
| 2024-08-23T22:22:58.879Z |   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
| 2024-08-23T22:22:58.879Z |     run(args)
| 2024-08-23T22:22:58.879Z |   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
| 2024-08-23T22:22:58.879Z |     elastic_launch(
| 2024-08-23T22:22:58.879Z |   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
| 2024-08-23T22:22:58.879Z |     return launch_agent(self._config, self._entrypoint, list(args))
| 2024-08-23T22:22:58.879Z |   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
| 2024-08-23T22:22:58.879Z |     raise ChildFailedError(
| 2024-08-23T22:22:58.879Z | torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
| 2024-08-23T22:22:58.879Z | ====================================================
| 2024-08-23T22:22:58.879Z | train_model.py FAILED
| 2024-08-23T22:22:58.879Z | ----------------------------------------------------
| 2024-08-23T22:22:58.879Z | Failures:
| 2024-08-23T22:22:58.879Z | [1]: time : 2024-08-23_22:22:58 host : algo-2 rank : 1 (local_rank: 1) exitcode : -6 (pid: 188) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 188
| 2024-08-23T22:22:58.879Z | [2]: time : 2024-08-23_22:22:58 host : algo-2 rank : 2 (local_rank: 2) exitcode : -6 (pid: 189) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 189
| 2024-08-23T22:22:58.879Z | [3]: time : 2024-08-23_22:22:58 host : algo-2 rank : 3 (local_rank: 3) exitcode : -6 (pid: 190) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 190
| 2024-08-23T22:22:58.879Z | [4]: time : 2024-08-23_22:22:58 host : algo-2 rank : 4 (local_rank: 4) exitcode : -6 (pid: 191) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 191
| 2024-08-23T22:22:58.880Z | [5]: time : 2024-08-23_22:22:58 host : algo-2 rank : 5 (local_rank: 5) exitcode : -6 (pid: 192) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 192
| 2024-08-23T22:22:58.880Z | [6]: time : 2024-08-23_22:22:58 host : algo-2 rank : 6 (local_rank: 6) exitcode : -6 (pid: 193) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 193
| 2024-08-23T22:22:58.880Z | [7]: time : 2024-08-23_22:22:58 host : algo-2 rank : 7 (local_rank: 7) exitcode : -6 (pid: 194) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 194
| 2024-08-23T22:22:58.880Z | ----------------------------------------------------
| 2024-08-23T22:22:58.880Z | Root Cause (first observed failure):
| 2024-08-23T22:22:58.880Z | [0]: time : 2024-08-23_22:22:58 host : algo-2 rank : 0 (local_rank: 0) exitcode : -6 (pid: 187) error_file: <N/A> traceback : Signal 6 (SIGABRT) received by PID 187
| 2024-08-23T22:22:58.880Z | ====================================================
| 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,373 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
| 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,373 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
| 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,374 sagemaker-training-toolkit ERROR Reporting training FAILURE
| 2024-08-23T22:23:01.881Z | 2024-08-23 22:23:01,374 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
| 2024-08-23T22:23:01.881Z | ExitCode 1
| 2024-08-23T22:23:01.892Z | Command "torchrun --nnodes 2 --nproc_per_node 8 --master_addr algo-2 --master_port 7777 --node_rank 0 train_model.py --accelerator auto --accumulate_grad_batches 1 --attribute-names x y z --batch-size 32 --devices auto --drop-path-rate 0.1 --enable_progress_bar False --fusion-activation leaky_relu --fusion-hidden-features 512 --fusion-type simplefc --global-pool max --gradient_clip_algorithm norm --gradient_clip_val 5.0 --image-encoder convnextv2_tiny --image-size 384 --image-transform-backend albumentations --layer-decay 0.9 --log-dir /opt/ml/model/logs/ --log_every_n_steps 10 --lr 0.00005 --max_steps 123046 --min-lr 0.00000001 --min-precision 0.99 --momentum 0.9 --mu 0.3 --num-classes 3 --num-workers 4 --num_nodes 2 --opt adamw --output-dir /opt/ml/model/ --pos-weight 0.3 --precision bf16-mixed --reg-mode partial --save-top-k 3 --sched cosine_decay --sched-on-updates 1 --strategy ddp_find_unused_parameters_true --sync_batchnorm True --task multilabel --tensorboard-dir /opt/ml/output/tensorboard/ --test-input .../optimized-data/test --text-encoder google-bert/bert-base-multilingual-uncased --train-inputs negatives,x,y,z --train-inputs-s3-prefix .../optimized-data/ --train-weight-factors 0.8,0.07,0.07,0.07 --val-input .../optimized-data/val --val_check_interval 12304 --warmup-steps 1000 --weight-decay 0.005"
| 2024-08-23T22:23:01.892Z | 2024-08-23 22:23:01,374 sagemaker-training-toolkit ERROR Encountered exit_code 1
Code sample
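I can't share the full training script, but the training-dataset construction is roughly the sketch below. It assumes litdata's `StreamingDataset` / `CombinedStreamingDataset` / `StreamingDataLoader` API; the input directories and variable names are placeholders, while the weights, batch size, and worker count match the failing run.

```python
# Minimal sketch of the training-dataset setup (paths and names are placeholders).
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

# One optimized streaming dataset per training input; the S3 prefix is elided
# here just as it is in the torchrun command above.
train_dirs = [
    ".../optimized-data/negatives",
    ".../optimized-data/x",
    ".../optimized-data/y",
    ".../optimized-data/z",
]
train_weight_factors = [0.8, 0.07, 0.07, 0.07]

datasets = [StreamingDataset(input_dir=d, shuffle=True) for d in train_dirs]

# Weighted sampling across the four datasets; iterate_over_all=False, so the
# combined dataset samples according to the weights instead of exhausting
# every underlying dataset.
train_dataset = CombinedStreamingDataset(
    datasets=datasets,
    weights=train_weight_factors,
    iterate_over_all=False,
)

train_loader = StreamingDataLoader(train_dataset, batch_size=32, num_workers=4)
```

The loader is then handed to a `pytorch_lightning` Trainer configured as in the command above (`--strategy ddp_find_unused_parameters_true --num_nodes 2`, 8 devices per node).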
Expected behavior
Training should not soft-lock in the middle of an epoch; multi-node training should keep making progress just as the single-node run does.
Environment
- PyTorch Version (e.g., 1.0): 2.3.0 (per the container tag and the torch==2.3.0 shown in the torchrun traceback)
- OS (e.g., Linux): Linux
- How you installed PyTorch (`conda`, `pip`, source): SageMaker prebuilt deep learning container (763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker, see https://github.com/aws/deep-learning-containers/blob/master/available_images.md)
- Build command you used (if compiling from source):
- Python version: 3.11
- CUDA/cuDNN version: 12.1
- GPU models and configuration: A10G (g5.48xlarge instance type in AWS)
- Any other relevant information:
Additional context
If you have any other suggestions as to why multi-node training with `CombinedStreamingDataset` would fail like this, any help is appreciated.