Description
When I use "pretrain_llama_distributed.sh" to train, it works with tp=2 and pp=2. However, when I change pp to 1, it reports the following errors.
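A note on reading the log below: the `srcIndex < srcSelectDimSize` assertion is raised by PyTorch's CUDA index-select kernel when an index is greater than or equal to the size of the dimension being indexed. One common cause in training scripts is a token id that is out of range for the embedding table. Whether that is the cause here is an assumption; the log does not confirm it. A minimal sketch of a host-side check that could be run on a batch before the embedding lookup follows. The helper name `find_out_of_range_ids` and the sample values are hypothetical and are not part of Megatron-DeepSpeed.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    token_ids: Iterable[int], table_rows: int
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that cannot index a table with `table_rows` rows."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if tok < 0 or tok >= table_rows
    ]


# Hypothetical values for illustration only; they are not taken from the log.
batch = [1, 17, 31999, 32000, 5]
bad = find_out_of_range_ids(batch, table_rows=32000)
print(bad)  # [(3, 32000)]
```

If such a check reports any entries, comparing the tokenizer's vocabulary size against the number of rows in the embedding table on each rank would be a reasonable next step.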
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00027251243591308594 seconds
[2023-08-11 08:57:17,337] [INFO] [engine.py:83:__init__] CONFIG: micro_batches=8 micro_batch_size=4
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=0 STAGE=0 LAYERS=5 [0, 5) STAGE_PARAMS=217329664 (217.330M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=2 STAGE=2 LAYERS=3 [8, 11) STAGE_PARAMS=151793664 (151.794M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=3 STAGE=3 LAYERS=3 [11, 14) STAGE_PARAMS=151793664 (151.794M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=6 STAGE=6 LAYERS=3 [20, 23) STAGE_PARAMS=151793664 (151.794M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=4 STAGE=4 LAYERS=3 [14, 17) STAGE_PARAMS=151793664 (151.794M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=7 STAGE=7 LAYERS=6 [23, 29) STAGE_PARAMS=217331712 (217.332M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=5 STAGE=5 LAYERS=3 [17, 20) STAGE_PARAMS=151793664 (151.794M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,169] [INFO] [engine.py:138:__init__] RANK=1 STAGE=1 LAYERS=3 [5, 8) STAGE_PARAMS=151793664 (151.794M) TOTAL_PARAMS=1345423360 (1345.423M) UNIQUE_PARAMS=1345423360 (1345.423M)
[2023-08-11 08:57:19,787] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2023-08-11 08:57:19,787] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2023-08-11 08:57:19,787] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2023-08-11 08:57:19,787] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
WARNING: could not find the metadata file ./tmp
will not load any checkpoints and will start from random
[2023-08-11 08:57:19,787] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2023-08-11 08:57:19,788] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2023-08-11 08:57:19,788] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
[2023-08-11 08:57:19,788] [WARNING] [engine.py:2595:load_checkpoint] Unable to find latest file at ./tmp/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.42, 0.64)
[after model, optimizer, and learning rate scheduler are built] datetime: 2023-08-11 08:57:19
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 8000000
validation: 80320
test: 320
building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
finished creating indexed dataset in 0.000227 seconds
number of documents: 52049
dataset split:
train:
document indices in [0, 49395) total of 49395 documents
validation:
document indices in [49395, 51997) total of 2602 documents
test:
document indices in [51997, 52049) total of 52 documents
loading doc-idx mapping from /data/datasets/test_pretrain_data1/index-cache/ab50dd08d4a82e3719d634b06b0a9a3b_doc_idx.npy
loading sample-idx mapping from /data/datasets/test_pretrain_data1/index-cache/ab50dd08d4a82e3719d634b06b0a9a3b_sample_idx.npy
loading shuffle-idx mapping from /data/datasets/test_pretrain_data1/index-cache/ab50dd08d4a82e3719d634b06b0a9a3b_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 8042705
total number of epochs: 2035
loading doc-idx mapping from /data/datasets/test_pretrain_data1/index-cache/7ced05b660798319657a6d4fec0584e9_doc_idx.npy
loading sample-idx mapping from /data/datasets/test_pretrain_data1/index-cache/7ced05b660798319657a6d4fec0584e9_sample_idx.npy
loading shuffle-idx mapping from /data/datasets/test_pretrain_data1/index-cache/7ced05b660798319657a6d4fec0584e9_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 80746
total number of epochs: 390
loading doc-idx mapping from /data/datasets/test_pretrain_data1/index-cache/1b6d68d05dc6e63cf785020d31e4ba95_doc_idx.npy
loading sample-idx mapping from /data/datasets/test_pretrain_data1/index-cache/1b6d68d05dc6e63cf785020d31e4ba95_sample_idx.npy
loading shuffle-idx mapping from /data/datasets/test_pretrain_data1/index-cache/1b6d68d05dc6e63cf785020d31e4ba95_shuffle_idx.npy
loaded indexed file in 0.050 seconds
total number of samples: 326
total number of epochs: 77
building indices for blendable datasets ...
sample ratios:
dataset 0, input: 1, achieved: 1
elapsed time for building blendable dataset indices: 0.04 (sec)
size of blendable dataset: 8040000 samples
building indices for blendable datasets ...
sample ratios:
dataset 0, input: 1, achieved: 1
elapsed time for building blendable dataset indices: 0.00 (sec)
size of blendable dataset: 80722 samples
building indices for blendable datasets ...
sample ratios:
dataset 0, input: 1, achieved: 1
elapsed time for building blendable dataset indices: 0.00 (sec)
size of blendable dataset: 322 samples
finished creating GPT datasets ...
[after dataloaders are built] datetime: 2023-08-11 08:57:22
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (3420.94, 3421.25)
train/valid/test-data-iterators-setup ..........: (2519.59, 2571.22)
training ...
[before the start of training step] datetime: 2023-08-11 08:57:22
torch.Size([2048, 4, 2048])
torch.Size([2048, 4, 2048]) torch.Size([6144, 2048])
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [615,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "pretrain_gpt.py", line 342, in <module>
pretrain(train_valid_test_datasets_provider,
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/training.py", line 198, in pretrain
iteration = train(forward_step_func,
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/training.py", line 1123, in train
train_step(forward_step_func,
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/training.py", line 639, in train_step
loss = model[0].train_batch(data_iter=data_iterator)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
self._exec_schedule(sched)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 627, in _exec_forward_pass
outputs = super().forward(inputs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 334, in forward
x = func(forward_input)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 327, in exec_func
inputs = layer(inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/model/transformer.py", line 1321, in forward
return super().forward(hidden_states, attention_mask, **kwargs, rotary_pos_emb=rotary_pos_emb)[0]
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/model/transformer.py", line 1167, in forward
self.self_attention(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/model/transformer.py", line 629, in forward
mixed_x_layer, _ = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py", line 598, in forward
output_parallel = linear_with_grad_accumulation_and_async_allreduce(
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py", line 425, in linear_with_grad_accumulation_and_async_allreduce
return LinearWithGradAccumulationAndAsyncCommunication.apply(*args)
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
return fwd(*args, **kwargs)
File "/data/guoqiang/models/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py", line 244, in forward
print(total_input.shape,weight.shape, weight.t())
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 427, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor_str.py", line 115, in __init__
nonzero_finite_vals = torch.masked_select(
RuntimeError: numel: integer multiplication overflow
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f27d45db457 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f27d45a53ec in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f27ff64fc64 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e0dc (0x7f27ff6270dc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f27ff62a054 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d6e23 (0x7f282a514e23 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f27d45bb9e0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f27d45bbaf9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x734c68 (0x7f282a772c68 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f282a772f85 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: /opt/conda/bin/python() [0x4e0970]
frame #11: /opt/conda/bin/python() [0x4f1848]
frame #12: /opt/conda/bin/python() [0x4f1831]
frame #13: /opt/conda/bin/python() [0x4c9370]
frame #14: PyDict_SetItemString + 0x52 (0x581af2 in /opt/conda/bin/python)
frame #15: PyImport_Cleanup + 0x93 (0x5a6d03 in /opt/conda/bin/python)
frame #16: Py_FinalizeEx + 0x71 (0x5a5e31 in /opt/conda/bin/python)
frame #17: Py_RunMain + 0x112 (0x5a1b02 in /opt/conda/bin/python)
frame #18: Py_BytesMain + 0x39 (0x579ef9 in /opt/conda/bin/python)
frame #19: __libc_start_main + 0xf3 (0x7f284c25d083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: /opt/conda/bin/python() [0x579dad]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142438 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142439 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142440 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142441 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142442 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142443 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 142444 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 142437) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
pretrain_gpt.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-08-11_08:57:25
host : deepspeed
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 142437)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 142437
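A note on reading the log above: the `srcIndex < srcSelectDimSize` assertion is raised by an index-select kernel, which usually means some index (for example a token ID passed to an embedding lookup) is outside the table being indexed. Because CUDA errors are reported asynchronously, the Python traceback ending at the `print` call in `layers.py` may not be where the bad index was actually used; rerunning with `CUDA_LAUNCH_BLOCKING=1`, as the message suggests, should give a more reliable location. The log does not prove that the cause is an out-of-range token ID, so the following is only a minimal sketch for checking that hypothesis. The helper name `find_out_of_range_ids` and the sample values are hypothetical and not part of Megatron-DeepSpeed.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that fall outside [0, vocab_size)."""
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if tid < 0 or tid >= vocab_size
    ]


# Hypothetical batch and vocabulary size, for illustration only.
batch = [1, 15, 31999, 32000, 7]
bad = find_out_of_range_ids(batch, vocab_size=32000)
print(bad)  # [(3, 32000)]
```

With a real batch, the same check could be run on the flattened token tensor (for example via `tokens.flatten().tolist()`) before the forward pass, comparing against the vocabulary size the model's embedding was actually built with.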