Hi there, thanks for sharing this interesting project! I am running into trouble during the Self-Supervised Instruction Tuning stage. Specifically, the run fails with the following error:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [29074,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed. terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
Traceback (most recent call last):
File "graphgpt/train/train_mem.py", line 15, in <module>
train()
File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
trainer.train()
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
outputs = self.model(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 277, in forward
return super(GraphLlamaModel, self).forward(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 912, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/train/llama_flash_attn_monkey_patch.py", line 88, in forward
output_unpad = flash_attn_unpadded_qkvpacked_func(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 256, in flash_attn_unpadded_qkvpacked_func
return FlashAttnQKVPackedFunc.apply(qkv, cu_seqlens, max_seqlen, dropout_p, softmax_scale,
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 59, in forward
qkv[:, 0], qkv[:, 1], qkv[:, 2], torch.empty_like(qkv[:, 0]), cu_seqlens, cu_seqlens,
RuntimeError: CUDA error: device-side assert triggered
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510118 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510120 closing signal SIGTERM
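As far as I can tell, the assertion comes from a scatter/gather kernel, which usually means some index (for example a token id) falls outside the tensor it indexes into. Below is the minimal sanity check I plan to run to rule out a tokenizer/embedding mismatch; it is only a sketch, and `model_path` is a placeholder for the base checkpoint I load, not a path from this repo:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder: replace with the base checkpoint actually used for tuning.
model_path = "./checkpoints/base-llm"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Number of rows in the input embedding table.
vocab_rows = model.get_input_embeddings().weight.shape[0]
print("tokenizer size:", len(tokenizer), "embedding rows:", vocab_rows)

# Any input id >= vocab_rows (e.g. newly added graph tokens without a
# matching resize_token_embeddings call) would trip exactly this kind of
# device-side assert once the batch reaches an embedding/scatter kernel.
sample = tokenizer("sanity check", return_tensors="pt")
assert int(sample["input_ids"].max()) < vocab_rows
```

I will also rerun with `CUDA_LAUNCH_BLOCKING=1` so the Python-side stack trace points at the exact failing op, in case that extra information helps.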
I use the suggested configurations (environments, scripts) and run the tuning on a Linux server with 4 A100 GPUs in a distributed manner. I have also tried running the tuning on a single GPU only; to avoid CUDA OOM errors there, I reduced the train/eval batch size to 1. However, that run fails with a different error:
Token indices sequence length is longer than the specified maximum sequence length for this model (3338 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/k/lgm/graphGPT-main/graphgpt/train/train_mem.py", line 15, in <module>
train()
File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
trainer.train()
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
outputs = self.model(
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 209, in forward
node_forward_out = graph_tower(g)
File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/k/lgm/graphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
device = self.parameters().__next__().device
StopIteration
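If it is relevant, this StopIteration looks like the known interaction between `next(self.parameters())` and nn.DataParallel: replica modules do not expose registered parameters, so `self.parameters().__next__()` in graph_transformer.py raises immediately. A guard along the lines below might sidestep the device lookup; this is only a sketch of what I have in mind, and `_module_device` is a hypothetical helper name, not code from the repo:

```python
import torch

def _module_device(module: torch.nn.Module) -> torch.device:
    # DataParallel replicas (and parameter-less modules) yield an empty
    # parameters() iterator, which is what makes
    # `self.parameters().__next__()` raise StopIteration above.
    param = next(module.parameters(), None)
    if param is not None:
        return param.device
    buf = next(module.buffers(), None)
    if buf is not None:
        return buf.device
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

Is this a sensible workaround, or is the graph tower expected to be initialized differently so that it always has parameters at this point?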
As it stands, I cannot reproduce the tuning process on either a single GPU or multiple GPUs. Any suggestions for troubleshooting would be appreciated. Looking forward to your kind reply!