🐛 Bug
I'm getting an out-of-memory error while training gpt2-large with batch_size=1. I'm using the examples/run_language_modeling.py script with a custom dataset of varied-length examples; the maximum block_size is 1024.
This is the command I'm using:
python -m torch.distributed.launch --nproc_per_node 8 run_language_modeling.py --output_dir=./output_attention_mask_padding/ --model_type=gpt2 --model_name_or_path=gpt2-large --do_train --train_data_file=./data/training.txt --line_by_line --per_gpu_train_batch_size 1 --num_train_epochs 3 --fp16
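For reference, my understanding of what --line_by_line does with block_size 1024 is roughly the following (a minimal sketch assuming the v2.6.0 tokenizer API, not the script's exact code):

from transformers import GPT2Tokenizer

# Rough sketch of line-by-line preprocessing (illustrative, not the script's exact code):
# each non-empty line becomes one example, truncated to at most block_size tokens.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
block_size = 1024

with open("./data/training.txt", encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if line.strip()]

examples = [
    tokenizer.encode(line, add_special_tokens=True, max_length=block_size)
    for line in lines
]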
I tried changing args.gradient_accumulation_steps, but without success.
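As far as I understand, gradient accumulation only defers the optimizer step; with per_gpu_train_batch_size already at 1, every micro-step still runs a full forward/backward pass on a 1024-token example, roughly like this toy sketch (illustrative, not the script's code):

import torch
from torch import nn

# Toy gradient-accumulation loop (hypothetical model, not the script's code).
# Each micro-step still runs a full forward/backward pass on one example, so the
# peak activation memory per step does not shrink; only the optimizer update is deferred.
model = nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = [torch.randn(1, 1024) for _ in range(16)]  # batch size is already 1
accumulation_steps = 8

optimizer.zero_grad()
for step, x in enumerate(data):
    loss = model(x).pow(2).mean()            # full forward pass
    (loss / accumulation_steps).backward()   # full backward pass
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one weight update per N micro-steps
        optimizer.zero_grad()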
Here's the traceback:
Traceback (most recent call last): | 9/213 [00:45<09:51, 2.90s/it]
File "run_language_modeling.py", line 988, in <module>
main()
File "run_language_modeling.py", line 938, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_language_modeling.py", line 506, in train
outputs = model(inputs, masked_lm_labels=labels, attention_mask=attention_mask) if args.mlm else model(inputs, labels=labels, attention_mask=attention_mask)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/deepspeed/.local/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 612, in forward
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 916, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/wrap.py", line 27, in wrapper
kwargs)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/utils.py", line 78, in casted_args
new_args.append(cast_fn(x))
File "/usr/local/lib/python3.6/dist-packages/apex/amp/utils.py", line 71, in maybe_float
return x.float()
RuntimeError: CUDA out of memory. Tried to allocate 190.00 MiB (GPU 2; 31.72 GiB total capacity; 28.71 GiB already allocated; 135.88 MiB free; 1.66 GiB cached)
Traceback (most recent call last):
File "run_language_modeling.py", line 988, in <module>
main()
File "run_language_modeling.py", line 938, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_language_modeling.py", line 523, in train
scaled_loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 4; 31.72 GiB total capacity; 29.42 GiB already allocated; 155.88 MiB free; 951.73 MiB cached)
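For context, the first failure happens while apex casts the fp16 logits to fp32 for the cross-entropy loss; a back-of-the-envelope size for that tensor (assuming GPT-2's 50257-token vocabulary) lands in the same ballpark as both failed allocations:

# Back-of-the-envelope size of the fp32 logits tensor that the loss cast materializes
# (assumption: GPT-2's 50257-token vocabulary, one 1024-token sequence per GPU).
vocab_size = 50257
seq_len = 1024
fp32_bytes = 4

logits_mib = seq_len * vocab_size * fp32_bytes / 2**20
print(f"{logits_mib:.0f} MiB")  # ~196 MiB, close to the ~190/194 MiB allocations above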
Environment info
- transformers version: 2.6.0
- Platform: Linux
- Using distributed or parallel set-up in script?: Yes