Out of memory error while training GPT2-large on 8x32GB Nvidia Volta #3616

@timsoraro

Description

🐛 Bug

I'm getting an out-of-memory error while training gpt2-large with batch_size=1, using the examples/run_language_modeling.py script. The dataset is a custom one with varied-length examples; the maximum block_size is 1024.

This is the command I'm using:

python -m torch.distributed.launch --nproc_per_node 8 run_language_modeling.py --output_dir=./output_attention_mask_padding/ --model_type=gpt2 --model_name_or_path=gpt2-large --do_train --train_data_file=./data/training.txt --line_by_line --per_gpu_train_batch_size 1 --num_train_epochs 3 --fp16

I tried changing args.gradient_accumulation_steps, but without success.

Here's the traceback:

Traceback (most recent call last):
  File "run_language_modeling.py", line 988, in <module>
    main()
  File "run_language_modeling.py", line 938, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_language_modeling.py", line 506, in train
    outputs = model(inputs, masked_lm_labels=labels, attention_mask=attention_mask) if args.mlm else model(inputs, labels=labels, attention_mask=attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/deepspeed/.local/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 612, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/wrap.py", line 27, in wrapper
    kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/utils.py", line 78, in casted_args
    new_args.append(cast_fn(x))
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/utils.py", line 71, in maybe_float
    return x.float()
RuntimeError: CUDA out of memory. Tried to allocate 190.00 MiB (GPU 2; 31.72 GiB total capacity; 28.71 GiB already allocated; 135.88 MiB free; 1.66 GiB cached)
Traceback (most recent call last):
  File "run_language_modeling.py", line 988, in <module>
    main()
  File "run_language_modeling.py", line 938, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_language_modeling.py", line 523, in train
    scaled_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 4; 31.72 GiB total capacity; 29.42 GiB already allocated; 155.88 MiB free; 951.73 MiB cached)

Environment info

  • transformers version: 2.6.0
  • Platform: Linux
  • Using distributed or parallel set-up in script?: Yes
