This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
Is T5 3B training properly parallelizing? #4559
Open
Description
I am trying to train a T5 model on EmpatheticDialogues and I am running into CUDA OOM errors with the command below. When training the BlenderBot 3B model, I hit the same issue until I parallelized training across two GPUs; however, parallelizing T5 3B does not seem to resolve it. I have also reduced the batch size to 1 and the truncation to 128 (truncating at 64 does not work either). Any suggestions for resolving this?
Command
parlai train_model -t empathetic_dialogues -m hugging_face/t5 --t5-model-arch t5-3b --t5-model-parallel True --fp16 True --optimizer adam --batchsize 1 --skip-generation True -vmt ppl -tr 64 --model-file ./chatbot_models/3B/testdebugT5/model --tstep 100
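For context, my understanding is that --t5-model-parallel ends up relying on Hugging Face's parallelize() to spread the 24 T5 blocks of t5-3b across the visible GPUs. The sketch below is only my guess at what that amounts to; the explicit device map is mine, not necessarily the one ParlAI builds internally:

# Rough sketch (my assumption) of what --t5-model-parallel does for t5-3b on 2 GPUs.
# Hugging Face's T5 exposes parallelize(), which places groups of the 24
# transformer blocks on different devices.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-3b")
device_map = {
    0: list(range(0, 12)),   # blocks 0-11 on cuda:0
    1: list(range(12, 24)),  # blocks 12-23 on cuda:1
}
model.parallelize(device_map)  # calling parallelize() with no map balances automatically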
Error message
/home/rg4312/ParlAI/parlai/utils/fp16.py:85: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
return torch.nn.utils.clip_grad_norm_(params, max_norm)
09:27:14 | Ran out of memory, skipping batch. if this happens frequently, decrease batchsize or truncate the inputs to the model.
Traceback (most recent call last):
File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 603, in _fake_forward_backward_pass
loss = 0 * self.compute_loss(self._dummy_batch)
File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 693, in compute_loss
model_output = self.model(*self._model_input(batch), ys=batch.label_vec)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 312, in forward
scores, preds = self.decode_forced(encoder_states, ys)
File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 181, in decode_forced
latent, _ = self.decoder(inputs, encoder_states)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/rg4312/ParlAI/parlai/agents/hugging_face/t5.py", line 59, in wrap
ret = func(*args, **kwargs)
File "/home/rg4312/ParlAI/parlai/agents/hugging_face/t5.py", line 274, in forward
outputs = self.stack(
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 985, in forward
layer_outputs = layer_module(
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 663, in forward
cross_attention_outputs = self.layer[1](
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 578, in forward
attention_output = self.EncDecAttention(
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 470, in forward
query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 96, in forward
return F.linear(input, self.weight, self.bias)
File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 44.49 GiB total capacity; 43.46 GiB already allocated; 2.00 MiB free; 43.48 GiB reserved in total by PyTorch)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ext3/miniconda3/envs/chatbot/bin/parlai", line 33, in <module>
sys.exit(load_entry_point('parlai', 'console_scripts', 'parlai')())
File "/home/rg4312/ParlAI/parlai/__main__.py", line 14, in main
superscript_main()
File "/home/rg4312/ParlAI/parlai/core/script.py", line 325, in superscript_main
return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
File "/home/rg4312/ParlAI/parlai/core/script.py", line 108, in _run_from_parser_and_opt
return script.run()
File "/home/rg4312/ParlAI/parlai/scripts/train_model.py", line 998, in run
return self.train_loop.train()
File "/home/rg4312/ParlAI/parlai/scripts/train_model.py", line 950, in train
for _train_log in self.train_steps():
File "/home/rg4312/ParlAI/parlai/scripts/train_model.py", line 857, in train_steps
world.parley()
File "/home/rg4312/ParlAI/parlai/core/worlds.py", line 370, in parley
acts[1] = agents[1].act()
File "/home/rg4312/ParlAI/parlai/core/torch_agent.py", line 2143, in act
response = self.batch_act([self.observation])[0]
File "/home/rg4312/ParlAI/parlai/core/torch_agent.py", line 2234, in batch_act
output = self.train_step(batch)
File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 759, in train_step
self._fake_forward_backward_pass()
File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 614, in _fake_forward_backward_pass
raise RuntimeError(m)
RuntimeError: CUDA OOM: Lower batch size (-bs) from 1 or lower max sequence length (-tr) from 128
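To check whether the model is actually being split, I added a quick diagnostic that prints where the parameters live and how much memory each GPU holds. How best to grab the underlying Hugging Face module from the ParlAI agent is an assumption on my part; any torch.nn.Module handle should work here:

# Quick diagnostic: if (almost) every parameter reports cuda:0 and GPU 1 shows
# near-zero allocation, model parallelism is not taking effect.
from collections import Counter

import torch


def report_gpu_usage(model: torch.nn.Module) -> None:
    # Count parameters per device, e.g. Counter({'cuda:0': 400, 'cuda:1': 394})
    print(Counter(str(p.device) for p in model.parameters()))
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"cuda:{i}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")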