Failed to train CodeT5p-2b on multi-gpus card #156

Open
zhuxunyu opened this issue Nov 2, 2023 · 0 comments

zhuxunyu commented Nov 2, 2023

Hello, I tried to fine-tune codet5p-2b. I loaded the model from Hugging Face and got a CUDA out-of-memory error, so I tried to spread the model across multiple GPUs by adding device_map = 'auto' when loading it. That produced a different error:

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
  ==> Loaded model from Salesforce/codet5p-2b, model size 3112427008
Starting main loop
/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
  0%|                                                  | 0/1760 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/zhuxunyu/.cache/huggingface/modules/transformers_modules/codet5p-2b/modeling_codet5p.py", line 936, in forward
    loss = loss_fct(logits.reshape(-1, self.decoder.config.vocab_size), labels.view(-1))
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/zhuxunyu/miniconda3/envs/openai/lib/python3.8/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
python-BaseException
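
For reference, the traceback fails inside the cross-entropy call because the logits and the labels are on different devices (cuda:7 and cuda:0). A minimal sketch of a loss computation that avoids this specific mismatch is below. It assumes the cause is that the labels stay on the device where the inputs were placed while the logits come out on the last device in the map. The function name `device_safe_seq2seq_loss` is hypothetical and is not part of modeling_codet5p.py or transformers.

```python
import torch
import torch.nn.functional as F


def device_safe_seq2seq_loss(logits, labels, ignore_index=-100):
    """Cross-entropy over a (batch, seq, vocab) logits tensor.

    Hypothetical helper: moves the labels onto the logits' device first,
    so a model sharded with device_map='auto' does not hit a device
    mismatch inside F.cross_entropy.
    """
    vocab_size = logits.size(-1)
    # Assumption: labels may live on a different GPU than the logits.
    labels = labels.to(logits.device)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size),
        labels.reshape(-1),
        ignore_index=ignore_index,
    )
```

Something like this could be called from an overridden `compute_loss` in a `Trainer` subclass instead of passing `labels` into the model's forward, although I have not verified that on codet5p-2b.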