
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm) #612

Closed
FryLcm opened this issue Nov 23, 2023 · 15 comments

Comments

@FryLcm

FryLcm commented Nov 23, 2023

File "C:\Users\zzx\Desktop\PFL-Non-IID-231119\system\flcore\clients\clientavg.py", line 45, in train
self.optimizer.step()
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 513, in step
if self.pre_step():
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 494, in pre_step
self.clip_and_accumulate()
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 412, in clip_and_accumulate
grad = contract("i,i...", per_sample_clip_factor, grad_sample)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 507, in contract
return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 573, in _core_contract
new_view = _tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\sharing.py", line 131, in cached_tensordot
return tensordot(x, y, axes, backend=backend)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 374, in _tensordot
return fn(x, y, axes=axes)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\backends\torch.py", line 54, in tensordot
return torch.tensordot(x, y, dims=axes)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\torch\functional.py", line 1193, in tensordot
return _VF.tensordot(a, b, dims_a, dims_b) # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Process finished with exit code 1

@FryLcm
Author

FryLcm commented Nov 23, 2023

It runs normally when differential privacy is not turned on, but this error occurs as soon as DP is turned on.
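
For context, a hedged sketch of what "turning DP on" with Opacus typically looks like; model, optimizer, train_loader and the DP parameter values below are placeholders, not taken from the reporter's code. The error only appears once the optimizer is wrapped this way, because optimizer.step() is then routed through clip_and_accumulate as in the traceback above.

    from opacus import PrivacyEngine

    # Placeholder setup: model, optimizer and train_loader are assumed to already exist.
    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=1.0,  # illustrative value
        max_grad_norm=1.0,     # illustrative value
    )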

@calibretaliation

Hi, I'm seeing the same problem as you. Can you please show me how to turn off DP? I'm using the text-to-image LoRA script for Stable Diffusion.

@HuanyuZhang
Contributor

For Opacus, we need the full model to be on the same device (for any one sample). In other words, we do not support slicing the model across different devices, since we need to clip each per-sample gradient; we only allow slicing the batch across devices. Could you check whether this is the case for your code?
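
As a quick sanity check, a minimal sketch (assuming a generic `model` nn.Module variable, not code from this thread) that verifies every parameter and buffer sits on a single device before wrapping the model with Opacus:

    # A correctly placed model should report exactly one device,
    # e.g. {device(type='cuda', index=0)}.
    devices = {p.device for p in model.parameters()}
    devices |= {b.device for b in model.buffers()}
    assert len(devices) == 1, f"Model is split across devices: {devices}"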

@gauriprdhn

Hi, I am experiencing the same issue, but there is a twist for me: for one random seed the code works without a hitch, but for another it yields this error. Why does seeding affect whether I see the error or not?

@javismiles

javismiles commented Nov 28, 2023

I have the same issue running the diffusers LoRA script. Did you find a solution? I'm using Linux (Ubuntu 20) with an RTX 3090, and I get this error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)"

@HuanyuZhang
Contributor

Could anyone share the code (using our template)? There is very little we can do without seeing your code. Thanks!

@gauriprdhn

I can't provide code that reproduces the error (sharing it is prohibited, and it's convoluted), but here is the full traceback:

Traceback (most recent call last):
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 449, in <module>
    main()
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 26, in main
    learner.run()
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 178, in run
    self.run_lira(
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 322, in run_lira
    accuracy, eps = self.train_test(
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 201, in train_test
    self.eps, self.delta = self.fine_tune_batch(model=model, train_loader=train_loader)
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 257, in fine_tune_batch
    optimizer.step()
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 513, in step
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 494, in pre_step
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 412, in clip_and_accumulate
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 507, in contract
    return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 573, in _core_contract
    new_view = _tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/sharing.py", line 131, in cached_tensordot
    return tensordot(x, y, axes, backend=backend)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 374, in _tensordot
    return fn(x, y, axes=axes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/backends/torch.py", line 54, in tensordot
    return torch.tensordot(x, y, dims=axes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/torch/functional.py", line 1100, in tensordot
    return _VF.tensordot(a, b, dims_a, dims_b)  # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

My friend suggested it could be an issue with the PyTorch version (I am using 2.0.0).

@javismiles

I found a very simple solution in my case:

for epoch in range(first_epoch, args.num_train_epochs):
    unet.to("cuda")
    unet.train()

I added a .to("cuda") call on the unet model before .train(), and that fixed it; it works now.

This is in train_text_to_image_lora.py.

@HuanyuZhang
Contributor

Hey @javismiles, could you share your code, or at least the logic where you add Opacus to train_text_to_image_lora.py (https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)? Thanks!

@vaibhav0195

Hi, I also got the same error. I have double-checked that the data, the targets, and the model are all on the same GPU.

@gauriprdhn

I figured out the issue a couple of days ago. I narrowed the error down to the clip_and_accumulate function in opacus/optimizers/optimizer.py. If you go to line 399 (https://github.com/pytorch/opacus/blob/main/opacus/optimizers/optimizer.py#L399) you'll find that, in the case of an empty batch, per_sample_clip_factor is initialised as torch.zeros((0,)), an empty tensor that is NOT on the GPU. You'll need to change that line so that this zero tensor is on the same device as the batch (which, even though it is empty, is still on the GPU).
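
A minimal standalone sketch of the mismatch described above (not the Opacus source itself; the shapes are arbitrary and it assumes a CUDA device is available). With Opacus' Poisson sampling an empty batch can occasionally occur (which would also explain why the error can depend on the random seed); the empty clip factor is then created on the CPU while the per-sample gradients live on cuda:0, so the contraction from the traceback fails:

    import torch
    from opt_einsum import contract

    grad_sample = torch.randn(0, 16, 16, device="cuda")  # empty batch, still on the GPU
    per_sample_clip_factor = torch.zeros((0,))            # created on the CPU by default

    # contract("i,i...", per_sample_clip_factor, grad_sample)  # raises the RuntimeError above
    grad = contract("i,i...", per_sample_clip_factor.to(grad_sample.device), grad_sample)  # works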

@Tian99Yu

I had the exact same issue. @gauriprdhn thank you so much for pointing it out.

One possible solution is modifying

    per_sample_clip_factor = torch.zeros((0,))

into

    per_sample_clip_factor = torch.zeros((0,), device=self.grad_samples[0].device)

@HuanyuZhang
Contributor

Thanks, all, for the valuable feedback and comments. We will launch a fix soon (special thanks to @gauriprdhn; please let me know if you would like to submit a PR yourself).

@HuanyuZhang
Contributor

HuanyuZhang commented Feb 21, 2024

Closed the issue, since we launched a fix in PR #631.

@L7c8ana

L7c8ana commented May 27, 2024

If you use Docker and the last frame of the error points to linear.py, you can edit /opt/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py: at line 104 add

    self.device = device

and change line 116 from

    return F.linear(input, self.weight, self.bias)

to:

    return F.linear(input.to(self.device), self.weight.to(self.device), self.bias.to(self.device))
