RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm) #612
Comments
It runs normally when differential privacy is not turned on, but this error occurs as soon as DP is enabled.
Hi, I'm seeing the same problem as you. Could you please show me how to turn off DP? I'm using the text-to-image LoRA script for Stable Diffusion.
For Opacus, we need the full model to be on the same device (for each sample). In other words, we do not support slicing the model across different machines, because we need to clip per-sample gradients. We only allow splitting batches across different devices. Could you check whether this is the case in your code?
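To make that requirement concrete, here is a small diagnostic sketch (not part of Opacus; the helper name `check_single_device` is made up for illustration) that verifies every parameter and buffer of a model lives on a single device before handing it to the privacy engine:

```python
import torch
import torch.nn as nn

def check_single_device(model: nn.Module) -> torch.device:
    """Raise if the model's parameters and buffers span more than one
    device; Opacus needs the whole model on one device so it can clip
    per-sample gradients."""
    devices = {p.device for p in model.parameters()}
    devices |= {b.device for b in model.buffers()}
    if len(devices) > 1:
        raise RuntimeError(f"Model is spread across devices: {devices}")
    return devices.pop()

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
print(check_single_device(model))  # cpu
```

Running a check like this right before attaching Opacus makes the "same device" precondition fail fast, instead of surfacing later as a cryptic `wrapper_CUDA_mm` error inside the optimizer step.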
Hi, I am experiencing the same issue, but with a twist: for one random seed the code works without a hitch, while for another it yields this error. Why does the seed affect whether I see the error?
I have the same issue running the LoRA script of diffusers; did you find a solution? I'm using Ubuntu 20 with an RTX 3090, and I get this error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)"
Could anyone share the code (using our template)? There is very little we can do without seeing your code. Thanks!
I can't provide code that reproduces the error (sharing it is prohibited, and it's convoluted), but here's a snippet of the full error:
My friend suggested it could be an issue with the PyTorch version (I am using 2.0.0). |
I found a very simple solution in my case: I added a .to("cuda") call on the UNet model before .train(), and that fixed it. It works now. This is in train_text_to_image_lora.py.
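A minimal sketch of that fix, using a tiny stand-in model since the real UNet is loaded from diffusers (the variable name `unet` and the model architecture here are placeholders, not the actual script's code):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the LoRA-wrapped UNet; the real script loads it via diffusers.
unet = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)

# The reported fix: move the whole model to the training device *before*
# calling .train() and attaching Opacus, so every parameter touched by the
# per-sample-gradient hooks lives on one device.
unet = unet.to(device)
unet.train()

x = torch.randn(2, 3, 16, 16, device=device)
out = unet(x)
print(out.shape)  # torch.Size([2, 3, 16, 16])
```

The key point is the ordering: the `.to(device)` move has to happen before Opacus wraps the model, otherwise some tensors it captures can remain on CPU.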
Hey @javismiles, could you share your code, or at least the logic where you add Opacus to train_text_to_image_lora.py (https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)? Thanks!
Hi, I also got the same error. I have double-checked that the data, targets, and model are all on the same GPU.
I figured out the issue a couple of days ago. I could narrow down the error to this function in
I had the exact same issue. @gauriprdhn, thank you so much for pointing it out. One possible solution is modifying per_sample_clip_factor = torch.zeros((0,))
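The likely failure mode can be sketched as follows. This is an illustrative reconstruction, not the actual Opacus code or the exact PR #631 diff; the function name `clip_factors` is hypothetical:

```python
import torch

def clip_factors(per_sample_norms: torch.Tensor, max_grad_norm: float) -> torch.Tensor:
    """Hypothetical sketch of the per-sample clip-factor computation.

    The bug: with zero samples, building the factors as torch.zeros((0,))
    always lands on CPU, so the subsequent contraction with a CUDA
    grad_sample mixes devices. Creating the empty tensor on the same
    device as the norms avoids the mismatch.
    """
    if per_sample_norms.numel() == 0:
        return torch.zeros((0,), device=per_sample_norms.device)
    return (max_grad_norm / (per_sample_norms + 1e-6)).clamp(max=1.0)

grad_sample = torch.randn(4, 3, 5)                 # (batch, *param_shape)
norms = grad_sample.flatten(1).norm(2, dim=1)      # one norm per sample
factors = clip_factors(norms, max_grad_norm=1.0)
# Same contraction shape as contract("i,i...") in clip_and_accumulate:
clipped = torch.einsum("i,i...->...", factors, grad_sample)
print(clipped.shape)
```

The point is that the empty tensor inherits its device explicitly instead of defaulting to CPU, which is what the traceback's tensordot step trips over.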
Thanks all for the valuable feedback and comments. We will launch a fix soon (special thanks to @gauriprdhn! Please let me know if you want to submit a PR yourself).
Closing this issue, since we launched a fix in PR #631.
If you use Docker and the last error is about linear.py, it will be in /opt/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py, line 104:
File "C:\Users\zzx\Desktop\PFL-Non-IID-231119\system\flcore\clients\clientavg.py", line 45, in train
self.optimizer.step()
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 513, in step
if self.pre_step():
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 494, in pre_step
self.clip_and_accumulate()
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 412, in clip_and_accumulate
grad = contract("i,i...", per_sample_clip_factor, grad_sample)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 507, in contract
return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 573, in _core_contract
new_view = _tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\sharing.py", line 131, in cached_tensordot
return tensordot(x, y, axes, backend=backend)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 374, in _tensordot
return fn(x, y, axes=axes)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\backends\torch.py", line 54, in tensordot
return torch.tensordot(x, y, dims=axes)
File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\torch\functional.py", line 1193, in tensordot
return _VF.tensordot(a, b, dims_a, dims_b) # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Process finished with exit code 1