
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu) #131

Open
kosmels opened this issue Jun 6, 2024 · 3 comments


kosmels commented Jun 6, 2024

Hello,

I am trying to train on a custom dataset (I have already prepared 1-to-1 image pairs, and my seeds.json looks like this: [["0000000", ["0"]], ["0000001", ["1"]], ...) with 3x NVIDIA TITAN RTX 24GB. Initialization of all the models works fine, but during the validation sanity check I get this error:

...
[rank0]:   File "/code/instruct-pix2pix/./stable_diffusion/ldm/models/diffusion/ddpm_edit.py", line 892, in forward
[rank0]:     return self.p_losses(x, c, t, *args, **kwargs)
[rank0]:   File "/code/instruct-pix2pix/./stable_diffusion/ldm/models/diffusion/ddpm_edit.py", line 1043, in p_losses
[rank0]:     logvar_t = self.logvar[t].to(self.device)
[rank0]: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Do you know where this could come from? I did not change anything in the source code; I just prepared the data and updated the paths in the train config.

Thanks in advance!
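
For context, the error itself is just PyTorch refusing to index a CPU tensor with CUDA indices, which a minimal sketch reproduces (assuming a CUDA device is available; the names below are illustrative, not taken from the repository):

import torch

logvar = torch.zeros(1000)                        # stays on the CPU, like self.logvar
t = torch.randint(0, 1000, (4,), device="cuda")   # timesteps sampled on the GPU

# Raises: RuntimeError: indices should be either on cpu or on the same
# device as the indexed tensor (cpu)
logvar_t = logvar[t]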


kosmels commented Jun 7, 2024

UPDATE: Solved here CompVis/stable-diffusion#851

After a few steps of debugging I found that self.logvar is on device==cpu (it is initialized here: https://github.com/timothybrooks/instruct-pix2pix/blob/main/stable_diffusion/ldm/models/diffusion/ddpm_edit.py#L123), while t is on device==cuda.

As a small workaround I moved t to the CPU for this indexing:

logvar_t = self.logvar[t.to(self.logvar.device)].to(self.device)

but I am not sure if this is correct. If it is, self.logvar should probably be moved to cuda somewhere, because it seems that self.device is still cpu during initialization.
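
An alternative that avoids the per-call move (a sketch, assuming the initialization at the linked line; this is not how the repository currently does it) is to register logvar as a buffer, so that .to(device) and Lightning move it together with the module:

import torch
import torch.nn as nn

class LogvarDemo(nn.Module):
    # Hypothetical stand-in for the DDPM module; only the logvar handling is shown.
    def __init__(self, num_timesteps: int = 1000, logvar_init: float = 0.0):
        super().__init__()
        # register_buffer (instead of a plain tensor attribute) makes logvar
        # follow the module across devices, so indexing with a CUDA t is safe.
        self.register_buffer("logvar", torch.full((num_timesteps,), logvar_init))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.logvar[t]

if torch.cuda.is_available():
    model = LogvarDemo().cuda()            # the buffer moves to the GPU with the module
    t = torch.randint(0, 1000, (4,), device="cuda")
    print(model(t).device)                 # cuda:0, no device mismatch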

Another question, of course, is why I am getting this error at all. Did you not hit this issue during development?


Evangade commented Jul 4, 2024

Same problem here, thank you very much!

LitaoLiu01 commented

Actually, I may know the reason: I think this error may not happen in certain CUDA environments, such as cu113. I reproduced ip2p two months ago on cu113 and there was no such error, but recently I used H100 GPUs to train, which require cu118+, and the error appeared.
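
For anyone comparing environments, the installed PyTorch build and its CUDA toolkit can be printed directly (a minimal check; nothing here is specific to instruct-pix2pix):

import torch

# Report the PyTorch version and the CUDA toolkit it was built against;
# torch.version.cuda is None for CPU-only builds.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))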
