I’m having issues with the sampling script.
First, there seems to be an issue parsing the split, i.e. SPLITS="\"[test, ]\""
The script crashes with the following error: https://gist.github.com/roudimit/5c76893a380b999508401820c3fb00e4 (put in a gist since it’s so long).
I guess OmegaConf can't parse this environment variable (SpecVQGAN/evaluation/generate_samples.py, line 32 in f209a5a).
The next problem seems to be with the GPU assignment. Here is the error I get after hard coding the test set.
Traceback (most recent call last):
File "evaluation/generate_samples.py", line 303, in <module>
Traceback (most recent call last):
File "evaluation/generate_samples.py", line 303, in <module>
main()
File "evaluation/generate_samples.py", line 299, in main
sample(local_rank, cfg, samples_split_dirs, is_ddp)
File "evaluation/generate_samples.py", line 262, in sample
main()
File "evaluation/generate_samples.py", line 299, in main
torch.cuda.set_device(device)
File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
sample(local_rank, cfg, samples_split_dirs, is_ddp)
File "evaluation/generate_samples.py", line 262, in sample
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
torch.cuda.set_device(device)
File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process `9479` closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 9480) of binary: /home/gridsan/roudi/.conda/envs/specvqgan/bin/python
I tried both the model that I trained and your model, i.e. EXPERIMENT_PATH="./logs/2021-07-30T21-34-25_vggsound_transformer".
I also tried with 1 and 2 GPUs, and on different machines, and the same error comes up. I am using the specvqgan conda environment. Do you have any idea about this? Thanks!
Thanks a lot for filing the issue and providing a thorough traceback!
I don't have much time to look into this at the moment but I will do it maybe next week or so. Here are a few comments:
Yes, I must admit those evaluation bits don't look good, as I was mostly relying on the batch file! Thanks for the gist, by the way. The last line of it suggests you are right: SPLITS does not seem to have been parsed correctly. I would tweak the command until cfg.sampler.splits gives me a list, as I expect it to be.
Could you check which workers return which GPU id? Print the device before torch.cuda.set_device(device).
You are using the chunks from the README to run these, aren't you?
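The device check suggested above could be sketched roughly as follows. This is only an illustration: `resolve_local_rank` and `check_device_ordinal` are hypothetical helpers, not functions from the repo, and it assumes the launcher exports `LOCAL_RANK` for each worker. "invalid device ordinal" generally means the requested index is not among the devices visible to the process, so the sketch compares the rank with the visible device count; whether that is the cause here is not confirmed in this thread.

```python
import os


def resolve_local_rank(env=None) -> int:
    """Read the worker's local rank from the environment (assumed to be LOCAL_RANK)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", 0))


def check_device_ordinal(local_rank: int, device_count: int) -> str:
    """Describe whether local_rank is a usable device index given device_count visible devices."""
    if 0 <= local_rank < device_count:
        return f"rank {local_rank}: ok ({device_count} visible device(s))"
    return f"rank {local_rank}: invalid ordinal, only {device_count} visible device(s)"


if __name__ == "__main__":
    # Stand-in value so the sketch runs without a GPU; in the real script this
    # would be torch.cuda.device_count() (assumption: torch is importable there).
    visible = 1
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print(check_device_ordinal(resolve_local_rank(), visible), flush=True)
```

In generate_samples.py the same print would go right before `torch.cuda.set_device(device)`, so that each worker reports its index before the call that fails.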
Hi Vladimir, thanks for the great project / repo!
I’m having issues with the sampling script.
First, there seems to be an issue parsing the split, i.e. SPLITS="\"[test, ]\""
The script crashes with the following error: https://gist.github.com/roudimit/5c76893a380b999508401820c3fb00e4 (put in a gist since it’s so long).
I guess OmegaConf can't parse this environment variable (SpecVQGAN/evaluation/generate_samples.py, line 32 in f209a5a).
I made a temporary workaround by setting SPLITS= and hardcoding the test set at SpecVQGAN/evaluation/generate_samples.py, line 70 and line 90 in f209a5a.
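As an alternative to hard-coding the test set, the raw string could be normalised into a plain Python list before it is used. The snippet below is a minimal sketch under that assumption: `parse_splits` is a hypothetical helper (it is not part of SpecVQGAN or OmegaConf), and it simply tolerates the extra quotes, backslashes, brackets and trailing comma seen in the SPLITS value above.

```python
# Characters that shell escaping may leave around the value: ", ' and \
_QUOTES = "\"'\\"


def parse_splits(raw: str) -> list:
    """Turn strings such as '"[test, ]"' or 'train,valid' into a list of split names."""
    # Remove surrounding whitespace and stray quote/backslash characters.
    cleaned = raw.strip().strip(_QUOTES)
    # Remove one pair of enclosing brackets, if present.
    if cleaned.startswith("[") and cleaned.endswith("]"):
        cleaned = cleaned[1:-1]
    splits = []
    for item in cleaned.split(","):
        name = item.strip().strip(_QUOTES).strip()
        # Skip empty entries, e.g. the one produced by a trailing comma.
        if name:
            splits.append(name)
    return splits


if __name__ == "__main__":
    print(parse_splits('"[test, ]"'))
```

The result could then be assigned to the config entry that the script reads (cfg.sampler.splits, per the reply above), which would avoid editing the places where the split is consumed.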