Issues with the sampling script #21

roudimit · 2022-07-25T13:11:51Z

Hi Vladimir, thanks for the great project / repo!

I’m having issues with the sampling script.
First, there seems to be an issue parsing the splits, i.e. SPLITS="\"[test, ]\""
The script crashes with the following error: https://gist.github.com/roudimit/5c76893a380b999508401820c3fb00e4 (put in a gist since it’s so long).
I guess OmegaConf can't parse this environment variable:

SpecVQGAN/evaluation/generate_samples.py

Line 32 in f209a5a

config_cli = OmegaConf.from_cli()

I made a temporary workaround by setting SPLITS= and hardcoding the test set here

SpecVQGAN/evaluation/generate_samples.py

Line 70 in f209a5a

for split in cfg.sampler.splits:

and here

SpecVQGAN/evaluation/generate_samples.py

Line 90 in f209a5a

    
dsets.datasets = {split: dset for split, dset in dsets.datasets.items() if split in cfg.sampler.splits}
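
As an alternative to hardcoding the split, the value could be normalized before those two lines use it. The sketch below is not code from the repository: `normalize_splits` is a hypothetical helper, and it assumes the value may arrive either as a real list or as a string such as "[test, ]" with leftover quotes from the shell.

```python
from typing import Iterable, List, Optional, Union


def normalize_splits(value: Optional[Union[str, Iterable[str]]]) -> List[str]:
    """Coerce a `splits` setting into a clean list of split names.

    Hypothetical helper for illustration; not part of SpecVQGAN.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # Remove quotes left over from shell quoting, e.g. '"[test, ]"'.
        text = value.strip().strip("\"'").strip()
        # Remove the surrounding brackets of a list written as a string.
        if text.startswith("[") and text.endswith("]"):
            text = text[1:-1]
        items = text.split(",")
    else:
        items = list(value)
    cleaned = [str(item).strip().strip("\"'").strip() for item in items]
    # Drop empty entries, such as the one produced by a trailing comma.
    return [name for name in cleaned if name]


if __name__ == "__main__":
    print(normalize_splits('"[test, ]"'))
    print(normalize_splits(["validation", "test"]))
```

With something like this, `cfg.sampler.splits` could be passed through the helper once, right after the config is loaded, so that both the loop and the dictionary filter see a plain list.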

The next problem seems to be with the GPU assignment. Here is the error I get after hard coding the test set.

Traceback (most recent call last):
  File "evaluation/generate_samples.py", line 303, in <module>
Traceback (most recent call last):
  File "evaluation/generate_samples.py", line 303, in <module>
    main()
  File "evaluation/generate_samples.py", line 299, in main
    sample(local_rank, cfg, samples_split_dirs, is_ddp)
  File "evaluation/generate_samples.py", line 262, in sample
    main()
  File "evaluation/generate_samples.py", line 299, in main
    torch.cuda.set_device(device)
  File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    sample(local_rank, cfg, samples_split_dirs, is_ddp)
  File "evaluation/generate_samples.py", line 262, in sample
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    torch.cuda.set_device(device)
  File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process `9479` closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 9480) of binary: /home/gridsan/roudi/.conda/envs/specvqgan/bin/python

I tried both the model that I trained and your model, i.e. EXPERIMENT_PATH="./logs/2021-07-30T21-34-25_vggsound_transformer".
I also tried with 1 and 2 GPUs, and on different machines, and the same error comes up. I am using the specvqgan conda environment. Do you have any idea about this? Thanks!


v-iashin · 2022-07-25T13:33:25Z

Hi,

Thanks a lot for filing the issue and providing a thorough traceback!

I don't have much time to look into this at the moment but I will do it maybe next week or so. Here are a few comments:

1. Yes, I must admit those evaluation bits don't look good, as I was mostly relying on the batch file! Thanks for the gist, by the way. Its last line suggests that you are right and SPLITS was not parsed correctly. I would tweak the command until cfg.sampler.splits gives a list, as expected.
2. Could you check which worker gets which GPU id? Print the device right before torch.cuda.set_device(device).

You are using the chunks from the README to run these, aren't you?

roudimit · 2022-07-25T19:17:50Z

Thanks for suggestion 2. I didn't realize that --nproc_per_node should be set to the number of GPUs on the node. It worked once I did that.
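
For anyone hitting the same `invalid device ordinal` error, a check like the following makes the mismatch explicit. It is only a sketch: `validate_local_rank` is a hypothetical helper, not part of the script, and it assumes each worker uses its local rank as the CUDA device index. In the real script the device count would come from `torch.cuda.device_count()`; it is passed in as a plain integer here so the example runs without a GPU.

```python
def validate_local_rank(local_rank: int, device_count: int) -> int:
    """Return local_rank if a matching GPU exists, else raise a readable error.

    Hypothetical helper for illustration; assumes rank i uses CUDA device i.
    """
    if device_count <= 0:
        raise RuntimeError("No CUDA devices are visible to this process.")
    if not 0 <= local_rank < device_count:
        raise RuntimeError(
            f"local_rank={local_rank} has no matching GPU: only "
            f"{device_count} device(s) are visible. Set --nproc_per_node "
            "to the number of GPUs on the node."
        )
    return local_rank


if __name__ == "__main__":
    # Two workers launched on a node with a single GPU: rank 1 has no device.
    for rank in (0, 1):
        try:
            print("rank", rank, "-> device", validate_local_rank(rank, 1))
        except RuntimeError as err:
            print("rank", rank, "failed:", err)
```

Calling it just before `torch.cuda.set_device(device)` would also cover the earlier suggestion to print the device each worker receives.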

This is resolved for me since I got the script working, but I'll keep this issue open for visibility on the SPLITS= parsing issue.

By the way, there's a small typo in the readme for VGGSound variables:

NOW=`date +"%Y-%m-%dT%H-%M-%S" the`

should be

NOW=`date +"%Y-%m-%dT%H-%M-%S"`

v-iashin · 2022-07-25T19:38:25Z

OK, great! If you have a moment, feel free to create a PR. I would be happy to merge it.

v-iashin added the "good first issue" label Jul 25, 2022

v-iashin mentioned this issue Apr 11, 2023

sample error: KeyError: 'test' #28

Closed
