Issues with the sampling script #21

Open
roudimit opened this issue Jul 25, 2022 · 3 comments
Labels
good first issue Good for newcomers

Comments

@roudimit

Hi Vladimir, thanks for the great project / repo!

I’m having issues with the sampling script.
First, there seems to be an issue parsing the splits variable, i.e., SPLITS="\"[test, ]\""
The script crashes with the following error: https://gist.github.com/roudimit/5c76893a380b999508401820c3fb00e4 (put in a gist since it’s so long).
I guess OmegaConf can't parse this environment variable:

config_cli = OmegaConf.from_cli()

I made a temporary workaround by setting SPLITS= and hardcoding the test set here
for split in cfg.sampler.splits:
and here
dsets.datasets = {split: dset for split, dset in dsets.datasets.items() if split in cfg.sampler.splits}
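As an alternative to hardcoding, the value could be normalized before it is used. The following is only a minimal sketch: `normalize_splits` is a hypothetical helper that is not part of the repo, and it assumes the value arrives either as a string with leftover quotes and brackets or as an actual list.

```python
def normalize_splits(raw):
    """Turn a splits value into a clean list of split names.

    Hypothetical helper: accepts a string such as '"[test, ]"' or a real list.
    """
    if isinstance(raw, str):
        # Drop stray quotes and brackets left over from shell escaping.
        cleaned = raw.strip().strip("\"'").strip().strip("[]")
        items = cleaned.split(",")
    else:
        items = list(raw)
    # Remove surrounding whitespace and skip empty entries (e.g. a trailing comma).
    return [str(item).strip() for item in items if str(item).strip()]


splits = normalize_splits('"[test, ]"')
```

With something like this, `cfg.sampler.splits` could be passed through the helper once, and both places that read it would see a plain list.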

The next problem seems to be with the GPU assignment. Here is the error I get after hardcoding the test set.

Traceback (most recent call last):
  File "evaluation/generate_samples.py", line 303, in <module>
Traceback (most recent call last):
  File "evaluation/generate_samples.py", line 303, in <module>
    main()
  File "evaluation/generate_samples.py", line 299, in main
    sample(local_rank, cfg, samples_split_dirs, is_ddp)
  File "evaluation/generate_samples.py", line 262, in sample
    main()
  File "evaluation/generate_samples.py", line 299, in main
    torch.cuda.set_device(device)
  File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    sample(local_rank, cfg, samples_split_dirs, is_ddp)
  File "evaluation/generate_samples.py", line 262, in sample
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    torch.cuda.set_device(device)
  File "/home/gridsan/roudi/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process `9479` closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 9480) of binary: /home/gridsan/roudi/.conda/envs/specvqgan/bin/python

I tried both the model that I trained and your model, i.e., EXPERIMENT_PATH="./logs/2021-07-30T21-34-25_vggsound_transformer".
I also tried with 1 and 2 GPUs, and on different machines, and the same error comes up. I am using the specvqgan conda environment. Do you have any idea about this? Thanks!

@v-iashin
Owner

Hi,

Thanks a lot for filing the issue and providing a thorough traceback!

I don't have much time to look into this at the moment, but I will probably get to it next week or so. Here are a few comments:

  1. Yes, I must admit those evaluation bits don't look good, as I was mostly relying on the batch file! Thanks for the gist, by the way. Its last line suggests that you are right and SPLITS was not parsed correctly. I would tweak the command until cfg.sampler.splits gives a list, as expected.
  2. Could you check which workers return which GPU id? Print the device before torch.cuda.set_device(device).
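The check in point 2 could look roughly like the sketch below. `check_device` is a hypothetical helper, not code from the repo. It assumes the device passed to `torch.cuda.set_device` is the worker's local rank, and that in the real script `visible_gpus` would come from `torch.cuda.device_count()`.

```python
def check_device(local_rank, visible_gpus):
    """Print the worker's device id and fail early if no such GPU exists.

    Hypothetical helper: visible_gpus is assumed to be the value of
    torch.cuda.device_count() in the real script.
    """
    print(f"worker local_rank={local_rank}, visible GPUs={visible_gpus}")
    if not 0 <= local_rank < visible_gpus:
        raise ValueError(
            f"local_rank {local_rank} does not match any of the "
            f"{visible_gpus} visible GPU(s)"
        )
    return local_rank


device = check_device(0, 2)
```

Calling it right before `torch.cuda.set_device(device)` would turn the "invalid device ordinal" error into a message that names the offending worker.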

You are using the chunks from the README to run these, aren't you?

@v-iashin v-iashin added the good first issue Good for newcomers label Jul 25, 2022
@roudimit
Author

Thanks for suggestion 2. I didn't realize that --nproc_per_node should be set to the number of GPUs on the node. It worked once I did that.
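For anyone hitting the same error, here is a minimal sketch of building the launch command so that the process count follows the GPU count. `build_launch_cmd` is a hypothetical helper, and using `python -m torch.distributed.run` as the launcher is an assumption; keep whichever launcher the README uses and only adjust `--nproc_per_node`.

```python
def build_launch_cmd(script, num_gpus, extra_args=()):
    """Build a launcher command with one process per GPU on the node.

    Hypothetical helper: the launcher module is an assumption, and num_gpus
    is assumed to be the number of GPUs actually available on this node.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={num_gpus}",  # one worker per GPU
        script,
        *extra_args,
    ]


cmd = build_launch_cmd("evaluation/generate_samples.py", 1)
```

The launcher gives each worker a local rank from 0 to `nproc_per_node - 1`, so asking for two processes on a one-GPU node leaves the worker with local rank 1 without a device, which matches the failing `local_rank: 1` in the log above.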

This is resolved for me since I got the script working, but I'll keep this issue open for visibility on the SPLITS= parsing issue.

By the way, there's a small typo in the readme for VGGSound variables:

NOW=`date +"%Y-%m-%dT%H-%M-%S" the`

should be

NOW=`date +"%Y-%m-%dT%H-%M-%S"`

@v-iashin
Owner

OK, great! If you have a moment, feel free to create a PR. I would be happy to merge it.
