CUDA_VISIBLE_DEVICES isn't being respected / hostfile doesn't quite work for one node #662
Closed
Description
opened on Jan 12, 2021
I'm trying to experiment with DeepSpeed on a single GPU and it's not respecting CUDA_VISIBLE_DEVICES.
I run the script as:
CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ...
but it runs on GPU 0, ignoring CUDA_VISIBLE_DEVICES=1.
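A quick way to check what I'd expect the launched process to see (sketch only, assuming a plain PyTorch install):
CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count(), torch.cuda.current_device())"
# expected to print: 1 0 -- only physical GPU 1 visible, exposed to the process as device 0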
Then I tried to use the deepspeed launcher flags as explained here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node and encountered multiple issues:
- I think the --hostfile CLI arg in the example is in the wrong place. Shouldn't it come right after deepspeed and not in the client's args? That is, instead of:
deepspeed <client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile
it should be:
deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
This is a launcher arg, not a client arg.
- It can't handle a hostfile with a single entry (a sketch of a more tolerant parse follows the traceback below):
$ cat hostfile
worker-1 slots=2
$ deepspeed --hostfile hostfile ./finetune_trainer.py ...
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
main()
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 259, in main
resource_pool = fetch_hostfile(args.hostfile)
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 133, in fetch_hostfile
raise err
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 127, in fetch_hostfile
hostname, slots = line.split()
ValueError: not enough values to unpack (expected 2, got 0)
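The "got 0" suggests an empty line (probably the trailing newline) is being split. Just to illustrate the idea, a minimal sketch of a more tolerant parse - not DeepSpeed's actual fetch_hostfile, only what I'd expect it to do:
def fetch_hostfile(hostfile_path):
    # sketch: skip blank lines and comments before splitting, so a hostfile
    # that ends in a newline (or contains comments) doesn't raise ValueError
    resource_pool = {}
    with open(hostfile_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            hostname, slots = line.split()
            _, slot_count = slots.split("=")
            resource_pool[hostname] = int(slot_count)
    return resource_pool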
- It can't handle exclusions or inclusions without a hostfile (the docs are misleading here).
Copying and pasting from the docs' very last code example:
$ deepspeed --exclude="worker-1:0" ./finetune_trainer.py
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
main()
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 272, in main
active_resources = parse_inclusion_exclusion(resource_pool,
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 240, in parse_inclusion_exclusion
return parse_resource_filter(active_resources,
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 187, in parse_resource_filter
raise ValueError("Hostname '{}' not found in hostfile".format(hostname))
ValueError: Hostname 'worker-1' not found in hostfile
I think the docs are wrong/misleading - they suggest:
"You can instead include or exclude specific resources using the --include and --exclude flags. For example, to use all available resources except GPU 0 on node worker-2 and GPUs 0 and 1 on worker-3:"
but they don't specify that a hostfile is actually needed (a sketch of what I think the example assumes is at the end of this list).
- And the error message is misleading: which hostfile is it talking about? I haven't passed any hostfile in this experiment, and if it found one in the current dir, that hostfile does have worker-1 in it (see cat hostfile earlier). So it should not just say "in hostfile" but "in /path/to/hostfile".
- I think in this particular situation it should say: "hostfile hasn't been provided and it's required".
- This is also not the right solution, since the launcher then tries to ssh to worker-1:
subprocess.CalledProcessError: Command '['ssh worker-1 hostname -I']' returned non-zero exit status 255.
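For completeness, here is what I assume the docs example actually requires (untested sketch; the slot counts and the exact filter string are my guess based on the quoted sentence): a hostfile that lists the workers being filtered, passed alongside the filter flag:
$ cat myhostfile
worker-2 slots=4
worker-3 slots=4
$ deepspeed --hostfile=myhostfile --exclude="worker-2:0@worker-3:0,1" \
    <client_entry.py> <client args> --deepspeed --deepspeed_config ds_config.json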
So how does one configure deepspeed to use a specific GPU on a single node?
Thank you!
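In the meantime, the workaround I'd try (unverified) is the launcher's own GPU filter against the implicit localhost resource pool, which I assume is what the launcher falls back to when no hostfile exists:
deepspeed --include localhost:1 ./finetune_trainer.py ... --deepspeed --deepspeed_config ds_config.json
i.e. select the GPU via the launcher flag rather than via CUDA_VISIBLE_DEVICES.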