
CUDA_VISIBLE_DEVICES isn't being respected / hostfile doesn't quite work for one node #662

Closed
@stas00

Description

I'm trying to experiment with DeepSpeed on a single GPU, and it's not respecting CUDA_VISIBLE_DEVICES.

I run the script as:

CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ...

but it runs on GPU 0, ignoring CUDA_VISIBLE_DEVICES=1.
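
For reference, a tiny probe like this (a hypothetical helper, not my actual finetune_trainer.py) makes it easy to see what the launched process actually gets:

# probe.py - hypothetical helper, not part of my training script
import os
import torch

# if the launcher respected CUDA_VISIBLE_DEVICES=1, the variable inside the
# launched process would still be "1" and exactly one device would be visible
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible device count =", torch.cuda.device_count())
print("current device index =", torch.cuda.current_device())
print("current device name  =", torch.cuda.get_device_name(torch.cuda.current_device()))

Launching it the same way (CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 probe.py) should make it obvious whether the launcher overrides the variable.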

Then I tried to use the deepspeed launcher flags as explained here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node and ran into multiple issues:

  1. I think the --hostfile CLI arg in the example is in the wrong place. Shouldn't it come right after deepspeed rather than among the client's args? That is, instead of:
deepspeed <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile
it should be:
deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json

This is a launcher arg, not a client arg.

  2. It can't handle a hostfile with a single entry (see the sketch after this list for my guess at the cause):
$ cat hostfile
worker-1 slots=2
$ deepspeed --hostfile hostfile  ./finetune_trainer.py ...
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 259, in main
    resource_pool = fetch_hostfile(args.hostfile)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 133, in fetch_hostfile
    raise err
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 127, in fetch_hostfile
    hostname, slots = line.split()
ValueError: not enough values to unpack (expected 2, got 0)
  3. It can't handle inclusions or exclusions without a hostfile (the docs are misleading here).
    Copy-and-pasting the very last code example from the docs:
$ deepspeed --exclude="worker-1:0" ./finetune_trainer.py
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 272, in main
    active_resources = parse_inclusion_exclusion(resource_pool,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 240, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 187, in parse_resource_filter
    raise ValueError("Hostname '{}' not found in hostfile".format(hostname))
ValueError: Hostname 'worker-1' not found in hostfile

I think the docs are wrong/misleading here. They suggest:

You can instead include or exclude specific resources using the --include and --exclude flags. For example, to use all available resources except GPU 0 on node worker-2 and GPUs 0 and 1 on worker-3:

  • but they don't specify that the hostfile is actually needed.
  • and the error message is misleading: which hostfile is it talking about? I haven't passed it any hostfile in this experiment, and if it found one in the current dir, that hostfile does have worker-1 in it (see the cat hostfile output earlier). So it should not just say "in hostfile" but "in /path/to/hostfile".
  • I think in this particular situation it should say: "a hostfile hasn't been provided and it's required".
  4. And this is not the right solution anyway, since it tries to ssh to worker-1:
subprocess.CalledProcessError: Command '['ssh worker-1 hostname -I']' returned non-zero exit status 255.
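
Coming back to item 2: the traceback shows fetch_hostfile doing hostname, slots = line.split() per line, so my guess (a reconstruction, I haven't verified it against the actual runner.py code) is that a blank line, e.g. a trailing empty line in the hostfile, yields zero tokens and triggers the unpack error:

# my reconstruction of the fetch_hostfile failure, not the real DeepSpeed code
lines = ["worker-1 slots=2", ""]        # one real entry plus a blank/trailing line

try:
    for line in lines:
        hostname, slots = line.split()  # "" splits into 0 tokens
except ValueError as e:
    print(e)                            # not enough values to unpack (expected 2, got 0)

# what I'd expect instead: skip blank lines and parse the slot count
resource_pool = {}
for line in lines:
    line = line.strip()
    if not line:
        continue
    hostname, slots = line.split()
    resource_pool[hostname] = int(slots.split("=")[1])    # "slots=2" -> 2
print(resource_pool)                                       # {'worker-1': 2}

If that's indeed the cause, skipping blank/whitespace-only lines in fetch_hostfile would make a one-entry hostfile work.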

So how does one configure deepspeed to use a specific GPU on a single node?

Thank you!
