
HPUAccelerator: remove support in set_visible_devices_envs #5929

Merged: 3 commits into microsoft:master from fix_hpu_module_ids on Aug 14, 2024

Conversation

@nelyahu (Contributor) commented Aug 13, 2024

The way DeepSpeed sets this environment variable is not correct for all HPU instances and may lead to incorrect behavior.

@delock (Collaborator) commented Aug 13, 2024

Hi @nelyahu, the link https://docs.habana.ai/en/latest/PyTorch/Reference/PT_Multiple_Tenants_on_HPU/Multiple_Workloads_Single_Docker.html specified in the code mentions that HABANA_VISIBLE_MODULES is used for multi-tenancy on HPU. How does this usage conflict with set_visible_devices_envs() in DeepSpeed? Thanks!

@nelyahu (Contributor, Author) commented Aug 13, 2024

@delock On the Intel Gaudi accelerator (HPU), the visible module IDs are determined by the hl-smi -L command; they are not always 0,1,2,3,4,5,6,7, especially when running multiple VMs on the same server, where each VM can get different module IDs.
For example, with 2 VMs:
VM#1 can get module IDs 1,4,5,2
VM#2 can get 7,0,3,6
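
A minimal sketch of the mismatch (illustration only, not the PR's code; HABANA_VISIBLE_MODULES is the env var named in the Habana docs linked above, the rest is assumed):

import os

# Module IDs this VM actually owns, as hl-smi -L might report them.
visible_modules = [1, 4, 5, 2]  # e.g. VM#1 from the example above

# A launcher that assumes module IDs are always the contiguous range 0..N-1
# would write plain rank indices into HABANA_VISIBLE_MODULES:
local_rank_ids = list(range(len(visible_modules)))  # [0, 1, 2, 3]
os.environ["HABANA_VISIBLE_MODULES"] = ",".join(map(str, local_rank_ids))

# Ranks 0 and 3 now point at modules 0 and 3, which this VM does not own
# (it owns {1, 4, 5, 2}), so device selection misbehaves or fails.
missing = set(local_rank_ids) - set(visible_modules)
print(f"module IDs requested but not owned by this VM: {sorted(missing)}")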

@delock (Collaborator) commented Aug 13, 2024

Hi @nelyahu, from the title and description, this PR intends to avoid calling set_visible_devices_envs on this line:

get_accelerator().set_visible_devices_envs(current_env, local_accelerator_ids)

This PR also nullifies the following line, which does not call set_visible_devices_envs; is that also intended?

visible_devices = os.environ.get(visible_devices_env, "")

> @delock On the Intel Gaudi accelerator (HPU), the visible module IDs are determined by the hl-smi -L command; they are not always 0,1,2,3,4,5,6,7, especially when running multiple VMs on the same server, where each VM can get different module IDs. For example, with 2 VMs: VM#1 can get module IDs 1,4,5,2 and VM#2 can get 7,0,3,6.
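
For context, a rough sketch of how the two quoted launcher lines relate to each other (names and control flow are approximate, not the actual DeepSpeed launcher):

import os

def build_node_env(current_env, local_accelerator_ids, visible_devices_env):
    # Second quoted line: read whatever restriction the parent environment
    # already imposes (e.g. set by the VM or container).
    inherited = os.environ.get(visible_devices_env, "")
    print(f"inherited {visible_devices_env}={inherited!r}")

    # First quoted line: the accelerator then writes the IDs chosen for this
    # node into the environment handed to the worker processes.
    current_env[visible_devices_env] = ",".join(map(str, local_accelerator_ids))
    return current_env

# On HPU, the IDs written here were plain indices, which is what the
# comments above describe as unsafe when a VM owns modules 1,4,5,2.
print(build_node_env({}, [0, 1, 2, 3], "HABANA_VISIBLE_MODULES"))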

@nelyahu (Contributor, Author) commented Aug 13, 2024

@delock I want to avoid DeepSpeed accessing this env var at all; it was indeed causing an index-out-of-range issue there.
Maybe we should replicate the CPU accelerator behavior for now, which uses CUDA_VISIBLE_DEVICES.
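
For reference, a sketch of the CPU-accelerator-style behavior mentioned above (approximate shape only, not the exact DeepSpeed source):

class CPULikeAccelerator:
    # Expose CUDA_VISIBLE_DEVICES as the only visible-devices env var and
    # write the launcher-provided accelerator IDs into it verbatim.
    def visible_devices_envs(self):
        return ["CUDA_VISIBLE_DEVICES"]

    def set_visible_devices_envs(self, current_env, local_accelerator_ids):
        for env in self.visible_devices_envs():
            current_env[env] = ",".join(map(str, local_accelerator_ids))

# Usage sketch:
env = {}
CPULikeAccelerator().set_visible_devices_envs(env, [0, 1])
print(env)  # {'CUDA_VISIBLE_DEVICES': '0,1'}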

@nelyahu force-pushed the fix_hpu_module_ids branch from 1a3c304 to 5b9f50e on August 13, 2024 18:53
@nelyahu requested review from awan-10 and arashb as code owners on August 13, 2024 18:53
@delock (Collaborator) commented Aug 14, 2024

Hi @nelyahu, the latest push added cosmetic changes to 8 other files, which are unrelated.

@nelyahu force-pushed the fix_hpu_module_ids branch from 5b9f50e to fc382db on August 14, 2024 06:35
@nelyahu (Contributor, Author) commented Aug 14, 2024

> Hi @nelyahu, the latest push added cosmetic changes to 8 other files, which are unrelated.

Yes, a formatter version issue. Fixed.

@loadams loadams added this pull request to the merge queue Aug 14, 2024
@loadams loadams removed this pull request from the merge queue due to a manual request Aug 14, 2024
@loadams loadams merged commit a8d1b44 into microsoft:master Aug 14, 2024
12 checks passed