
HPUAccelerator: remove support in set_visible_devices_envs #5929

Merged: 3 commits into microsoft:master from fix_hpu_module_ids on Aug 14, 2024

Conversation

@nelyahu (Contributor) commented Aug 13, 2024

The way DeepSpeed sets this environment variable is not correct for all HPU instances and may lead to incorrect behavior.

@delock (Collaborator) commented Aug 13, 2024

Hi @nelyahu, the link https://docs.habana.ai/en/latest/PyTorch/Reference/PT_Multiple_Tenants_on_HPU/Multiple_Workloads_Single_Docker.html specified in the code mentions that HABANA_VISIBLE_MODULES is used for multi-tenancy on HPU. How does this usage conflict with set_visible_devices_envs() in DeepSpeed? Thanks!

@nelyahu (Contributor, Author) commented Aug 13, 2024

@delock On the Intel Gaudi accelerator (HPU), the visible module IDs are determined by the hl-smi -L command; they are not always 0,1,2,3,4,5,6,7, especially when running multiple VMs on the same server, where each VM can get different module IDs.
For example, with 2 VMs:
VM#1 can get module IDs 1,4,5,2
VM#2 can get 7,0,3,6
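
A minimal sketch of the mismatch (illustration only, not the PR's code; HABANA_VISIBLE_MODULES is the env var named in the Habana docs linked above, the rest is assumed):

import os

# Module IDs this VM actually owns, as hl-smi -L might report them.
visible_modules = [1, 4, 5, 2]  # e.g. VM#1 from the example above

# A launcher that assumes module IDs are always the contiguous range 0..N-1
# would write plain rank indices into HABANA_VISIBLE_MODULES:
local_rank_ids = list(range(len(visible_modules)))  # [0, 1, 2, 3]
os.environ["HABANA_VISIBLE_MODULES"] = ",".join(map(str, local_rank_ids))

# Ranks 0 and 3 now point at modules 0 and 3, which this VM does not own
# (it owns {1, 4, 5, 2}), so device selection misbehaves or fails.
missing = set(local_rank_ids) - set(visible_modules)
print(f"module IDs requested but not owned by this VM: {sorted(missing)}")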

@delock (Collaborator) commented Aug 13, 2024

Hi @nelyahu, from the title and description, this PR intends to avoid calling set_visible_devices_envs on this line:

get_accelerator().set_visible_devices_envs(current_env, local_accelerator_ids)

This PR also nullifies the following line, which does not call set_visible_devices_envs; is that also intended?

visible_devices = os.environ.get(visible_devices_env, "")

> @delock On the Intel Gaudi accelerator (HPU), the visible module IDs are determined by the hl-smi -L command; they are not always 0,1,2,3,4,5,6,7, especially when running multiple VMs on the same server, where each VM can get different module IDs. For example, with 2 VMs: VM#1 can get module IDs 1,4,5,2 and VM#2 can get 7,0,3,6.
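
For context, a rough sketch of how the two quoted launcher lines relate to each other (names and control flow are approximate, not the actual DeepSpeed launcher):

import os

def build_node_env(current_env, local_accelerator_ids, visible_devices_env):
    # Second quoted line: read whatever restriction the parent environment
    # already imposes (e.g. set by the VM or container).
    inherited = os.environ.get(visible_devices_env, "")
    print(f"inherited {visible_devices_env}={inherited!r}")

    # First quoted line: the accelerator then writes the IDs chosen for this
    # node into the environment handed to the worker processes.
    current_env[visible_devices_env] = ",".join(map(str, local_accelerator_ids))
    return current_env

# On HPU, the IDs written here were plain indices, which is what the
# comments above describe as unsafe when a VM owns modules 1,4,5,2.
print(build_node_env({}, [0, 1, 2, 3], "HABANA_VISIBLE_MODULES"))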

@nelyahu (Contributor, Author) commented Aug 13, 2024

@delock I want to avoid DeepSpeed accessing this env var at all; it was indeed causing an index-out-of-range issue there.
Maybe we should replicate the CPU accelerator behavior for now, which uses CUDA_VISIBLE_DEVICES.
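
For reference, a sketch of the CPU-accelerator-style behavior mentioned above (approximate shape only, not the exact DeepSpeed source):

class CPULikeAccelerator:
    # Expose CUDA_VISIBLE_DEVICES as the only visible-devices env var and
    # write the launcher-provided accelerator IDs into it verbatim.
    def visible_devices_envs(self):
        return ["CUDA_VISIBLE_DEVICES"]

    def set_visible_devices_envs(self, current_env, local_accelerator_ids):
        for env in self.visible_devices_envs():
            current_env[env] = ",".join(map(str, local_accelerator_ids))

# Usage sketch:
env = {}
CPULikeAccelerator().set_visible_devices_envs(env, [0, 1])
print(env)  # {'CUDA_VISIBLE_DEVICES': '0,1'}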

@nelyahu force-pushed the fix_hpu_module_ids branch from 1a3c304 to 5b9f50e on August 13, 2024 18:53
@nelyahu requested review from awan-10 and arashb as code owners on August 13, 2024 18:53
@delock (Collaborator) commented Aug 14, 2024

Hi @nelyahu, the latest push added cosmetic changes to 8 other files, which are unrelated.

@nelyahu force-pushed the fix_hpu_module_ids branch from 5b9f50e to fc382db on August 14, 2024 06:35
@nelyahu (Contributor, Author) commented Aug 14, 2024

> Hi @nelyahu, the latest push added cosmetic changes to 8 other files, which are unrelated.

Yes, a formatter version issue. Fixed.

@loadams loadams added this pull request to the merge queue Aug 14, 2024
@loadams loadams removed this pull request from the merge queue due to a manual request Aug 14, 2024
@loadams loadams merged commit a8d1b44 into microsoft:master Aug 14, 2024
12 checks passed