[Hardware] broadcast support for Huawei Ascend NPU #39
base: main
Conversation
Please fix the pre-commit lint error.
checkpoint_engine/worker.py
Outdated
```python
    device_uuid = current_platform.get_device_uuid(self.device.index)
elif current_platform.device_type == "npu":
    device_uuid = (
        f"NPU-{current_platform.get_device_name(self.device.index)!s}-{self.device.index}"
    )
```
Is this UUID unique for each device? With CUDA, we can set CUDA_VISIBLE_DEVICES to override the device index, so two different processes may end up with the same device_uuid. I'm not sure whether NPU has this problem.
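A minimal illustration of this concern (the launch commands and script name are hypothetical): two processes restricted to different physical GPUs both see their device as local index 0, so any UUID derived from the local index alone collides.

```python
import os
import torch

# Launched as (hypothetical):
#   CUDA_VISIBLE_DEVICES=0 python worker.py   # process A
#   CUDA_VISIBLE_DEVICES=1 python worker.py   # process B
# Both processes use different physical GPUs, yet both report
# local device index 0.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))
print(torch.cuda.current_device())  # prints 0 in both A and B
```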
That's a valid concern. torch_npu has not implemented get_device_uuid. I think NPU does not have this problem, so we are currently using the global rank ID as the UUID.
But the global rank may differ between the inference engine and the ps. If a machine has 8 NPU devices, the ps will have ranks 0 to 7. But if the inference engine uses TP=1, there may be 8 independent inference engines, each of which sees rank 0 and therefore gets the same device_uuid. I think this may cause a potential bug.
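A small sketch of why rank-keyed UUIDs collide in that setup: each TP=1 engine initializes its own process group of world size 1, so torch.distributed.get_rank() returns 0 in every engine.

```python
import torch.distributed as dist

# Each independent TP=1 engine forms its own process group of size 1,
# so every engine sees itself as rank 0.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",  # port is arbitrary for this demo
    world_size=1,
    rank=0,
)
print(dist.get_rank())        # 0 in every such engine
uuid = f"NPU-{dist.get_rank()}"  # collides across engines on one node
```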
I attempted to set a UUID on the Ascend device. Without native API support, I haven't found an ideal approach yet, only two suboptimal solutions:
- Use environment variables to obtain the rank ID on the ps, while using torch.distributed.get_rank() to get the rank ID in vLLM. Under default configurations, I think these two should be consistent.
- Use a subprocess to query npu-smi info (the NPU equivalent of nvidia-smi for GPUs) and combine it with the PID to locate the physical device ID. This physical ID can then be combined with the server IP to form a UUID. However, this approach incurs significant time overhead and is not concise. See the sketch after this list.
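A rough sketch of the second option, assuming npu-smi info prints a process table mapping PIDs to physical NPU IDs; the exact output layout and the column parsing here are assumptions, not a verified format:

```python
import os
import socket
import subprocess

def npu_device_uuid() -> str:
    """Best-effort UUID from the physical NPU ID plus the host IP (sketch)."""
    pid = os.getpid()
    # `npu-smi info` is the Ascend counterpart of nvidia-smi; we assume its
    # output includes a process table mapping PIDs to physical NPU IDs.
    out = subprocess.run(
        ["npu-smi", "info"], capture_output=True, text=True
    ).stdout
    physical_id = None
    for line in out.splitlines():
        if str(pid) in line:
            # Assumed layout: the physical NPU ID is the first column of
            # the matching process-table row.
            physical_id = line.split()[0]
            break
    host_ip = socket.gethostbyname(socket.gethostname())
    return f"NPU-{host_ip}-{physical_id}"
```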
Hi, I can't pass the test_update.py test.
Modification overview
Add support for broadcast mode on Huawei Ascend NPU; P2P mode is currently under adaptation.
Environment
Checklist
We tested this PR on Ascend NPU. The test environment is 8× Atlas 800T A2 devices. We don't have GPUs, so we did not do any GPU-related testing.