
@kip-cxj commented Oct 21, 2025

Modification overview

Add support for broadcast mode on Huawei Ascend NPUs; P2P mode is still being adapted.

Environment

| Software | Version |
| --- | --- |
| npu-driver | 25.3.rc1 |
| CANN | 8.3.RC1 |
| Python | 3.11 |
| torch | 2.7.1 |
| torch_npu | 2.7.1dev20251016 |
| vllm | 0.11.0 |
| vllm-ascend | 0.11.0rc0 |

Checklist

  • Code has been self-tested.
    We tested this PR on Ascend NPUs; the test setup was 8× Atlas 800T A2. We don't have GPUs, so no GPU-related testing was done.

@weixiao-huang (Collaborator) commented:
Please fix the pre-commit lint error.

@MoonshotAI deleted a comment from specture724 Oct 22, 2025
Inline review comment on the following diff:

```python
    device_uuid = current_platform.get_device_uuid(self.device.index)
elif current_platform.device_type == "npu":
    device_uuid = (
        f"NPU-{current_platform.get_device_name(self.device.index)!s}-{self.device.index}"
    )
```
Collaborator:
Is this UUID unique for each device? With CUDA, we can set CUDA_VISIBLE_DEVICES to override the device index, so two different processes may end up with the same device_uuid. I'm not sure whether NPU has this problem.
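
A quick illustration of the CUDA-side concern (illustrative; requires a CUDA build of torch): each process renumbers its visible devices from 0, so processes pinned to different physical GPUs can report the same index.

```python
import os
import torch

# Launched as: CUDA_VISIBLE_DEVICES=3 python check_index.py
# torch renumbers the visible devices from 0, so a process pinned to
# physical GPU 3 still reports device index 0. Two such processes on
# different GPUs would derive the same index-based device_uuid.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "3"
print(torch.cuda.current_device())             # 0
```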

Author:

That's a valid concern. torch_npu has not implemented get_device_uuid. I don't think NPU has this problem, so we are currently using the global rank ID as the UUID.
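
A minimal sketch of that rank-based fallback, assuming the default process group is already initialized; the identifier format here is illustrative, not this PR's actual code (the diff above builds the UUID from device name plus index instead).

```python
import torch.distributed as dist

# Rank-based fallback when torch_npu provides no get_device_uuid.
# Only safe if ranks are unique across every process that will
# compare these UUIDs (the concern raised in the next comment).
device_uuid = f"NPU-{dist.get_rank()}"
```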

Collaborator:

But the global rank may differ between inference and the ps. If a machine has 8 NPU devices, the ps will have ranks 0 through 7. But if the inference engine uses TP=1, there may be 8 independent inference engines, each of which sees rank 0 and gets the same device_uuid. I think this may cause a potential bug.

Author:

I attempted to set a UUID on the Ascend device. Without native API support, I haven't found an ideal approach yet, only two suboptimal options:

  1. Use an environment variable to obtain the rank ID on the ps, while using torch.distributed.get_rank() to get the rank ID in vLLM. Under default configurations, I think these two should be consistent.

  2. Use subprocess to query npu-smi info (the NPU equivalent of nvidia-smi for GPUs) and match against the PID to locate the physical device ID. This physical ID can then be combined with the server IP to form a UUID. However, this approach incurs significant time overhead and is not concise (see the sketch after this list).
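
A rough sketch of option 2, for illustration only: npu_device_uuid is a hypothetical helper (not part of this PR), and the npu-smi parsing is an assumption, since the table layout varies across driver versions.

```python
import os
import socket
import subprocess

def npu_device_uuid(pid: int | None = None) -> str:
    """Hypothetical: map a PID to a physical NPU ID via `npu-smi info`,
    then combine it with the host IP to form a pseudo-UUID."""
    pid = pid if pid is not None else os.getpid()
    out = subprocess.run(
        ["npu-smi", "info"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        fields = line.split()
        # Assumed layout: a process-table row lists the NPU ID first and
        # contains the PID somewhere; adjust to your driver's real output.
        if str(pid) in fields:
            host_ip = socket.gethostbyname(socket.gethostname())
            return f"NPU-{host_ip}-{fields[0]}"
    raise RuntimeError(f"PID {pid} not found in npu-smi output")
```

As noted above, shelling out to npu-smi on every lookup is slow; caching the result once per process would amortize the cost.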

@ZSL98 commented Oct 24, 2025

Hi, I can't pass the test_update.py test.
