Commit
Fix RuntimeError for moe on XPU: tensors found at least two devices (deepspeedai#5519)

There is the following error on XPU when running the unit tests in
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py, line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!

Fix it by converting the tensor to the same device.
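For context, a minimal sketch of the failure mode (not from the commit; it uses a CUDA device and made-up values as stand-ins for the XPU case): torch.tensor(...) builds its result on the CPU by default, so comparing it against a tensor that lives on the accelerator raises the device-mismatch RuntimeError, and moving it to new_capacity.device first avoids that.

import torch

# Assumption: a CUDA device stands in for XPU; fall back to CPU if unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"

new_capacity = torch.tensor(8, device=device)  # lives on the accelerator
num_tokens = 16                                # hypothetical mask1.size(0)

# Before the fix: torch.tensor(num_tokens) is created on the CPU, so on an
# accelerator the comparison inside min() raises
# "RuntimeError: Expected all tensors to be on the same device ..."
# capacity = min(new_capacity, torch.tensor(num_tokens))

# After the fix: move the token-count tensor to the same device first.
capacity = min(new_capacity, torch.tensor(num_tokens).to(new_capacity.device))
print(capacity)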

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
shiyang-weng and loadams authored May 21, 2024
1 parent 1d81967 commit 695d79e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion deepspeed/moe/sharded_moe.py
Original file line number Diff line number Diff line change
Expand Up @@ -220,7 +220,7 @@ def top1gating(logits: Tensor,
tp = 1 if groups.mpu is None else bwc_tensor_model_parallel_world_size(mpu=groups.mpu)
new_capacity = torch.ceil(new_capacity / tp).mul(tp).to(new_capacity.dtype)
# Make sure the capacity value does not exceed the number of tokens.
capacity = min(new_capacity, torch.tensor(mask1.size(0)))
capacity = min(new_capacity, torch.tensor(mask1.size(0)).to(new_capacity.device))

# Compute l_aux
me = torch.mean(gates, dim=0)
Expand Down
