Commit 4e7edaa

fix FI all2all with FI cutlass moe
Summary: Running FI Cutlass MoE with the FI a2av backend fails with:

```
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]     ) = self.prepare_finalize.prepare(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 115, in prepare
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]     flashinfer_alltoall_dispatch(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 239, in flashinfer_alltoall_dispatch
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]     all2all_manager.prepare_workspace,
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] AttributeError: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace'. Did you mean: 'prepare_workspace_tensor'?
(EngineCore_DP5 pid=104759) ERROR 11-05 14:09:51 [core.py:843] EngineCore failed to start.
```

After fixing the error above, the following error occurs:

```
(EngineCore_DP5 pid=821648)   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 817, in cutlass_fused_moe
(EngineCore_DP5 pid=821648)     return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
(EngineCore_DP5 pid=821648)   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 537, in cutlass_fused_moe
(EngineCore_DP5 pid=821648)     run_moe(
(EngineCore_DP5 pid=821648)   File "tvm_ffi/function.pxi", line 814, in tvm_ffi.core.Function.__call__
(EngineCore_DP5 pid=821648)   File "buck-out/v2/gen/fbcode/deeplearning/tvm_ffi/tvm_ffi/cython/__core__cython-lib__/19a62205b4ea2336/buck-headers/tvm_ffi_python_helpers.h", line 323, in _ZL43__pyx_pw_7tvm_ffi_4core_8Function_3__call__P7_objectS0_S0__tvm_ffi$core
(EngineCore_DP5 pid=821648)   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 706, in FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(...)#1}::operator()(...) const
(EngineCore_DP5 pid=821648)   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 248, in void FusedMoeRunner::runMoe(...)
(EngineCore_DP5 pid=821648) RuntimeError: Check failed: token_final_scales.value().dtype() == dl_float32 (int32 vs. float32) : Inconsistency of Tensor type: token_final_scales.value()
```

It seems the FlashInfer moe_prepare kernel always returns an int32 tensor, so convert the type back accordingly.

Differential Revision: D86345110

Signed-off-by: Xiaozhu <mxz297@gmail.com>
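The dtype fix relies on `torch.Tensor.view(dtype=...)` being a zero-copy bit reinterpretation: viewing the kernel's int32 output back as the dtype saved before the dispatch call recovers the original weight values exactly. A minimal sketch of the idea (the tensor values here are illustrative, not taken from the kernel):

```python
import torch

# The prepare kernel hands topk_weights back typed int32, even though the
# underlying bits are the original float32 routing weights.
orig = torch.tensor([0.25, 0.5, 1.0], dtype=torch.float32)

# Simulate the kernel output: same storage, reinterpreted as int32 (zero-copy).
as_int32 = orig.view(dtype=torch.int32)
assert as_int32.dtype == torch.int32

# The fix in this commit: view the bits back as the dtype recorded before
# the alltoall prepare call.
restored = as_int32.view(dtype=torch.float32)
assert restored.dtype == torch.float32
assert torch.equal(restored, orig)
```

Because only the dtype metadata changes, this conversion allocates no new memory and preserves the exact bit pattern of every weight.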
1 parent e156017 commit 4e7edaa

1 file changed: +3 −1 lines changed

vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -233,12 +233,13 @@ def flashinfer_alltoall_dispatch(
     max_num_token = (
         max(global_num_tokens_cpu) if global_num_tokens_cpu is not None else x.shape[0]
     )
+    orig_topk_weights_dtype = topk_weights.dtype
     alltoall_info, topk_ids, topk_weights, _ = (
         MnnvlMoe.mnnvl_moe_alltoallv_prepare_without_allgather(
             topk_ids,
             topk_weights,
             None,
-            all2all_manager.prepare_workspace,
+            all2all_manager.prepare_workspace_tensor,
             max_num_token,
             ep_rank,
             ep_size,
@@ -247,6 +248,7 @@ def flashinfer_alltoall_dispatch(
             top_k,
         )
     )
+    topk_weights = topk_weights.view(dtype=orig_topk_weights_dtype)
 
     x, x_sf = moe_kernel_quantize_input(
         x,
```
