
[Fix] fix_vllm_moe_quant #342

Merged

Conversation

@belovedxixi commented Dec 20, 2024

Detail:
The Alibaba benchmark test script quantizes a float model with vLLM on the fly by passing --quantization fp8 and --kv_cache_dtype fp8; it does not use a Quark-quantized model. When running the benchmark for Mixtral 8x7B on the 'moe_final_v0.6.0_Nov19' branch with this setup, we got garbage output. Investigation showed the cause was missing code in vllm/model_executor/layers/quantization/fp8.py: for MoE models, the fuse_shuffle and moe_padding functions were not executed when vLLM quantized the model itself, while they ran correctly (without errors) when a pre-quantized model was passed to vLLM.
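For context, here is a minimal sketch of the control-flow issue described above, assuming a structure like vLLM's Fp8MoEMethod.process_weights_after_loading. The helper names fuse_shuffle and moe_padding come from this description, but their bodies, the checkpoint_is_fp8_serialized flag, and the layer attribute names are illustrative assumptions, not the merged diff:

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the two functions named above; the real
# implementations reorder and pad the MoE weights into the layout
# expected by the fused ROCm MoE kernel.
def fuse_shuffle(w: torch.Tensor) -> torch.Tensor:
    return w  # real version shuffles the weight layout


def moe_padding(w: torch.Tensor, align: int = 256) -> torch.Tensor:
    pad = (-w.shape[-1]) % align
    return F.pad(w, (0, pad)) if pad else w


def process_weights_after_loading(layer, checkpoint_is_fp8_serialized: bool) -> None:
    if not checkpoint_is_fp8_serialized:
        # Dynamic path: vLLM quantizes the float weights to fp8 here
        # (the quantization itself is elided in this sketch).
        pass
    # The fix: the two layout-preparation steps must run on BOTH paths.
    # Before, they ran only when a pre-quantized checkpoint was loaded,
    # so dynamically quantized MoE weights reached the fused kernel in
    # the wrong layout and produced garbage output.
    layer.w13_weight.data = moe_padding(fuse_shuffle(layer.w13_weight.data))
    layer.w2_weight.data = moe_padding(fuse_shuffle(layer.w2_weight.data))
```

In other words, the shuffle and padding calls are moved out of the checkpoint-quantized branch so they apply regardless of how the fp8 weights were produced.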

@gshtras gshtras merged commit d567353 into llama_fp8_12062024 Dec 20, 2024
3 of 4 checks passed
@gshtras gshtras deleted the moe_final_v0.6.0_Nov19_fix_dynamic_quant branch December 20, 2024 15:23