Detail:
The Alibaba benchmark script quantizes a float model on the fly with vLLM by passing --quantization fp8 and --kv_cache_dtype fp8.
It does not use a Quark-quantized model.
When running the benchmark for Mixtral 8x7B with the 'moe_final_v0.6.0_Nov19' build, we got garbage output.
Investigation showed the cause is missing code in vllm/model_executor/layers/quantization/fp8.py: for MoE models, the functions fuse_shuffle and moe_padding are not executed when vLLM quantizes the model itself, whereas both run correctly (without errors) when a pre-quantized model is passed to vLLM.
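The suspected control flow can be sketched as pseudocode. Only the names fuse_shuffle and moe_padding come from the report; the surrounding structure (process_weights_after_loading, checkpoint_is_prequantized, quantize_to_fp8) is an assumed illustration, not the actual fp8.py code:

```
# Pseudocode sketch of the weight post-processing in fp8.py (hypothetical structure).
def process_weights_after_loading(layer):
    if checkpoint_is_prequantized(layer):
        # Pre-quantized checkpoint path: shuffle/padding runs here, so MoE output is correct.
        fuse_shuffle(layer)
        moe_padding(layer)
    else:
        # On-the-fly fp8 quantization path (--quantization fp8):
        # weights are converted to fp8, but fuse_shuffle/moe_padding are
        # skipped, which produces garbage output for MoE models.
        quantize_to_fp8(layer)
        # Reported fix: also invoke fuse_shuffle(layer) and moe_padding(layer) here.
```

Under this reading, the fix is to make the on-the-fly quantization branch perform the same MoE weight shuffling and padding as the pre-quantized branch.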