[Feature][Quantization] MXFP4 support for MOE models #17888
Conversation
- wip
- wip & debug
- update
- cleanup
- use quark realquantizer for pack/quant/dequant
- comment on cudagraph issue; remove prints
- Keep only 1 place importing quark
- cudagraph issue resolved; dq weight at load time for efficiency
- lint
- turn on emulation based on platform
- add fused moe support - ugly wip running version
- Add envar if dequant weight at load time
- Mxfp4 memory leak fixes (#2)

Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <felmarty@amd.com>
Can you merge from main to fix pre-commit?
This pull request has merge conflicts that must be resolved before it can be merged.
… select the q/dq/qdq implem for mxfp4

Co-authored-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Hi @bnellnm, I addressed your comments and also made this compatible with the recent changes in vLLM for dynamo/inductor, guarding MXFP4 dequantization & QDQ in custom ops. Let me know if this looks good!
@bnellnm Concerning the CI, the failing tests seem to be the bitsandbytes tests that were already failing some weeks ago; I think they are unrelated.
Thanks, I'll take a look now. Bill is OOO for a bit |
A few comments left
```python
a1_scale=None,
a2_scale=None,
block_shape=None,
per_channel_quant=True,
```
It looks like you are still missing `activation=activation` here. Also, why does `per_channel_quant=True` need to be set for mxfp4?
I added `per_channel_quant=True` to address #17888 (comment); see https://github.com/fxmarty-amd/vllm/blob/e570709cfe79c3a43d3e777bb34e0adfa22788f3/vllm/model_executor/layers/fused_moe/utils.py#L87.
Later on in fused_moe.py we have `per_act_token_quant=per_channel_quant`:
vllm/vllm/model_executor/layers/fused_moe/fused_moe.py, lines 1340 to 1345 in 5358cce:

```python
qcurr_hidden_states, a1q_scale = moe_kernel_quantize_input(
    A=curr_hidden_states,
    A_scale=a1_scale,
    quant_dtype=qtype,
    per_act_token_quant=per_channel_quant,
    block_shape=block_shape)
```
Actually you are right, this is not compatible with vllm/vllm/model_executor/layers/fused_moe/utils.py, lines 127 to 144 in 5358cce:
```python
def _validate_scale_shape(
    a: torch.Tensor,
    a_scale: Optional[torch.Tensor],
    per_act_token_quant: bool,
    block_shape: Optional[list[int]],
) -> None:
    if a_scale is None:
        return

    if not per_act_token_quant and block_shape is None:
        assert a_scale.numel() == 1, f"{a_scale.shape}"
    elif per_act_token_quant:
        assert a_scale.shape[0] == a.shape[0] and a_scale.shape[1] == 1, (
            f"{a_scale.shape[0]} == {a.shape[0]} and {a_scale.shape[1]} == 1")
    else:
        assert block_shape is not None
        expected = (a.shape[0], cdiv(a.shape[1], block_shape[1]))
        assert a_scale.shape == expected, f"{a_scale.shape} == {expected}"
```
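To make the distinction concrete, here is a small standalone sketch (illustrative only, not vLLM code; the tensor names are made up) of the three scale layouts that the validation above accepts, for a `(num_tokens, hidden)` activation:

```python
import torch

# Hypothetical activation: 4 tokens, hidden size 128.
a = torch.randn(4, 128)

# Per-tensor quantization: a single scalar scale (numel() == 1).
per_tensor_scale = a.abs().amax().reshape(1)

# Per-act-token quantization: one scale per token, shape (num_tokens, 1).
per_token_scale = a.abs().amax(dim=1, keepdim=True)

# Block-wise quantization with block_shape=[1, 32]: one scale per
# 32-wide group along the hidden dimension, shape (num_tokens, hidden // 32).
block_shape = [1, 32]
per_block_scale = a.reshape(a.shape[0], -1, block_shape[1]).abs().amax(dim=-1)
```

Passing a scale of one layout while the flags describe another is exactly what trips the asserts above.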
So I removed per_channel_quant=True in 4ffff1d and will leave #17888 (comment) open. Does that sound ok?
Hmm, sorry, I'm not sure what the "right" way is here just from looking at it quickly.
@mgoin I reran tests in
This PR follows #16943 and adds the ability to load MOE models with MXFP4 weights, using dynamic per-group MXFP4 quantization for activations.
We have not yet released such models publicly, but expect to do so soon.
At the moment, execution on MI300 runs a simulated scheme: weights are dequantized on the fly, and QDQ is applied to activations on the fly, using HIP kernels.
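As a rough illustration of what such an emulated quantize-dequantize computes, here is a hypothetical PyTorch sketch following the OCP Microscaling (MX) layout for MXFP4 (groups of 32 elements sharing a power-of-two scale, E2M1 elements). This is not the HIP-kernel implementation from this PR, just a reference for the numerics:

```python
import torch

# Representable E2M1 (FP4) magnitudes in the MX format.
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_qdq(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Quantize-dequantize to MXFP4: each group shares a power-of-two
    (E8M0-style) scale; elements round to the nearest E2M1 value.
    Pure emulation -- no bit packing."""
    g = x.reshape(-1, group_size)
    # Shared scale so the group max maps into E2M1 range (max magnitude 6 = 1.5 * 2^2).
    amax = g.abs().amax(dim=-1, keepdim=True).clamp(min=torch.finfo(x.dtype).tiny)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2.0)
    # Round each scaled magnitude to the nearest representable FP4 value.
    idx = ((g / scale).abs().unsqueeze(-1) - FP4_VALUES).abs().argmin(dim=-1)
    q = FP4_VALUES[idx] * g.sign()
    return (q * scale).reshape(x.shape)
```

Values exactly representable at the group's scale pass through unchanged; everything else picks up rounding error from the 8-level magnitude grid.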
Left to do: