Add Initial Support for MXFP4 Linear #52

Ziyue-Intel · 2025-07-04T02:48:42Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: yiliu30 <yi4.liu@intel.com>

gemini-code-assist

Summary of Changes

Hello @Ziyue-Intel, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for MXFP4 (Microscaling FP4) quantization, specifically for 4-bit weights and 4-bit activations, within the compressed tensors framework. It includes an emulation mode for broader hardware compatibility and integrates the new quantization scheme into the model loading process. Additionally, it adds some temporary debugging utilities for model inspection.

Highlights

MXFP4 Quantization Support: Introduced a new CompressedTensorsW4A4MXFp4 scheme to support Microscaling FP4 (MXFP4) quantization for 4-bit weights and 4-bit activations, utilizing a group size of 32.
Emulation Mode for MXFP4: Added a VLLM_USE_MXFP4_CT_EMULATIONS environment variable and corresponding logic to enable software emulation of MXFP4 operations using torchao for platforms with compute capability less than 100, providing broader compatibility.
Integration into Quantization Framework: The new MXFP4 scheme is now recognized and integrated into the compressed tensors framework, allowing the system to detect and attempt to load models quantized with this format.
Debugging Utilities: Included temporary print statements and a helper function (print_model_state_dict) in the model loader to aid in debugging and inspecting the model's state dictionary during the loading process.
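
For readers unfamiliar with the format, the group-wise quantization described in the highlights can be sketched in plain Python. This is an illustrative sketch only, not the code in this PR and not torchao's implementation: it assumes FP4 E2M1 elements with one shared power-of-two scale per group of 32, and the names `quantize_group`, `dequantize_group` and `fake_quantize` are hypothetical. Clamping the shared exponent to the range of an 8-bit scale is omitted for brevity.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
GROUP_SIZE = 32    # group size named in the PR summary


def quantize_group(values):
    """Return (shared_exponent, fp4_values) for one group of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        scaled = abs(v) / scale
        # Nearest representable magnitude; ties go to the smaller candidate in
        # this sketch, and anything above 6.0 saturates to 6.0.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(scaled - m))
        quantized.append(math.copysign(mag, v))
    return shared_exp, quantized


def dequantize_group(shared_exp, fp4_values):
    """Map FP4 values back to floats by applying the shared scale."""
    return [q * 2.0 ** shared_exp for q in fp4_values]


def fake_quantize(values, group_size=GROUP_SIZE):
    """Quantize then dequantize, group by group, to emulate MXFP4 rounding."""
    out = []
    for start in range(0, len(values), group_size):
        exp, q = quantize_group(values[start:start + group_size])
        out.extend(dequantize_group(exp, q))
    return out
```

An emulation path typically applies this kind of quantize-then-dequantize step to weights and activations and then runs the matrix multiply in a higher precision. Whether this PR does exactly that is not stated in the summary.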


gemini-code-assist

Code Review

This pull request adds support for MXFP4 quantization, primarily through an emulation path. While the core functionality is present, the PR includes a significant amount of debugging code, such as print statements and a helper function with non-English comments, which must be removed. There are also critical bugs, including one that creates dead code in a key function, and other issues like incorrect enum comparisons and copy-pasted warning messages. These issues need to be addressed to ensure code quality and correctness before merging.

...del_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_mxfp4.py

vllm/model_executor/model_loader/base_loader.py

vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py

examples/offline_inference/basic/basic_local.py

vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py

...del_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_mxfp4.py

vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py

Copilot

Pull Request Overview

This PR introduces support for MXFP4-based quantization by adding a new CompressedTensorsW4A4MXFp4 scheme, associated emulation utilities, and a feature flag. It also temporarily adds debugging output in the model loader and updates example scripts.

Added VLLM_USE_MXFP4_CT_EMULATIONS env var and flag handling
Introduced run_mxfp4_emulations and new CompressedTensorsW4A4MXFp4 class
Inserted debug print statements in base_loader.py and updated examples
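
The flag handling summarized above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: the variable name `VLLM_USE_MXFP4_CT_EMULATIONS` and the capability threshold of 100 come from the review summaries, while the `"0"`/`"1"` parsing, the helper names, and the decision to raise when neither path applies are hypothetical.

```python
import os

ENV_NAME = "VLLM_USE_MXFP4_CT_EMULATIONS"   # name taken from the PR summary
NATIVE_MIN_CAPABILITY = 100                 # threshold taken from the PR summary


def mxfp4_emulation_enabled(environ=None):
    """Assumed parsing: unset or "0" means off, any other integer means on."""
    environ = os.environ if environ is None else environ
    return bool(int(environ.get(ENV_NAME, "0")))


def choose_mxfp4_path(capability, environ=None):
    """Hypothetical helper: pick "emulation" or "native" for a device capability."""
    if mxfp4_emulation_enabled(environ):
        return "emulation"
    if capability >= NATIVE_MIN_CAPABILITY:
        return "native"
    raise RuntimeError(
        f"MXFP4 needs capability >= {NATIVE_MIN_CAPABILITY} or {ENV_NAME}=1"
    )
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.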

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
vllm/model_executor/model_loader/base_loader.py	Added unguarded debug prints and `print_model_state_dict` utility
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py	Introduced emulation helper and unused stub `ref_mxfp4_quant`
vllm/model_executor/layers/quantization/compressed_tensors/utils.py	Extended quant formats to include `mxfp4_pack_quantized`
vllm/model_executor/layers/quantization/compressed_tensors/schemes/...	Added `compressed_tensors_w4a4_mxfp4.py` and registered the scheme
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py	Integrated `_is_fp4a4_mxfp4` check into scheme resolution
vllm/envs.py	Added `VLLM_USE_MXFP4_CT_EMULATIONS` environment variable
examples/offline_inference/basic/basic_local.py	Added multiple model path assignments in example script

Comments suppressed due to low confidence (1)

vllm/model_executor/model_loader/base_loader.py:53

[nitpick] The helper uses Chinese comments; replace with English or remove to maintain consistent project language and style.

def print_model_state_dict(model):
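
One way to address both the unguarded prints and the comment-language nitpick is to route the output through a logger. The sketch below is a hypothetical replacement, not the helper from this PR: `log_model_state_dict` is an invented name, and it assumes only that the model exposes a `state_dict()` mapping whose values have `shape` and `dtype` attributes.

```python
import logging

logger = logging.getLogger(__name__)


def log_model_state_dict(model, level=logging.DEBUG):
    """Log name, shape and dtype of each state-dict entry; return the entry count."""
    # Skip the loop entirely when the level is disabled, so normal runs pay nothing.
    if not logger.isEnabledFor(level):
        return 0
    count = 0
    for name, tensor in model.state_dict().items():
        shape = tuple(getattr(tensor, "shape", ()))
        dtype = getattr(tensor, "dtype", type(tensor).__name__)
        logger.log(level, "%s: shape=%s dtype=%s", name, shape, dtype)
        count += 1
    return count
```

With this shape the output only appears when debug logging is enabled for the module, so the helper can stay in the loader without affecting default runs.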

vllm/model_executor/model_loader/base_loader.py

vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py

vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py

examples/offline_inference/basic/basic_local.py

...del_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_mxfp4.py

vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py

vllm/model_executor/model_loader/base_loader.py

Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: He, Xin3 <xin3.he@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

yiliu30 and others added 2 commits July 2, 2025 09:42

mxfp4 WIP

d80a717

Signed-off-by: yiliu30 <yi4.liu@intel.com>

mxfp4-update-ziyue

9e2edc1

gemini-code-assist bot reviewed Jul 4, 2025

yiliu30 requested a review from Copilot July 4, 2025 02:50

gemini-code-assist bot reviewed Jul 4, 2025

Copilot AI reviewed Jul 4, 2025

yiliu30 and others added 3 commits July 4, 2025 11:11

Apply suggestion from @Copilot

be66a8c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @gemini-code-assist[bot]

75091ea

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Apply suggestion from @Copilot

e35ba5e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

yiliu30 reviewed Jul 4, 2025

yiliu30 added 7 commits July 4, 2025 11:17

Update vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_mxfp4.py

da52377

Update vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py

390c81c

Update vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py

c7d677b

Update vllm/model_executor/model_loader/base_loader.py

237d90f

Update vllm/model_executor/model_loader/base_loader.py

5e9f838

Update vllm/model_executor/model_loader/base_loader.py

1697933

Update vllm/model_executor/model_loader/base_loader.py

bcafb21

yiliu30 changed the title from "Mxfp4 ziyue" to "Add Initial Support for MXFP4 Linear" Jul 4, 2025

yiliu30 self-requested a review July 4, 2025 03:20

yiliu30 approved these changes Jul 4, 2025

yiliu30 merged commit 0a408c7 into yiliu30:cuda-mxfp8-moe Jul 4, 2025

3 participants