MXFP4 QDQ Support #55
Conversation
Summary of Changes
Hello @Ziyue-Intel, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refines the MXFP4 emulation utilities by introducing a specialized FP4 dequantization function and clarifying the source of quantization/dequantization utilities within vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py. The changes aim to enhance the flexibility and precision of FP4 handling, potentially paving the way for improved performance or compatibility.
Highlights
- New FP4 Dequantization Utility: A new `to_dtype` function has been added to `vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py`. This function is a customized version of `torchao`'s `to_dtype`, specifically designed to use `unpack_fp4_from_uint8` from `compressed_tensors` for FP4 data unpacking, offering an alternative dequantization path within the module.
- Explicit `torchao` Function Calls: Existing calls to `to_mx` and `to_dtype` within the `run_mxfp4_emulations` function have been explicitly qualified as `torchao.prototype.mx_formats.mx_tensor.to_mx` and `torchao.prototype.mx_formats.mx_tensor.to_dtype`, respectively. This clarifies that the current emulation flow continues to use the `torchao` library's implementations for these operations.
- Dependency Updates: New imports from `torchao` and `compressed_tensors` have been added to support the newly introduced `to_dtype` utility and its dependencies, including specific FP4 unpacking functions and constants.
Code Review
The pull request modifies mxfp4_emulation_utils.py by adding a local to_dtype function and related constants. However, the code is not used because the call sites explicitly use the torchao version. This review highlights the presence of dead code and recommends either integrating the new function or removing it for clarity.
```python
def to_dtype(
    data_lp,
    scale_e8m0,
    elem_dtype,
    block_size,
    target_dtype,
    use_fp4_custom_triton_dequant_kernel,
    pack_fp6,
):
    """
    Based on the original implementation in
    torchao.prototype.mx_formats.mx_tensor.to_dtype().
    Modifications:
    - Replaced [torchao.prototype.mx_formats.custom_cast.unpack_uint4()] with
      [compressed_tensors.compressors.quantized_compressors.nvfp4_quantized.unpack_fp4_from_uint8()]
    """
    orig_shape = data_lp.shape
    is_transposed = not data_lp.is_contiguous()
    # if the underlying data is transposed, convert to row major before
    # unpacking and unscaling
    if is_transposed:
        data_lp = data_lp.t()
        assert data_lp.is_contiguous()
        orig_shape = (orig_shape[1], orig_shape[0])

    if elem_dtype == DTYPE_FP4:
        if use_fp4_custom_triton_dequant_kernel:
            data_hp_rescaled = triton_f4_to_scaled_bf16(
                data_lp,
                scale_e8m0,
                block_size,
            )
            if is_transposed:
                data_hp_rescaled = data_hp_rescaled.t()
            return data_hp_rescaled.to(target_dtype)
        else:
            # fp4
            m, n = data_lp.shape
            f4_unpacked = unpack_fp4_from_uint8(data_lp, m, n * 2)
            # f4_unpacked = unpack_uint4(data_lp)
            # for now we only have a cast to f32
            # TODO(future PR): add cast directly to bf16
            # f32 = f4_unpacked_to_f32(f4_unpacked)
            data_hp = f4_unpacked.to(target_dtype)
            # manually adjust shape to account for the unpacking
            # TODO(future PR): clean up the shape code and remove the hack
            # below
            orig_shape = (*orig_shape[:-1], orig_shape[-1] * 2)
    else:
        raise AssertionError("unsupported")

    data_hp = data_hp.reshape(-1, block_size)
    s_fp = get_fp_scale(scale_e8m0).reshape(-1, 1).to(target_dtype)
    data_hp = data_hp * s_fp
    data_hp = data_hp.reshape(orig_shape)

    # if we converted to row-major before unscaling, convert back
    if is_transposed:
        data_hp = data_hp.t()

    return data_hp
```
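For reviewers unfamiliar with the packing scheme: each `uint8` stores two 4-bit E2M1 codes, which is why the unpack step doubles the last dimension (`orig_shape[-1] * 2`). Below is a minimal, dependency-free sketch of that decode step; the helper names and the nibble order are illustrative assumptions, not `compressed_tensors`' actual implementation.

```python
# E2M1 FP4: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# The 8 representable magnitudes, indexed by the low 3 bits of the code.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code (bit 3 = sign, bits 2..0 = magnitude)."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * FP4_E2M1_VALUES[code & 0b0111]

def unpack_fp4_bytes(packed: bytes) -> list[float]:
    """Unpack two FP4 values per byte (low nibble first -- order assumed)."""
    out = []
    for byte in packed:
        out.append(decode_fp4(byte & 0x0F))  # low nibble
        out.append(decode_fp4(byte >> 4))    # high nibble
    return out
```

Note how one input byte always yields two output values, matching the shape adjustment hack flagged in the TODO above.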
The function to_dtype and related constants are introduced but not utilized. The calls on lines 101 and 110 explicitly reference torchao.prototype.mx_formats.mx_tensor.to_mx and torchao.prototype.mx_formats.mx_tensor.to_dtype, respectively, rendering this new code redundant. This introduces unnecessary code complexity and potential maintenance overhead.
If the intention is to replace the torchao implementation, update the calls on lines 101 and 110 to use the local to_dtype function. Otherwise, remove this unused code to maintain code clarity.
Pull Request Overview
Adds a custom to_dtype implementation in mxfp4_emulation_utils.py to support FP4 emulation and updates related imports and function calls.
- Introduces a module-local `to_dtype` that unpacks FP4 data and applies scaling, with an optional Triton kernel path.
- Adds imports for FP4 pack/unpack and custom casting utilities.
- Updates `run_mxfp4_emulations` to call the fully qualified `to_mx` and `to_dtype` from `torchao.prototype.mx_formats.mx_tensor`.
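For intuition on the unpack-then-scale flow summarized above: blocks of `block_size` dequantized FP4 values are each multiplied by a per-block E8M0 scale, where E8M0 stores only a biased power-of-two exponent (bias 127). A plain-Python sketch follows; the names are illustrative, not torchao's API.

```python
E8M0_BIAS = 127

def e8m0_to_float(e: int) -> float:
    """An E8M0 code is a bare biased exponent: scale = 2**(e - 127)."""
    return 2.0 ** (e - E8M0_BIAS)

def dequant_blocks(values: list[float], scales: list[int],
                   block_size: int) -> list[float]:
    """Multiply each contiguous block of `block_size` values by its scale."""
    assert len(values) == len(scales) * block_size
    out = []
    for i, e in enumerate(scales):
        s = e8m0_to_float(e)
        out.extend(v * s for v in values[i * block_size:(i + 1) * block_size])
    return out
```

This mirrors the reshape-to-`(-1, block_size)`, multiply-by-`get_fp_scale(scale_e8m0)`, reshape-back sequence in the function under review.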
Comments suppressed due to low confidence (5)
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py:28
- [nitpick] Naming the local function `to_dtype` shadows the external `to_dtype` import; consider renaming it (e.g., `custom_to_dtype`) to avoid confusion.
def to_dtype(
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py:37
- [nitpick] Expand the docstring for
to_dtypeto describe all parameters (data_lp,elem_dtype,target_dtype, etc.) and the return value for better clarity.
"""
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py:28
- [nitpick] Add unit tests for the new
to_dtypepaths (both the custom Triton dequant kernel branch and the fallback unpack path) to ensure correctness.
def to_dtype(
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py:5
- The import
pack_fp4_to_uint8(and several others likef32_to_f4_unpacked) is unused in this module; consider removing unused imports to keep the code clean.
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import unpack_fp4_from_uint8, pack_fp4_to_uint8
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py:28
- The parameter
pack_fp6in theto_dtypesignature is not used anywhere; either remove it or implement its intended behavior.
def to_dtype(
@Ziyue-Intel Please update the PR title and add an end-to-end example test command in the description.