
Conversation

@yiliu30 yiliu30 commented Jul 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Yi Liu <yiliu4@habana.ai>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the vllm framework by adding full support for NVIDIA FP4 (NVFP4) quantization. It provides the necessary infrastructure for quantizing and dequantizing tensors, and seamlessly integrates this new low-precision format into the model executor's linear and Mixture-of-Experts (MoE) layers. The changes are designed to optimize model inference, especially on Habana Processing Units, by enabling efficient handling and computation with NVFP4 weights.

Highlights

  • New Quantization Scheme: Introduced comprehensive support for NVIDIA FP4 (NVFP4) quantization, enabling models to leverage this low-precision format for improved efficiency.
  • Core Quantization Utilities: Added a new utility file (nvfp4_qdq.py) containing the fundamental logic for NVFP4 quantization and dequantization, including functions for packing/unpacking, scaling, and casting to FP4 (an illustrative sketch of the packing layout follows this list).
  • MoE Layer Integration: Integrated NVFP4 support into the Mixture-of-Experts (MoE) layers, modifying weight processing to unpack NVFP4 weights into FP8 format for efficient execution, particularly on Habana Processing Units (HPUs).
  • Linear Layer Implementation: Updated the compressed_tensors_w4a4_nvfp4 scheme to directly handle NVFP4 linear layer operations, including weight unpacking and the application of quantization-aware matrix multiplications.
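
As an aside (not code from this PR), the packing layout the new utilities work with can be illustrated in a few lines: each uint8 holds two signed E2M1 (FP4) values, one per nibble, and decoding is a sign bit plus a lookup into the eight-entry E2M1 grid. The helper below is a minimal sketch for orientation only; names such as unpack_two_fp4 are illustrative and do not appear in the PR.

import torch

# Positive E2M1 (FP4) magnitude grid used by NVFP4.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def unpack_two_fp4(byte: int) -> tuple[float, float]:
    """Decode the low and high nibbles of one packed uint8 into two FP4 values."""
    def decode(nibble: int) -> float:
        sign = -1.0 if nibble & 0x8 else 1.0          # bit 3 is the sign
        return sign * E2M1_GRID[nibble & 0x7].item()  # bits 0-2 index the grid
    return decode(byte & 0x0F), decode((byte & 0xF0) >> 4)

# Example: 0x2B holds 0xB in the low nibble (-1.5) and 0x2 in the high nibble (1.0).
assert unpack_two_fp4(0x2B) == (-1.5, 1.0)
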
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces nvfp4 support, including a new utility file and modifications to MoE and linear layers. However, there are critical issues in the CUDA path of compressed_tensors_w4a4_nvfp4.py and in the new nvfp4_qdq.py file, which contains module-level side effects and unused/broken code. Addressing these points will greatly improve the robustness and maintainability of the new functionality.

Comment on lines 150 to +192
     def process_weights_after_loading(self, layer) -> None:
+        from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
+            unpacked_nvfp4_to_fp8,
+        )
+
+        logger.debug(
+            f"start processing weights for {getattr(layer, 'prefix', 'unknown')}"
+        )
         global_input_scale = layer.input_global_scale.max().to(torch.float32)
-        layer.input_global_scale = Parameter(global_input_scale,
-                                             requires_grad=False)
+        layer.input_global_scale = Parameter(
+            global_input_scale, requires_grad=False
+        )
 
         layer.weight_global_scale = Parameter(
             layer.weight_global_scale.max().to(torch.float32),
-            requires_grad=False)
-
-        swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
-        layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
-                                                requires_grad=False)
-
-        # required by cutlass kernel; need Parameter, not ModelWeightParameter
-        layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
-
-        if self.cutlass_nvfp4_supported:
-            layer.alpha = Parameter(layer.input_global_scale *
-                                    layer.weight_global_scale,
-                                    requires_grad=False)
+            requires_grad=False,
+        )
+
+        # weight_global_scale_data = layer.weight_global_scale.data
+        weight_scale_data = layer.weight_scale.data
+        weight_packed_data = layer.weight_packed.data
+        weight_unpacked = unpacked_nvfp4_to_fp8(weight_packed_data)
+        layer.weight_scale = torch.nn.Parameter(
+            weight_scale_data, requires_grad=False
+        )
+
+        layer.weight_unpacked = torch.nn.Parameter(
+            weight_unpacked, requires_grad=False
+        )
+        del layer.weight_packed
+        torch.hpu.synchronize()
+
+        # swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
+        # layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
+        #                                         requires_grad=False)
+
+        # # required by cutlass kernel; need Parameter, not ModelWeightParameter
+        # layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
+
+        # if self.cutlass_nvfp4_supported:
+        #     layer.alpha = Parameter(layer.input_global_scale *
+        #                             layer.weight_global_scale,
+        #                             requires_grad=False)


critical

The original CUDA path in process_weights_after_loading has been commented out and replaced with HPU-specific logic, breaking functionality for CUDA devices using the nvfp4 quantization scheme. Preserve the CUDA path and add the new HPU logic within a conditional block, checking for self.cutlass_nvfp4_supported to execute the appropriate logic for each platform.

    def process_weights_after_loading(self, layer) -> None:
        global_input_scale = layer.input_global_scale.max().to(torch.float32)
        layer.input_global_scale = Parameter(
            global_input_scale, requires_grad=False
        )

        layer.weight_global_scale = Parameter(
            layer.weight_global_scale.max().to(torch.float32),
            requires_grad=False,
        )

        if self.cutlass_nvfp4_supported:
            swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
            layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
                                                    requires_grad=False)

            # required by cutlass kernel; need Parameter, not ModelWeightParameter
            layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)

            layer.alpha = Parameter(layer.input_global_scale *
                                    layer.weight_global_scale,
                                    requires_grad=False)
        else:
            from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
                unpacked_nvfp4_to_fp8,
            )

            logger.debug(
                f"start processing weights for {getattr(layer, 'prefix', 'unknown')}"
            )

            weight_scale_data = layer.weight_scale.data
            weight_packed_data = layer.weight_packed.data
            weight_unpacked = unpacked_nvfp4_to_fp8(weight_packed_data)
            layer.weight_scale = torch.nn.Parameter(
                weight_scale_data, requires_grad=False
            )

            layer.weight_unpacked = torch.nn.Parameter(
                weight_unpacked, requires_grad=False
            )
            del layer.weight_packed
            torch.hpu.synchronize()

Comment on lines +6 to +7
os.environ["PT_HPU_LAZY_MODE"] = "1"
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "1"


critical

Setting environment variables at the module level is a side effect that can have unintended consequences across the entire application. These should be set at the application's entry point instead.
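
For illustration, one way to keep these flags out of library modules is a small helper at the launcher (a sketch under assumed names; _configure_hpu_env is not part of this PR), using setdefault so values already exported by the user are not overridden:

import os

def _configure_hpu_env() -> None:
    # Respect values the user has already exported; only fill in defaults.
    os.environ.setdefault("PT_HPU_LAZY_MODE", "1")
    os.environ.setdefault("PT_HPU_ENABLE_LAZY_COLLECTIVES", "1")

if __name__ == "__main__":
    _configure_hpu_env()  # must run before modules that read these flags are imported
    # ... import and launch the application here
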

Comment on lines +24 to +26
import os

os.environ["PT_HPU_LAZY_MODE"] = "1"


high

This file has a duplicate os.environ["PT_HPU_LAZY_MODE"] = "1" setting and a duplicate import os. Clean these up to avoid redundancy.

Comment on lines +51 to +135
def unpack_fp4_from_uint8(
    a: torch.Tensor,
    m: int,
    n: int,
    dtype: Optional[torch.dtype] = torch.bfloat16,
) -> torch.Tensor:
    """
    Unpacks uint8 values into fp4. Each uint8 consists of two fp4 values
    (i.e. first four bits correspond to one fp4 value, last four corresond to a consecutive
    fp4 value). The bits represent an index, which are mapped to an fp4 value.
    :param a: tensor to unpack
    :param m: original dim 0 size of the unpacked tensor
    :param n: original dim 1 size of the unpacked tensor
    :param dtype: dense dtype to cast the unpacked tensor to
    """
    assert a.dtype == torch.uint8, f"Ex got{a.dtype}"

    # Vectorized nibble processing
    a_flat = a.flatten()
    high = (a_flat & 0xF0) >> 4  # Upper nibbles
    low = a_flat & 0x0F  # Lower nibbles

    # Combine nibbles for batch processing
    combined = torch.stack((low, high), dim=1).flatten()

    # Vectorized sign and magnitude extraction
    signs = (combined & 0x08).to(torch.bool)  # Sign bits
    abs_vals = (combined & 0x07).to(torch.long)  # Magnitude indices

    # Device-aware lookup and sign application
    kE2M1 = kE2M1ToFloat.to(device=a.device)
    values = kE2M1[abs_vals] * torch.where(signs, -1.0, 1.0)
    # breakpoint()
    # Reshape to final form
    return values.reshape(m, n).to(dtype=dtype)


# >>>>>>>>>>>>>>>>>>

SBITS, EBITS_F32, MBITS_F32 = 1, 8, 23
EBITS_BF16, MBITS_BF16 = 8, 7
EBITS_F4_E2M1, MBITS_F4_E2M1 = 2, 1
EBITS_F6_E2M3, MBITS_F6_E2M3 = 2, 3
EBITS_F6_E3M2, MBITS_F6_E3M2 = 3, 2
EBITS_F8_E4M3, MBITS_F8_E4M3 = 4, 3
EBITS_F8_E5M2, MBITS_F8_E5M2 = 5, 2
from torchao.prototype.mx_formats.mx_tensor import to_dtype as ao_to_dtype

import os

_USE_CT_UNPACK = os.getenv("USE_CT_UNPACK", "0").lower() in ("1", "true", "yes")

from enum import Enum, auto
from typing import Callable, Dict, Union

import torch
import os
from torchao.prototype.mx_formats.constants import (
    F4_E2M1_MAX,
    F8E4M3_MAX,
)


def per_tensor_amax_to_scale(amax: torch.Tensor) -> torch.Tensor:
    """Convert per-tensor amax to per-tensor scale.
    Used to scale fp32 scales down to fp8 scales
    Args:
        amax: Per-tensor amax tensor
    Returns:
        torch.Tensor: Per-tensor scale tensor
    """
    return torch.clamp(amax / F8E4M3_MAX, min=E4M3_EPS, max=F8E4M3_MAX).to(
        torch.float32
    )


E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny
from compressed_tensors.quantization.quant_args import FP4_E2M1_DATA
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import (
    pack_fp4_to_uint8,
    unpack_fp4_from_uint8,
)


high

This module contains several issues related to imports and unused code:

  1. The function unpack_fp4_from_uint8 is defined locally (line 51) but is then re-imported from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized (line 134), shadowing the local definition. If the imported version is the one intended for use, the local definition should be removed.
  2. There are many unused imports, such as FP4_E2M1_DATA, ao_to_dtype, dequant_mx_fp8, quant_mx_fp8, etc. Remove these to clean up the code (a possible trimmed import header is sketched after this list).
  3. _USE_CT_UNPACK is defined but never used.
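
For reference, a possible trimmed module header, assuming the compressed-tensors pack/unpack implementations are the ones kept (that choice is the author's; this is only a sketch built from the imports already present in the file):

from typing import Optional

import torch
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import (
    pack_fp4_to_uint8,
    unpack_fp4_from_uint8,
)
from torchao.prototype.mx_formats.constants import F8E4M3_MAX

# Smallest positive normal value of float8_e4m3fn, used when clamping scales.
E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny
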

Comment on lines +416 to +539
def test_qdq(args):
    hidden_size = args.hidden_size
    intermediate_size = args.intermediate_size
    cpu_tensor = torch.randn(
        hidden_size, intermediate_size, dtype=torch.bfloat16
    )
    hpu_tensor = cpu_tensor.to("hpu")
    # scale_e8m0_biased, quant_tensor = quant_mx_fp8(hpu_tensor)
    # dequant_tensor = dequant_mx_fp8(quant_tensor, scale_e8m0_biased, block_size=32)
    # print(f"quant_tensor shape: {quant_tensor.shape}, dtype: {quant_tensor.dtype}")
    dequant_tensor = quant_dequant_mxfp4(hpu_tensor)

    diff = torch.abs(hpu_tensor - dequant_tensor)
    print(f"diff: max: {diff.max()}, min: {diff.min()}, mean: {diff.mean()}")
    ht.hpu.synchronize()


def time_fn(func, times, warmup=50):
    torch.hpu.synchronize()
    for _ in range(warmup):
        func()
    torch.hpu.synchronize()
    gc.collect()

    start = time.time()
    for _ in range(times):
        func()
    torch.hpu.synchronize()
    end = time.time()
    return (end - start) / times


def test_linear(args):
    hidden_size = args.hidden_size
    intermediate_size = args.intermediate_size
    batch_size = args.bs
    cpu_tensor = torch.randn(batch_size, hidden_size, dtype=torch.bfloat16)
    hpu_tensor = cpu_tensor.to("hpu")
    linear = torch.nn.Linear(
        in_features=hidden_size, out_features=intermediate_size, bias=False
    ).to("hpu")
    nvfp4_linear = NVFP4Linear.from_linear(linear)
    nvfp4_linear = nvfp4_linear.to("hpu")
    nvfp4_unpacked_linear = NVFP4LinearUnpacked.from_linear(linear)
    nvfp4_unpacked_linear = nvfp4_unpacked_linear.to("hpu")

    linear = wrap_in_hpu_graph(linear)
    nvfp4_linear = wrap_in_hpu_graph(nvfp4_linear)
    nvfp4_unpacked_linear = wrap_in_hpu_graph(nvfp4_unpacked_linear)

    out_ref = linear(hpu_tensor)
    out1 = nvfp4_linear(hpu_tensor)
    out2 = nvfp4_unpacked_linear(hpu_tensor)
    print(
        f"diff ref vs nvfp4: max: {torch.abs(out_ref - out1).max()}, min: {torch.abs(out_ref - out1).min()}, mean: {torch.abs(out_ref - out1).mean()}"
    )
    print(
        f"diff ref vs nvfp4 unpacked: max: {torch.abs(out_ref - out2).max()}, min: {torch.abs(out_ref - out2).min()}, mean: {torch.abs(out_ref - out2).mean()}"
    )
    print(
        f"diff nvfp4 vs nvfp4 unpacked: max: {torch.abs(out1 - out2).max()}, min: {torch.abs(out1 - out2).min()}, mean: {torch.abs(out1 - out2).mean()}"
    )
    torch.hpu.synchronize()
    latency0 = time_fn(lambda: linear(hpu_tensor), args.bench_steps)
    latency1 = time_fn(lambda: nvfp4_linear(hpu_tensor), args.bench_steps)
    latency2 = time_fn(
        lambda: nvfp4_unpacked_linear(hpu_tensor), args.bench_steps
    )
    print(
        f"latency linear: {latency0:.6f} s, nvfp4_linear: {latency1:.6f} s, nvfp4_unpacked_linear: {latency2:.6f} s"
    )

    print(
        f"speed up nvfp4_linear: {latency0 / latency1:.2f}x, nvfp4_unpacked_linear: {latency0 / latency2:.2f}x"
    )
    print(
        f"speed up nvfp4_unpacked_linear vs nvfp4_linear: {latency1 / latency2:.2f}x"
    )
    exit()
    profile_steps = args.profile_steps
    warmup_steps = args.warmup_steps
    if profile_steps > 0:
        activities = [
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.HPU,
        ]
        schedule = torch.profiler.schedule(
            wait=0, warmup=warmup_steps, active=profile_steps, repeat=1
        )
        print(f"Profiling steps {profile_steps} with warmup {warmup_steps}")
        with torch.profiler.profile(
            activities=activities,
            schedule=schedule,
            on_trace_ready=torch.profiler.tensorboard_trace_handler(
                f"./profile_mxfp4/"
            ),
            record_shapes=True,
            with_stack=True,
        ) as profiler:
            for i in range(warmup_steps + profile_steps):
                out_ref = linear(hpu_tensor)
                out1 = nvfp4_linear(hpu_tensor)
                out2 = nvfp4_unpacked_linear(hpu_tensor)
                # result2 = dist.all_reduce(result2, op=dist.ReduceOp.SUM)
                ht.hpu.synchronize()
                profiler.step()
            profiler.stop()
    ht.hpu.synchronize()


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--bs", default=32, type=int)
#     parser.add_argument("--hidden_size", "-H", default=7168, type=int)
#     parser.add_argument("--intermediate_size", "-I", default=2048, type=int)
#     parser.add_argument("--num_total_experts", "-E", default=256, type=int)
#     parser.add_argument("--ep_size", "-P", default=8, type=int)
#     parser.add_argument("--topk", "-K", default=8, type=int)
#     parser.add_argument("--warmup_steps", default=5, type=int)
#     parser.add_argument("--bench_steps", "-S", default=1000, type=int)
#     parser.add_argument("--profile_steps", default=0, type=int)
#     args = parser.parse_args()

#     test_linear(args)


high

The code inside the if __name__ == "__main__": block appears to be for testing or experimentation and is currently commented out. It also references an undefined function quant_dequant_mxfp4 and contains an exit() call. Move this block to a dedicated test file under the tests/ directory or remove it entirely if it's just a scratchpad.
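
As an illustration only, the scratch code could become a small test module, e.g. tests/quantization/test_nvfp4_qdq.py. The path, the tolerance, the availability of an HPU device, and the assumption that quant_dequant_mxfp4 is exported from nvfp4_qdq are all hypothetical; the review above notes that function is currently undefined in the module.

import pytest
import torch


@pytest.mark.parametrize("hidden_size,intermediate_size", [(128, 256)])
def test_qdq_roundtrip(hidden_size, intermediate_size):
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        quant_dequant_mxfp4,  # assumed helper; see the review note above
    )

    x = torch.randn(hidden_size, intermediate_size, dtype=torch.bfloat16).to("hpu")
    x_qdq = quant_dequant_mxfp4(x)

    # FP4 quantization error is large by construction, so only a loose bound is checked.
    assert torch.abs(x.float() - x_qdq.float()).mean() < 0.5
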

model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-V2-Lite-MXFP4-autoround"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor/"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"
# model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"


medium

This commented-out line is a leftover from local testing or debugging. It's best to remove such artifacts to maintain code clarity.

Comment on lines +90 to +114
def nvfp4_unpacked_weight_gemm(
    x, weight_unpacked, weight_scale, weight_global_scale
):
    # return self.run_nvfp4_emulations(x, layer)
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        unpacked_nvfp4_to_fp8,
        dequant_nvfp4,
        qdq_nvfp4,
    )

    # bs, seq_len, hidden_size = x.shape
    # x = x.reshape(bs * seq_len, hidden_size)
    hp_weight = dequant_nvfp4(
        data_lp=weight_unpacked,
        out_scales=weight_scale,
        per_tensor_scale=weight_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )

    # breakpoint()
    x = qdq_nvfp4(x)
    out = x @ hp_weight.t()
    # out = out.reshape(bs, seq_len, -1)
    return out


medium

This new function contains a few items that should be cleaned up:

  1. The from vllm.model_executor... import is inside the function. While sometimes necessary, it's generally better to have imports at the top of the file for readability.
  2. unpacked_nvfp4_to_fp8 is imported but not used within this function and should be removed.
  3. There are several commented-out lines (e.g., # breakpoint(), reshaping logic) that should be removed.
  4. A docstring explaining the function's purpose, parameters, and return values would improve maintainability.
def nvfp4_unpacked_weight_gemm(
    x, weight_unpacked, weight_scale, weight_global_scale
):
    """Performs GEMM for nvfp4 using unpacked weights.

    This is an emulation path for HPU.
    """
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        dequant_nvfp4,
        qdq_nvfp4,
    )

    hp_weight = dequant_nvfp4(
        data_lp=weight_unpacked,
        out_scales=weight_scale,
        per_tensor_scale=weight_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )

    x = qdq_nvfp4(x)
    out = x @ hp_weight.t()
    return out

Comment on lines +248 to +258
            w13_weight_unpacked_lst = []
            w2_weight_unpacked_lst = []
            for expert_index in range(num_experts):
                w13_weight_unpacked = unpacked_nvfp4_to_fp8(
                    w13_weight_packed[expert_index]
                )
                w13_weight_unpacked_lst.append(w13_weight_unpacked)
                w2_weight_unpacked = unpacked_nvfp4_to_fp8(
                    layer.w2_weight_packed.data[expert_index]
                )
                w2_weight_unpacked_lst.append(w2_weight_unpacked)


medium

The loops for creating w13_weight_unpacked_lst and w2_weight_unpacked_lst can be made more concise using list comprehensions to improve readability.

            w13_weight_unpacked_lst = [
                unpacked_nvfp4_to_fp8(w13_weight_packed[expert_index])
                for expert_index in range(num_experts)
            ]
            w2_weight_unpacked_lst = [
                unpacked_nvfp4_to_fp8(layer.w2_weight_packed.data[expert_index])
                for expert_index in range(num_experts)
            ]

packed=False,
)

# breakpoint()


medium

This commented-out breakpoint() seems to be a debugging artifact and should be removed.

Comment on lines +162 to +169
# x[(x >= 0.0) & (x <= 0.25)] = 0.0
# x[(x > 0.25) & (x < 0.75)] = 0.5
# x[(x >= 0.75) & (x <= 1.25)] = 1.0
# x[(x > 1.25) & (x < 1.75)] = 1.5
# x[(x >= 1.75) & (x <= 2.5)] = 2.0
# x[(x > 2.5) & (x < 3.5)] = 3.0
# x[(x >= 3.5) & (x <= 5.0)] = 4.0
# x[x > 5.0] = 6.0


medium

The cast_to_fp4 function contains a block of commented-out code that should be removed to improve clarity.
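
For context, the commented-out thresholds encode round-to-nearest onto the positive E2M1 grid. If that behavior ever needs to be kept, the same table can be expressed without the mask chain; the sketch below is an illustration, not code from this PR, and its behavior at the exact half-way points may differ from the inclusive/exclusive pattern in the comments.

import torch

# Positive E2M1 grid and the midpoints between adjacent grid values.
_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
_E2M1_MIDPOINTS = torch.tensor([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0])

def round_abs_to_e2m1(x_abs: torch.Tensor) -> torch.Tensor:
    """Map non-negative magnitudes to the nearest positive E2M1 value."""
    idx = torch.bucketize(x_abs, _E2M1_MIDPOINTS.to(x_abs.device))
    return _E2M1_VALUES.to(x_abs.device)[idx]
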

@yiliu30 yiliu30 marked this pull request as ready for review July 10, 2025 06:40
@yiliu30 yiliu30 requested a review from xuechendi as a code owner July 10, 2025 06:40
Signed-off-by: Yi Liu <yiliu4@habana.ai>
@yiliu30 yiliu30 merged commit 449515a into hpu-mxfp8-moe Jul 16, 2025