
Conversation

@yiliu30 yiliu30 commented Jul 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Yi Liu <yiliu4@habana.ai>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the vllm framework by adding full support for NVIDIA FP4 (NVFP4) quantization. It provides the necessary infrastructure for quantizing and dequantizing tensors, and seamlessly integrates this new low-precision format into the model executor's linear and Mixture-of-Experts (MoE) layers. The changes are designed to optimize model inference, especially on Habana Processing Units, by enabling efficient handling and computation with NVFP4 weights.

Highlights

  • New Quantization Scheme: Introduced comprehensive support for NVIDIA FP4 (NVFP4) quantization, enabling models to leverage this low-precision format for improved efficiency.
  • Core Quantization Utilities: Added a new utility file (nvfp4_qdq.py) containing the fundamental logic for NVFP4 quantization and dequantization, including functions for packing/unpacking, scaling, and casting to FP4 (an illustrative sketch of the packing layout follows this list).
  • MoE Layer Integration: Integrated NVFP4 support into the Mixture-of-Experts (MoE) layers, modifying weight processing to unpack NVFP4 weights into FP8 format for efficient execution, particularly on Habana Processing Units (HPUs).
  • Linear Layer Implementation: Updated the compressed_tensors_w4a4_nvfp4 scheme to directly handle NVFP4 linear layer operations, including weight unpacking and the application of quantization-aware matrix multiplications.
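
As an aside (not code from this PR), the packing layout the new utilities work with can be illustrated in a few lines: each uint8 holds two signed E2M1 (FP4) values, one per nibble, and decoding is a sign bit plus a lookup into the eight-entry E2M1 grid. The helper below is a minimal sketch for orientation only; names such as unpack_two_fp4 are illustrative and do not appear in the PR.

import torch

# Positive E2M1 (FP4) magnitude grid used by NVFP4.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def unpack_two_fp4(byte: int) -> tuple[float, float]:
    """Decode the low and high nibbles of one packed uint8 into two FP4 values."""
    def decode(nibble: int) -> float:
        sign = -1.0 if nibble & 0x8 else 1.0          # bit 3 is the sign
        return sign * E2M1_GRID[nibble & 0x7].item()  # bits 0-2 index the grid
    return decode(byte & 0x0F), decode((byte & 0xF0) >> 4)

# Example: 0x2B holds 0xB in the low nibble (-1.5) and 0x2 in the high nibble (1.0).
assert unpack_two_fp4(0x2B) == (-1.5, 1.0)
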
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces nvfp4 support, including a new utility file and modifications to MoE and linear layers. However, there are critical issues in the CUDA path of compressed_tensors_w4a4_nvfp4.py and in the new nvfp4_qdq.py file, which contains module-level side effects and unused/broken code. Addressing these points will greatly improve the robustness and maintainability of the new functionality.

Comment on lines 150 to +192
     def process_weights_after_loading(self, layer) -> None:
+        from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
+            unpacked_nvfp4_to_fp8,
+        )
+
+        logger.debug(
+            f"start processing weights for {getattr(layer, 'prefix', 'unknown')}"
+        )
         global_input_scale = layer.input_global_scale.max().to(torch.float32)
-        layer.input_global_scale = Parameter(global_input_scale,
-                                             requires_grad=False)
+        layer.input_global_scale = Parameter(
+            global_input_scale, requires_grad=False
+        )
 
         layer.weight_global_scale = Parameter(
             layer.weight_global_scale.max().to(torch.float32),
-            requires_grad=False)
-
-        swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
-        layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
-                                                requires_grad=False)
-
-        # required by cutlass kernel; need Parameter, not ModelWeightParameter
-        layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
-
-        if self.cutlass_nvfp4_supported:
-            layer.alpha = Parameter(layer.input_global_scale *
-                                    layer.weight_global_scale,
-                                    requires_grad=False)
+            requires_grad=False,
+        )
+
+        # weight_global_scale_data = layer.weight_global_scale.data
+        weight_scale_data = layer.weight_scale.data
+        weight_packed_data = layer.weight_packed.data
+        weight_unpacked = unpacked_nvfp4_to_fp8(weight_packed_data)
+        layer.weight_scale = torch.nn.Parameter(
+            weight_scale_data, requires_grad=False
+        )
+
+        layer.weight_unpacked = torch.nn.Parameter(
+            weight_unpacked, requires_grad=False
+        )
+        del layer.weight_packed
+        torch.hpu.synchronize()
+
+        # swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
+        # layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
+        #                                         requires_grad=False)
+
+        # # required by cutlass kernel; need Parameter, not ModelWeightParameter
+        # layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
+
+        # if self.cutlass_nvfp4_supported:
+        #     layer.alpha = Parameter(layer.input_global_scale *
+        #                             layer.weight_global_scale,
+        #                             requires_grad=False)


critical

The original CUDA path in process_weights_after_loading has been commented out and replaced with HPU-specific logic, breaking functionality for CUDA devices using the nvfp4 quantization scheme. Preserve the CUDA path and add the new HPU logic within a conditional block, checking for self.cutlass_nvfp4_supported to execute the appropriate logic for each platform.

    def process_weights_after_loading(self, layer) -> None:
        global_input_scale = layer.input_global_scale.max().to(torch.float32)
        layer.input_global_scale = Parameter(
            global_input_scale, requires_grad=False
        )

        layer.weight_global_scale = Parameter(
            layer.weight_global_scale.max().to(torch.float32),
            requires_grad=False,
        )

        if self.cutlass_nvfp4_supported:
            swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
            layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
                                                    requires_grad=False)

            # required by cutlass kernel; need Parameter, not ModelWeightParameter
            layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)

            layer.alpha = Parameter(layer.input_global_scale *
                                    layer.weight_global_scale,
                                    requires_grad=False)
        else:
            from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
                unpacked_nvfp4_to_fp8,
            )

            logger.debug(
                f"start processing weights for {getattr(layer, 'prefix', 'unknown')}"
            )

            weight_scale_data = layer.weight_scale.data
            weight_packed_data = layer.weight_packed.data
            weight_unpacked = unpacked_nvfp4_to_fp8(weight_packed_data)
            layer.weight_scale = torch.nn.Parameter(
                weight_scale_data, requires_grad=False
            )

            layer.weight_unpacked = torch.nn.Parameter(
                weight_unpacked, requires_grad=False
            )
            del layer.weight_packed
            torch.hpu.synchronize()

Comment on lines +6 to +7
os.environ["PT_HPU_LAZY_MODE"] = "1"
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "1"


critical

Setting environment variables at the module level is a side effect that can have unintended consequences across the entire application. These should be set at the application's entry point instead.
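
For illustration, one way to keep these flags out of library modules is a small helper at the launcher (a sketch under assumed names; _configure_hpu_env is not part of this PR), using setdefault so values already exported by the user are not overridden:

import os

def _configure_hpu_env() -> None:
    # Respect values the user has already exported; only fill in defaults.
    os.environ.setdefault("PT_HPU_LAZY_MODE", "1")
    os.environ.setdefault("PT_HPU_ENABLE_LAZY_COLLECTIVES", "1")

if __name__ == "__main__":
    _configure_hpu_env()  # must run before modules that read these flags are imported
    # ... import and launch the application here
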

Comment on lines +24 to +26
import os

os.environ["PT_HPU_LAZY_MODE"] = "1"


high

This file has a duplicate os.environ["PT_HPU_LAZY_MODE"] = "1" setting and a duplicate import os. Clean these up to avoid redundancy.

Comment on lines +51 to +135
def unpack_fp4_from_uint8(
    a: torch.Tensor,
    m: int,
    n: int,
    dtype: Optional[torch.dtype] = torch.bfloat16,
) -> torch.Tensor:
    """
    Unpacks uint8 values into fp4. Each uint8 consists of two fp4 values
    (i.e. first four bits correspond to one fp4 value, last four corresond to a consecutive
    fp4 value). The bits represent an index, which are mapped to an fp4 value.
    :param a: tensor to unpack
    :param m: original dim 0 size of the unpacked tensor
    :param n: original dim 1 size of the unpacked tensor
    :param dtype: dense dtype to cast the unpacked tensor to
    """
    assert a.dtype == torch.uint8, f"Ex got{a.dtype}"

    # Vectorized nibble processing
    a_flat = a.flatten()
    high = (a_flat & 0xF0) >> 4  # Upper nibbles
    low = a_flat & 0x0F  # Lower nibbles

    # Combine nibbles for batch processing
    combined = torch.stack((low, high), dim=1).flatten()

    # Vectorized sign and magnitude extraction
    signs = (combined & 0x08).to(torch.bool)  # Sign bits
    abs_vals = (combined & 0x07).to(torch.long)  # Magnitude indices

    # Device-aware lookup and sign application
    kE2M1 = kE2M1ToFloat.to(device=a.device)
    values = kE2M1[abs_vals] * torch.where(signs, -1.0, 1.0)
    # breakpoint()
    # Reshape to final form
    return values.reshape(m, n).to(dtype=dtype)


# >>>>>>>>>>>>>>>>>>

SBITS, EBITS_F32, MBITS_F32 = 1, 8, 23
EBITS_BF16, MBITS_BF16 = 8, 7
EBITS_F4_E2M1, MBITS_F4_E2M1 = 2, 1
EBITS_F6_E2M3, MBITS_F6_E2M3 = 2, 3
EBITS_F6_E3M2, MBITS_F6_E3M2 = 3, 2
EBITS_F8_E4M3, MBITS_F8_E4M3 = 4, 3
EBITS_F8_E5M2, MBITS_F8_E5M2 = 5, 2
from torchao.prototype.mx_formats.mx_tensor import to_dtype as ao_to_dtype

import os

_USE_CT_UNPACK = os.getenv("USE_CT_UNPACK", "0").lower() in ("1", "true", "yes")

from enum import Enum, auto
from typing import Callable, Dict, Union

import torch
import os
from torchao.prototype.mx_formats.constants import (
    F4_E2M1_MAX,
    F8E4M3_MAX,
)


def per_tensor_amax_to_scale(amax: torch.Tensor) -> torch.Tensor:
    """Convert per-tensor amax to per-tensor scale.
    Used to scale fp32 scales down to fp8 scales
    Args:
        amax: Per-tensor amax tensor
    Returns:
        torch.Tensor: Per-tensor scale tensor
    """
    return torch.clamp(amax / F8E4M3_MAX, min=E4M3_EPS, max=F8E4M3_MAX).to(
        torch.float32
    )


E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny
from compressed_tensors.quantization.quant_args import FP4_E2M1_DATA
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import (
    pack_fp4_to_uint8,
    unpack_fp4_from_uint8,
)


high

This module contains several issues related to imports and unused code:

  1. The function unpack_fp4_from_uint8 is defined locally (line 51) but is then re-imported from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized (line 134), shadowing the local definition. If the imported version is the one intended for use, the local definition should be removed.
  2. There are many unused imports, such as FP4_E2M1_DATA, ao_to_dtype, dequant_mx_fp8, quant_mx_fp8, etc. Remove these to clean up the code (a possible trimmed import header is sketched after this list).
  3. _USE_CT_UNPACK is defined but never used.
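
For reference, a possible trimmed module header, assuming the compressed-tensors pack/unpack implementations are the ones kept (that choice is the author's; this is only a sketch built from the imports already present in the file):

from typing import Optional

import torch
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import (
    pack_fp4_to_uint8,
    unpack_fp4_from_uint8,
)
from torchao.prototype.mx_formats.constants import F8E4M3_MAX

# Smallest positive normal value of float8_e4m3fn, used when clamping scales.
E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny
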

Comment on lines +416 to +539
def test_qdq(args):
    hidden_size = args.hidden_size
    intermediate_size = args.intermediate_size
    cpu_tensor = torch.randn(
        hidden_size, intermediate_size, dtype=torch.bfloat16
    )
    hpu_tensor = cpu_tensor.to("hpu")
    # scale_e8m0_biased, quant_tensor = quant_mx_fp8(hpu_tensor)
    # dequant_tensor = dequant_mx_fp8(quant_tensor, scale_e8m0_biased, block_size=32)
    # print(f"quant_tensor shape: {quant_tensor.shape}, dtype: {quant_tensor.dtype}")
    dequant_tensor = quant_dequant_mxfp4(hpu_tensor)

    diff = torch.abs(hpu_tensor - dequant_tensor)
    print(f"diff: max: {diff.max()}, min: {diff.min()}, mean: {diff.mean()}")
    ht.hpu.synchronize()


def time_fn(func, times, warmup=50):
    torch.hpu.synchronize()
    for _ in range(warmup):
        func()
    torch.hpu.synchronize()
    gc.collect()

    start = time.time()
    for _ in range(times):
        func()
    torch.hpu.synchronize()
    end = time.time()
    return (end - start) / times


def test_linear(args):
    hidden_size = args.hidden_size
    intermediate_size = args.intermediate_size
    batch_size = args.bs
    cpu_tensor = torch.randn(batch_size, hidden_size, dtype=torch.bfloat16)
    hpu_tensor = cpu_tensor.to("hpu")
    linear = torch.nn.Linear(
        in_features=hidden_size, out_features=intermediate_size, bias=False
    ).to("hpu")
    nvfp4_linear = NVFP4Linear.from_linear(linear)
    nvfp4_linear = nvfp4_linear.to("hpu")
    nvfp4_unpacked_linear = NVFP4LinearUnpacked.from_linear(linear)
    nvfp4_unpacked_linear = nvfp4_unpacked_linear.to("hpu")

    linear = wrap_in_hpu_graph(linear)
    nvfp4_linear = wrap_in_hpu_graph(nvfp4_linear)
    nvfp4_unpacked_linear = wrap_in_hpu_graph(nvfp4_unpacked_linear)

    out_ref = linear(hpu_tensor)
    out1 = nvfp4_linear(hpu_tensor)
    out2 = nvfp4_unpacked_linear(hpu_tensor)
    print(
        f"diff ref vs nvfp4: max: {torch.abs(out_ref - out1).max()}, min: {torch.abs(out_ref - out1).min()}, mean: {torch.abs(out_ref - out1).mean()}"
    )
    print(
        f"diff ref vs nvfp4 unpacked: max: {torch.abs(out_ref - out2).max()}, min: {torch.abs(out_ref - out2).min()}, mean: {torch.abs(out_ref - out2).mean()}"
    )
    print(
        f"diff nvfp4 vs nvfp4 unpacked: max: {torch.abs(out1 - out2).max()}, min: {torch.abs(out1 - out2).min()}, mean: {torch.abs(out1 - out2).mean()}"
    )
    torch.hpu.synchronize()
    latency0 = time_fn(lambda: linear(hpu_tensor), args.bench_steps)
    latency1 = time_fn(lambda: nvfp4_linear(hpu_tensor), args.bench_steps)
    latency2 = time_fn(
        lambda: nvfp4_unpacked_linear(hpu_tensor), args.bench_steps
    )
    print(
        f"latency linear: {latency0:.6f} s, nvfp4_linear: {latency1:.6f} s, nvfp4_unpacked_linear: {latency2:.6f} s"
    )

    print(
        f"speed up nvfp4_linear: {latency0 / latency1:.2f}x, nvfp4_unpacked_linear: {latency0 / latency2:.2f}x"
    )
    print(
        f"speed up nvfp4_unpacked_linear vs nvfp4_linear: {latency1 / latency2:.2f}x"
    )
    exit()
    profile_steps = args.profile_steps
    warmup_steps = args.warmup_steps
    if profile_steps > 0:
        activities = [
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.HPU,
        ]
        schedule = torch.profiler.schedule(
            wait=0, warmup=warmup_steps, active=profile_steps, repeat=1
        )
        print(f"Profiling steps {profile_steps} with warmup {warmup_steps}")
        with torch.profiler.profile(
            activities=activities,
            schedule=schedule,
            on_trace_ready=torch.profiler.tensorboard_trace_handler(
                f"./profile_mxfp4/"
            ),
            record_shapes=True,
            with_stack=True,
        ) as profiler:
            for i in range(warmup_steps + profile_steps):
                out_ref = linear(hpu_tensor)
                out1 = nvfp4_linear(hpu_tensor)
                out2 = nvfp4_unpacked_linear(hpu_tensor)
                # result2 = dist.all_reduce(result2, op=dist.ReduceOp.SUM)
                ht.hpu.synchronize()
                profiler.step()
            profiler.stop()
    ht.hpu.synchronize()


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--bs", default=32, type=int)
#     parser.add_argument("--hidden_size", "-H", default=7168, type=int)
#     parser.add_argument("--intermediate_size", "-I", default=2048, type=int)
#     parser.add_argument("--num_total_experts", "-E", default=256, type=int)
#     parser.add_argument("--ep_size", "-P", default=8, type=int)
#     parser.add_argument("--topk", "-K", default=8, type=int)
#     parser.add_argument("--warmup_steps", default=5, type=int)
#     parser.add_argument("--bench_steps", "-S", default=1000, type=int)
#     parser.add_argument("--profile_steps", default=0, type=int)
#     args = parser.parse_args()

#     test_linear(args)


high

The code inside the if __name__ == "__main__": block appears to be for testing or experimentation and is currently commented out. It also references an undefined function quant_dequant_mxfp4 and contains an exit() call. Move this block to a dedicated test file under the tests/ directory or remove it entirely if it's just a scratchpad.
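
As an illustration only, the scratch code could become a small test module, e.g. tests/quantization/test_nvfp4_qdq.py. The path, the tolerance, the availability of an HPU device, and the assumption that quant_dequant_mxfp4 is exported from nvfp4_qdq are all hypothetical; the review above notes that function is currently undefined in the module.

import pytest
import torch


@pytest.mark.parametrize("hidden_size,intermediate_size", [(128, 256)])
def test_qdq_roundtrip(hidden_size, intermediate_size):
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        quant_dequant_mxfp4,  # assumed helper; see the review note above
    )

    x = torch.randn(hidden_size, intermediate_size, dtype=torch.bfloat16).to("hpu")
    x_qdq = quant_dequant_mxfp4(x)

    # FP4 quantization error is large by construction, so only a loose bound is checked.
    assert torch.abs(x.float() - x_qdq.float()).mean() < 0.5
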

model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-V2-Lite-MXFP4-autoround"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor/"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"
# model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"


medium

This commented-out line is a leftover from local testing or debugging. It's best to remove such artifacts to maintain code clarity.

Comment on lines +90 to +114
def nvfp4_unpacked_weight_gemm(
    x, weight_unpacked, weight_scale, weight_global_scale
):
    # return self.run_nvfp4_emulations(x, layer)
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        unpacked_nvfp4_to_fp8,
        dequant_nvfp4,
        qdq_nvfp4,
    )

    # bs, seq_len, hidden_size = x.shape
    # x = x.reshape(bs * seq_len, hidden_size)
    hp_weight = dequant_nvfp4(
        data_lp=weight_unpacked,
        out_scales=weight_scale,
        per_tensor_scale=weight_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )

    # breakpoint()
    x = qdq_nvfp4(x)
    out = x @ hp_weight.t()
    # out = out.reshape(bs, seq_len, -1)
    return out


medium

This new function contains a few items that should be cleaned up:

  1. The from vllm.model_executor... import is inside the function. While sometimes necessary, it's generally better to have imports at the top of the file for readability.
  2. unpacked_nvfp4_to_fp8 is imported but not used within this function and should be removed.
  3. There are several commented-out lines (e.g., # breakpoint(), reshaping logic) that should be removed.
  4. A docstring explaining the function's purpose, parameters, and return values would improve maintainability.
def nvfp4_unpacked_weight_gemm(
    x, weight_unpacked, weight_scale, weight_global_scale
):
    """Performs GEMM for nvfp4 using unpacked weights.

    This is an emulation path for HPU.
    """
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        dequant_nvfp4,
        qdq_nvfp4,
    )

    hp_weight = dequant_nvfp4(
        data_lp=weight_unpacked,
        out_scales=weight_scale,
        per_tensor_scale=weight_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )

    x = qdq_nvfp4(x)
    out = x @ hp_weight.t()
    return out

Comment on lines +248 to +258
            w13_weight_unpacked_lst = []
            w2_weight_unpacked_lst = []
            for expert_index in range(num_experts):
                w13_weight_unpacked = unpacked_nvfp4_to_fp8(
                    w13_weight_packed[expert_index]
                )
                w13_weight_unpacked_lst.append(w13_weight_unpacked)
                w2_weight_unpacked = unpacked_nvfp4_to_fp8(
                    layer.w2_weight_packed.data[expert_index]
                )
                w2_weight_unpacked_lst.append(w2_weight_unpacked)


medium

The loops for creating w13_weight_unpacked_lst and w2_weight_unpacked_lst can be made more concise using list comprehensions to improve readability.

            w13_weight_unpacked_lst = [
                unpacked_nvfp4_to_fp8(w13_weight_packed[expert_index])
                for expert_index in range(num_experts)
            ]
            w2_weight_unpacked_lst = [
                unpacked_nvfp4_to_fp8(layer.w2_weight_packed.data[expert_index])
                for expert_index in range(num_experts)
            ]

packed=False,
)

# breakpoint()


medium

This commented-out breakpoint() seems to be a debugging artifact and should be removed.

Comment on lines +162 to +169
# x[(x >= 0.0) & (x <= 0.25)] = 0.0
# x[(x > 0.25) & (x < 0.75)] = 0.5
# x[(x >= 0.75) & (x <= 1.25)] = 1.0
# x[(x > 1.25) & (x < 1.75)] = 1.5
# x[(x >= 1.75) & (x <= 2.5)] = 2.0
# x[(x > 2.5) & (x < 3.5)] = 3.0
# x[(x >= 3.5) & (x <= 5.0)] = 4.0
# x[x > 5.0] = 6.0


medium

The cast_to_fp4 function contains a block of commented-out code that should be removed to improve clarity.
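
For context, the commented-out thresholds encode round-to-nearest onto the positive E2M1 grid. If that behavior ever needs to be kept, the same table can be expressed without the mask chain; the sketch below is an illustration, not code from this PR, and its behavior at the exact half-way points may differ from the inclusive/exclusive pattern in the comments.

import torch

# Positive E2M1 grid and the midpoints between adjacent grid values.
_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
_E2M1_MIDPOINTS = torch.tensor([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0])

def round_abs_to_e2m1(x_abs: torch.Tensor) -> torch.Tensor:
    """Map non-negative magnitudes to the nearest positive E2M1 value."""
    idx = torch.bucketize(x_abs, _E2M1_MIDPOINTS.to(x_abs.device))
    return _E2M1_VALUES.to(x_abs.device)[idx]
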

@yiliu30 yiliu30 marked this pull request as ready for review July 10, 2025 06:40
@yiliu30 yiliu30 requested a review from xuechendi as a code owner July 10, 2025 06:40
Signed-off-by: Yi Liu <yiliu4@habana.ai>
@yiliu30 yiliu30 merged commit 449515a into hpu-mxfp8-moe Jul 16, 2025