Hpu mxfp8 #48
base: habana-mai-79ba403d5
Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for NVFP4 and MXFP8 quantization within vLLM, focusing on both general linear layers and Mixture of Experts (MoE) layers. It includes new quantization schemes, their integration with HPU-specific optimizations, and provides emulation capabilities for NVFP4 to ensure broader compatibility. The changes aim to improve performance and memory efficiency for models utilizing these advanced quantization formats.
Highlights
- New Quantization Schemes: Introduced support for W4A4 NVFP4 (NVIDIA FP4) and W8A8 MXFP8 (Microscaling FP8) quantization schemes, significantly expanding vLLM's capability to handle various low-precision models.
- HPU Integration and Optimization: Enhanced vLLM's compatibility with Habana Processing Units (HPUs) by integrating the new quantization schemes, adding HPU-specific environment variables (VLLM_USE_STATIC_MOE_HPU, VLLM_HPU_FORCE_CHANNEL_FP8), and providing an example HPU inference script.
- NVFP4 Emulation Support: Implemented reference NVFP4 quantization and dequantization utilities, enabling these schemes to run via software emulation even on hardware platforms that do not natively support Cutlass NVFP4 operations (a fake-quantization sketch follows this list).
- Mixture of Experts (MoE) Quantization: Extended MoE layers to support the new CompressedTensorsW4A4MoeMethod, incorporating both Marlin kernel integration and a static HPU emulation path for MoE operations.
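To make the emulation idea concrete, below is a minimal fake-quantization sketch in PyTorch. It is illustrative only and not taken from this PR: the block sizes (16 for NVFP4, 32 for MXFP8), the E2M1/E4M3 element grids, and the power-of-two shared scale follow the public format descriptions, while global scales, packed storage, and HPU kernels are omitted. It assumes a recent PyTorch build that provides torch.float8_e4m3fn.

```python
# Minimal quantize->dequantize emulation sketch (assumptions noted above).
import torch

# Representable magnitudes of FP4 E2M1, the element type used by NVFP4.
_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_nvfp4(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Round-trip `x` through blockwise FP4 with a shared per-block scale."""
    blocks = x.float().reshape(-1, block_size)
    # Per-block scale so the block maximum maps onto the largest FP4 value (6.0).
    scale = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0
    scaled = blocks / scale
    # Snap every element to the nearest representable E2M1 magnitude.
    grid = _E2M1_GRID.to(blocks.device)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return (grid[idx] * scaled.sign() * scale).reshape(x.shape).to(x.dtype)


def fake_quant_mxfp8(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Round-trip `x` through blockwise FP8 E4M3 with a power-of-two scale."""
    blocks = x.float().reshape(-1, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # E8M0-style shared scale: a power of two chosen so the block fits the
    # E4M3 range (448 is the E4M3 maximum); out-of-range values saturate.
    scale = torch.exp2(torch.floor(torch.log2(amax)) -
                       torch.floor(torch.log2(torch.tensor(448.0))))
    q = (blocks / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).float()
    return (q * scale).reshape(x.shape).to(x.dtype)


# Example: round-trip a random weight tile and inspect the worst-case error.
w = torch.randn(128, 256)
print((w - fake_quant_nvfp4(w)).abs().max(), (w - fake_quant_mxfp8(w)).abs().max())
```

Running the round trip on a random tile like this gives a quick, hardware-independent sense of the error each format introduces, which is essentially what an emulation path provides on devices without native kernels.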
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces support for new quantization formats, including MXFP8 and NVFP4, with a focus on HPU devices. The changes are extensive, adding new quantization schemes, MoE methods, and emulation utilities. A new example script for HPU inference is also included.
My review has identified a critical bug in an __init__.py file due to a missing comma, which could lead to import errors. I've also found some dead code, leftover debugging artifacts, and areas for cleanup, particularly in the new example script. Addressing these points will improve the quality and maintainability of the code.
| "CompressedTensorsW4A4Fp4" | ||
| "CompressedTensorsW8A8MXFp8" |
There is a missing comma between the two string literals in the __all__ list. This will cause Python to concatenate them into a single string: 'CompressedTensorsW4A4Fp4CompressedTensorsW8A8MXFp8'. This is a bug and will prevent CompressedTensorsW8A8MXFp8 from being correctly exported, potentially causing ImportError downstream.
| "CompressedTensorsW4A4Fp4" | |
| "CompressedTensorsW8A8MXFp8" | |
| "CompressedTensorsW4A4Fp4", | |
| "CompressedTensorsW8A8MXFp8" |
| model_path = "/models/Qwen3-32B" | ||
| model_path = "/models/DeepSeek-R1-Distill-Qwen-7B" | ||
| model_path= "/mnt/disk3/yiliu4/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic" | ||
| model_path = "/software/users/yiliu4/HF_HOME/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor" | ||
| model_path = "/mnt/disk3/yiliu4/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor/" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.3-70B-Instruct-NVFP4-llmc" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Yi30/Llama-3.2-1B-Instruct-MXFP8-llmc" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Yi30/Llama-3.3-70B-Instruct-MXFP8-llmc" |
This example script contains multiple assignments to model_path with hardcoded, user-specific paths. This is confusing and reduces the script's reusability. It's better to have a single, well-defined default model path (e.g., a public model from Hugging Face Hub) or a clear placeholder, and let the user override it with the --model_path argument.
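A minimal sketch of that pattern (the default model ID below is a hypothetical placeholder; the --model_path and --tp flag names mirror the ones already used by this example script):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="HPU inference example")
    # One resolvable public default; local checkpoints are passed on the
    # command line instead of being hardcoded in the script.
    parser.add_argument(
        "--model_path",
        default="meta-llama/Llama-3.2-1B-Instruct",  # hypothetical placeholder
        help="Hugging Face model ID or local path to the (quantized) model.",
    )
    parser.add_argument("--tp", type=int, default=1, help="Tensor parallel size.")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Loading {args.model_path} with tensor_parallel_size={args.tp}")
```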
    # INFO 06-26 17:17:55 [llm_engine.py:439] init engine (profile, create kv cache, warmup model) took 53.52 seconds
    # Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 197.22it/s]
    # Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54973) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54971) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54975) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54969) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54972) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54970) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54974) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54974) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54969) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54971) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54973) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54970) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54975) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54972) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # Processed prompts: 100%|██████████████████████████████████████████████| 4/4 [09:06<00:00, 136.71s/it, est. speed input: 0.05 toks/s, output: 0.12 toks/s]
    # Generated Outputs:
    # ------------------------------------------------------------
    # Prompt: 'Hello, my name is'
    # Output: ' Tony, I am a Software Engineer at Google. I have been working on the'
    # ------------------------------------------------------------
    # Prompt: 'The president of the United States is'
    # Output: ' the head of state and head of government of the United States. The president is'
    # ------------------------------------------------------------
    # Prompt: 'The capital of France is'
    # Output: ' always a good idea, but there are many other destinations that are worth visiting as'
    # ------------------------------------------------------------
    # Prompt: 'The future of AI is'
    # Output: " here, and it's more powerful than ever. But with great power comes great"
    # ------------------------------------------------------------
    # Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1055.37it/s]
    # Processed prompts: 100%|██████████████████████████████████████████████| 4/4 [09:35<00:00, 143.78s/it, est. speed input: 0.05 toks/s, output: 0.11 toks/s]
    # Time taken for second inference: 575.12 seconds
    # INFO 06-26 17:36:37 [multiproc_worker_utils.py:139] Terminating local vLLM worker processes
    # (VllmWorkerProcess pid=54969) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54973) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54971) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54970) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54972) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54975) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54974) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # <p22> yiliu4@yiliu4-63gd-g3-l-vm:basic$ /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
    # warnings.warn('resource_tracker: There appear to be %d '
    # <p22> yiliu4@yiliu4-63gd-g3-l-vm:basic$ p basic_hpu.py --tp 8
    return self.fp8_linear.apply(input=x,
                                 weight=layer.weight,
                                 weight_scale=layer.weight_scale,
                                 out_dtype=self.out_dtype,
                                 input_scale=layer.input_scale,
                                 bias=bias)
    is_per_tensor_group_weight = (weight_quant.strategy in [
        QuantizationStrategy.TENSOR_GROUP
    ])
    self, weight_quant: BaseModel,
    input_quant: BaseModel) -> "CompressedTensorsScheme":
    # breakpoint()
    raise RuntimeError(
        f"Unsupported FusedMoe scheme: {weight_quant}, {input_quant}")

    import vllm.envs as envs
    # w_state_dict = [
    #     "w13_weight_packed:torch.Size([64, 2816, 1024])",
    #     "w13_weight_scale:torch.Size([64, 2816, 128])",
    #     "w13_weight_global_scale:torch.Size([64, 2])",
    #     "w13_input_global_scale:torch.Size([64, 2])",
    #     "w2_weight_packed:torch.Size([64, 2048, 704])",
    #     "w2_weight_scale:torch.Size([64, 2048, 88])",
    #     "w2_weight_global_scale:torch.Size([64])",
    #     "w2_input_global_scale:torch.Size([64])",
    # ]
    # w_list = [
    #     "w13_weight_packed",
    #     "w13_weight_scale",
    #     "w13_weight_global_scale",
    #     "w13_input_global_scale",
    #     "w2_weight_packed",
    #     "w2_weight_scale",
    #     "w2_weight_global_scale",
    #     "w2_input_global_scale",
    # ]
Purpose
Test Plan
Test Result