
Conversation


@yiliu30 yiliu30 commented Jun 30, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as a before/after results comparison or e2e results.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

dsikka and others added 11 commits June 25, 2025 07:18
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: yiliu30 <yi4.liu@intel.com>

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for NVFP4 and MXFP8 quantization within vLLM, focusing on both general linear layers and Mixture of Experts (MoE) layers. It includes new quantization schemes, their integration with HPU-specific optimizations, and provides emulation capabilities for NVFP4 to ensure broader compatibility. The changes aim to improve performance and memory efficiency for models utilizing these advanced quantization formats.

Highlights

  • New Quantization Schemes: Introduced support for W4A4 NVFP4 (NVIDIA FP4) and W8A8 MXFP8 (Microscaling FP8) quantization schemes, significantly expanding vLLM's capability to handle various low-precision models.
  • HPU Integration and Optimization: Enhanced vLLM's compatibility with Habana Processing Units (HPUs) by integrating the new quantization schemes, adding HPU-specific environment variables (VLLM_USE_STATIC_MOE_HPU, VLLM_HPU_FORCE_CHANNEL_FP8), and providing an example HPU inference script.
  • NVFP4 Emulation Support: Implemented reference NVFP4 quantization and dequantization utilities, enabling these schemes to run via software emulation even on hardware platforms that do not natively support Cutlass NVFP4 operations (see the sketch after this list).
  • Mixture of Experts (MoE) Quantization: Extended MoE layers to support the new CompressedTensorsW4A4MoeMethod, incorporating both Marlin kernel integration and a static HPU emulation path for MoE operations.
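
For readers new to the format, here is a rough, self-contained fake-quantization sketch of the NVFP4 layout referenced above: 4-bit E2M1 values, one FP8-E4M3 scale per 16-element block, and a global FP32 scale. It is an illustration under those assumptions rather than the PR's actual emulation code, and the helper name nvfp4_fake_quant is invented for this example.

import torch

# Representable E2M1 magnitudes (sign handled separately); block size and range
# constants follow the NVFP4 convention of FP8-E4M3 block scales.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK, FP4_MAX, FP8_MAX = 16, 6.0, 448.0

def nvfp4_fake_quant(x: torch.Tensor) -> torch.Tensor:
    """Quantize x to the NVFP4 grid and dequantize it back (emulation only)."""
    assert x.numel() % BLOCK == 0, "tensor size must be a multiple of the block size"
    blocks = x.reshape(-1, BLOCK).float()
    grid = E2M1_GRID.to(blocks.device)
    # Global FP32 scale that maps the tensor-wide amax into FP8-E4M3 range.
    global_scale = FP8_MAX * FP4_MAX / blocks.abs().max().clamp(min=1e-12)
    # Per-block scales, themselves rounded through FP8-E4M3 storage.
    block_scale = blocks.abs().amax(dim=1, keepdim=True) / FP4_MAX
    block_scale = (block_scale * global_scale).to(torch.float8_e4m3fn).float() / global_scale
    block_scale = block_scale.clamp(min=1e-12)
    # Round each scaled element to the nearest E2M1 magnitude (tie-breaking details ignored).
    scaled = (blocks / block_scale).clamp(-FP4_MAX, FP4_MAX)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    quantized = grid[idx] * scaled.sign()
    return (quantized * block_scale).reshape(x.shape).to(x.dtype)

For example, nvfp4_fake_quant(torch.randn(128, 256)) round-trips a weight-shaped tensor through the grid; the schemes in this PR additionally handle packing, MoE weights, and HPU-specific paths.
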
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for new quantization formats, including MXFP8 and NVFP4, with a focus on HPU devices. The changes are extensive, adding new quantization schemes, MoE methods, and emulation utilities. A new example script for HPU inference is also included.

My review has identified a critical bug in an __init__.py file due to a missing comma, which could lead to import errors. I've also found some dead code, leftover debugging artifacts, and areas for cleanup, particularly in the new example script. Addressing these points will improve the quality and maintainability of the code.

Comment on lines +22 to +23
"CompressedTensorsW4A4Fp4"
"CompressedTensorsW8A8MXFp8"


critical

There is a missing comma between the two string literals in the __all__ list. This will cause Python to concatenate them into a single string: 'CompressedTensorsW4A4Fp4CompressedTensorsW8A8MXFp8'. This is a bug and will prevent CompressedTensorsW8A8MXFp8 from being correctly exported, potentially causing ImportError downstream.

Suggested change
- "CompressedTensorsW4A4Fp4"
- "CompressedTensorsW8A8MXFp8"
+ "CompressedTensorsW4A4Fp4",
+ "CompressedTensorsW8A8MXFp8"

Comment on lines +3 to +13
model_path = "/models/Qwen3-32B"
model_path = "/models/DeepSeek-R1-Distill-Qwen-7B"
model_path= "/mnt/disk3/yiliu4/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic"
model_path = "/software/users/yiliu4/HF_HOME/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"
model_path = "/mnt/disk3/yiliu4/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor/"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.3-70B-Instruct-NVFP4-llmc"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Yi30/Llama-3.2-1B-Instruct-MXFP8-llmc"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Yi30/Llama-3.3-70B-Instruct-MXFP8-llmc"


high

This example script contains multiple assignments to model_path with hardcoded, user-specific paths. This is confusing and reduces the script's reusability. It's better to have a single, well-defined default model path (e.g., a public model from Hugging Face Hub) or a clear placeholder, and let the user override it with the --model_path argument.
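
As a concrete illustration of that suggestion (not code from this PR; the default model name is a placeholder), the script's argument handling could look roughly like this:

import argparse

def parse_args() -> argparse.Namespace:
    # Hypothetical cleanup: a single public default model plus a --model_path
    # override, replacing the stack of hardcoded, user-specific paths.
    parser = argparse.ArgumentParser(description="Basic HPU inference example")
    parser.add_argument(
        "--model_path",
        default="meta-llama/Llama-3.2-1B-Instruct",  # placeholder default, not from the PR
        help="Hugging Face Hub model name or local checkpoint path",
    )
    parser.add_argument("--tp", type=int, default=1, help="Tensor-parallel size")
    return parser.parse_args()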

Comment on lines +104 to +152
# INFO 06-26 17:17:55 [llm_engine.py:439] init engine (profile, create kv cache, warmup model) took 53.52 seconds
# Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 197.22it/s]
# Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54973) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54971) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54975) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54969) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54972) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54970) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54974) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
# (VllmWorkerProcess pid=54974) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# (VllmWorkerProcess pid=54969) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# (VllmWorkerProcess pid=54971) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# (VllmWorkerProcess pid=54973) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# (VllmWorkerProcess pid=54970) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# (VllmWorkerProcess pid=54975) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# (VllmWorkerProcess pid=54972) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
# Processed prompts: 100%|██████████████████████████████████████████████| 4/4 [09:06<00:00, 136.71s/it, est. speed input: 0.05 toks/s, output: 0.12 toks/s]

# Generated Outputs:
# ------------------------------------------------------------
# Prompt: 'Hello, my name is'
# Output: ' Tony, I am a Software Engineer at Google. I have been working on the'
# ------------------------------------------------------------
# Prompt: 'The president of the United States is'
# Output: ' the head of state and head of government of the United States. The president is'
# ------------------------------------------------------------
# Prompt: 'The capital of France is'
# Output: ' always a good idea, but there are many other destinations that are worth visiting as'
# ------------------------------------------------------------
# Prompt: 'The future of AI is'
# Output: " here, and it's more powerful than ever. But with great power comes great"
# ------------------------------------------------------------
# Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1055.37it/s]
# Processed prompts: 100%|██████████████████████████████████████████████| 4/4 [09:35<00:00, 143.78s/it, est. speed input: 0.05 toks/s, output: 0.11 toks/s]
# Time taken for second inference: 575.12 seconds
# INFO 06-26 17:36:37 [multiproc_worker_utils.py:139] Terminating local vLLM worker processes
# (VllmWorkerProcess pid=54969) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# (VllmWorkerProcess pid=54973) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# (VllmWorkerProcess pid=54971) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# (VllmWorkerProcess pid=54970) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# (VllmWorkerProcess pid=54972) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# (VllmWorkerProcess pid=54975) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# (VllmWorkerProcess pid=54974) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
# <p22> yiliu4@yiliu4-63gd-g3-l-vm:basic$ /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
# warnings.warn('resource_tracker: There appear to be %d '

# <p22> yiliu4@yiliu4-63gd-g3-l-vm:basic$ p basic_hpu.py --tp 8 No newline at end of file


high

This file contains a large block of commented-out logs and shell output. This content appears to be leftover from a debugging session and should be removed. It adds noise to the example script, making it difficult to read and understand.

Comment on lines +175 to +180
return self.fp8_linear.apply(input=x,
weight=layer.weight,
weight_scale=layer.weight_scale,
out_dtype=self.out_dtype,
input_scale=layer.input_scale,
bias=bias)


high

This block of code is unreachable due to the return statement on line 173. It appears to be dead code from a previous implementation and should be removed to improve code clarity and maintainability. The self.fp8_linear attribute it references is also commented out in the __init__ method.

Comment on lines +326 to +329
is_per_tensor_group_weight = (weight_quant.strategy in [

QuantizationStrategy.TENSOR_GROUP
])


medium

For checking against a single item, using the equality operator == is more direct and readable than using in with a single-element list.

        is_per_tensor_group_weight = (
            weight_quant.strategy == QuantizationStrategy.TENSOR_GROUP
        )

self, weight_quant: BaseModel,
input_quant: BaseModel) -> "CompressedTensorsScheme":

# breakpoint()


medium

A commented-out breakpoint() call is present here. This appears to be a leftover from debugging and should be removed.

raise RuntimeError(
f"Unsupported FusedMoe scheme: {weight_quant}, {input_quant}")

import vllm.envs as envs


medium

The import of vllm.envs is done locally within the class. According to PEP 8, imports should be at the top of the file. This improves readability and helps avoid potential circular import issues.
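
For clarity, the fix being suggested is just to hoist the statement to module scope, roughly:

# At the top of the file, alongside the other module-level imports
# (sketch of the suggested placement, not the PR's actual diff):
import vllm.envs as envs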

Comment on lines +300 to +319
# w_state_dict = [
# "w13_weight_packed:torch.Size([64, 2816, 1024])",
# "w13_weight_scale:torch.Size([64, 2816, 128])",
# "w13_weight_global_scale:torch.Size([64, 2])",
# "w13_input_global_scale:torch.Size([64, 2])",
# "w2_weight_packed:torch.Size([64, 2048, 704])",
# "w2_weight_scale:torch.Size([64, 2048, 88])",
# "w2_weight_global_scale:torch.Size([64])",
# "w2_input_global_scale:torch.Size([64])",
# ]
# w_list = [
# "w13_weight_packed",
# "w13_weight_scale",
# "w13_weight_global_scale",
# "w13_input_global_scale",
# "w2_weight_packed",
# "w2_weight_scale",
# "w2_weight_global_scale",
# "w2_input_global_scale",
# ]


medium

This block of commented-out code seems to be for debugging purposes. It should be removed to keep the codebase clean.
