Nvfp4 static gs #61
Conversation
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enhancing the NVFP4 quantization scheme by introducing support for static global scaling for input activations. This change aims to provide more control and potentially improve the accuracy or stability of quantized operations, particularly within Mixture of Experts (MoE) layers and W4A4 configurations. Additionally, it includes updates to example scripts to facilitate testing of these new quantization behaviors and related inference settings.
Highlights
- NVFP4 Static Global Scaling: Implemented support for static global scaling in NVFP4 quantization by introducing qdq_nvfp4_with_gs and modifying to_nvfp4 to accept and use a pre-defined global scale for input quantization. This allows for more consistent scaling across operations (a minimal sketch follows this list).
- Integration with MoE and W4A4 Schemes: Updated the NVFP4 Mixture of Experts (MoE) and W4A4 quantization schemes to use the new static global scaling for input quantization during GEMM operations, so these quantization paths benefit from the improved scaling control.
- Testing Infrastructure Updates: Added a new command-line argument (--next_token) and environment variable controls (VLLM_DELAYED_SAMPLING, max-num-prefill-seqs) to the start_vllm.sh example script. These changes make it easier to test specific inference scenarios, such as next-token prediction, in conjunction with the NVFP4 quantization updates.
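For context, a minimal sketch of how a quantize-dequantize helper driven by a static global scale could be assembled from the to_nvfp4 and dequant_nvfp4 helpers this PR touches is shown below. The import path and exact signatures are assumptions inferred from the diff excerpts in the review comments, not the code under review.

```python
# Minimal sketch, not the PR's actual implementation. It assumes to_nvfp4 and
# dequant_nvfp4 are importable from the nvfp4_qdq module this PR modifies and
# that to_nvfp4 returns (data_lp, block_scales, per_tensor_scale).
import torch

from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
    dequant_nvfp4, to_nvfp4)


def qdq_nvfp4_with_gs(x: torch.Tensor,
                      x_global_scale: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize x using a pre-computed (static) global scale."""
    # Quantize with the caller-supplied global scale; the third return value
    # (the scale derived inside to_nvfp4) is ignored because the static scale
    # is reused everywhere.
    data_lp, x_scale, _ = to_nvfp4(x, x_global_scale, do_pack=False)
    # Dequantize with the same static scale so every call to this helper
    # applies identical scaling, which the MoE and W4A4 paths rely on.
    return dequant_nvfp4(
        data_lp,
        x_scale,
        x_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )
```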
Code Review
This pull request appears to introduce support for static global scaling for NVFP4 quantization, including a new "next token prediction" feature controlled by a NEXT_TOKEN flag.
While the core logic for passing the global scale seems to be on the right track, the implementation in vllm/model_executor/layers/quantization/utils/nvfp4_qdq.py has several critical issues, including an UnboundLocalError and a duplicated function definition with bugs in both implementations. These need to be addressed.
Additionally, the example scripts (basic_hpu.py and start_vllm.sh) contain leftover debugging code, such as print statements and redundant variable assignments, which should be cleaned up for clarity and maintainability.
I've provided detailed comments and suggestions to address these points.
Pull Request Overview
This PR introduces static global scale support for NVFP4 quantization by adding optional global scale parameters to quantization functions and creating specialized variants for handling pre-computed global scales.
- Adds an optional x_global_scale parameter to existing functions and creates new *_with_gs variants
- Updates the MoE layer to use input global scales for quantization operations
- Modifies the compressed-tensors schemes to utilize the new global scale functionality
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm/model_executor/layers/quantization/utils/nvfp4_qdq.py | Adds global scale parameters and creates specialized functions for handling pre-computed scales |
| vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py | Updates to use new global scale function variant |
| vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py | Adds input global scale support and improves parameter naming |
| examples/offline_inference/basic/basic_hpu.py | Adds debug print and commented model path options |
        per_tensor_scale=x_global_scale,
        do_pack=do_pack,
    )
    return data_lp, out_scales, per_tensor_scale
Copilot AI commented on Aug 1, 2025:
The function returns per_tensor_scale but when x_global_scale is provided, it should return x_global_scale instead since that's the actual global scale being used.
Suggested change:
-    return data_lp, out_scales, per_tensor_scale
+    return data_lp, out_scales, x_global_scale
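One way to apply this fix inside to_nvfp4 is to normalize the scale up front so the same variable is both used for quantization and returned. This is only a hedged sketch; compute_per_tensor_scale and quantize_blocks are hypothetical placeholders for logic that already exists inside the function.

```python
# Hedged sketch of the suggested fix, not the real to_nvfp4 body.
# compute_per_tensor_scale and quantize_blocks are hypothetical placeholders.
def to_nvfp4(x, x_global_scale=None, do_pack=True):
    if x_global_scale is None:
        # No static scale supplied: derive one from the current tensor.
        per_tensor_scale = compute_per_tensor_scale(x)
    else:
        # Static scale supplied: use it for quantization...
        per_tensor_scale = x_global_scale
    data_lp, out_scales = quantize_blocks(x, per_tensor_scale, do_pack=do_pack)
    # ...and report the scale that was actually used back to the caller,
    # which is what the suggestion above amounts to.
    return data_lp, out_scales, per_tensor_scale
```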
def qdq_nvfp4(x, x_global_scale=None):
    if envs.VLLM_DISABLE_INPUT_QDQ:
        return x

    data_lp, x_scale = to_nvfp4(x, x_global_scale, do_pack=False)
    x_dq = dequant_nvfp4(
        data_lp,
        x_scale,
        x_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )
    return x_dq
Copilot AI commented on Aug 1, 2025:
This function definition is duplicated. There are two qdq_nvfp4 functions defined at lines 332 and 347, which will cause the second definition to override the first.
Suggested change (keep a single definition):
def qdq_nvfp4(x, x_global_scale=None):
    if envs.VLLM_DISABLE_INPUT_QDQ:
        return x
    data_lp, x_scale = to_nvfp4(x, x_global_scale, do_pack=False)
    x_dq = dequant_nvfp4(
        data_lp,
        x_scale,
        x_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )
    return x_dq
# Removed duplicate definition of qdq_nvfp4 to resolve redundancy.
    if envs.VLLM_DISABLE_INPUT_QDQ:
        return x

    data_lp, x_scale = to_nvfp4(x, x_global_scale, do_pack=False)
Copilot AI commented on Aug 1, 2025:
In the original implementation this call unpacked 3 return values from to_nvfp4, but here only 2 are unpacked. Since to_nvfp4 returns 3 values, this line will raise an unpacking error.
Suggested change:
-    data_lp, x_scale = to_nvfp4(x, x_global_scale, do_pack=False)
+    data_lp, x_scale, _ = to_nvfp4(x, x_global_scale, do_pack=False)
    x_dq = dequant_nvfp4(
        data_lp,
        x_scale,
        x_global_scale,
Copilot AI commented on Aug 1, 2025:
When x_global_scale is None, this will pass None to dequant_nvfp4, but the function expects a valid scale tensor. The global scale should be extracted from the to_nvfp4 return value.
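A hedged sketch of the fallback this comment describes is shown below: when the caller supplies no static scale, reuse the global scale that to_nvfp4 computed so dequant_nvfp4 never receives None. It assumes to_nvfp4 returns the scale it used as its third value; it is not the PR's actual code.

```python
# Hedged sketch of the suggested fallback, not the PR's actual code.
# Assumes to_nvfp4 returns the global scale it used as its third value.
def qdq_nvfp4(x, x_global_scale=None):
    if envs.VLLM_DISABLE_INPUT_QDQ:
        return x
    data_lp, x_scale, used_global_scale = to_nvfp4(
        x, x_global_scale, do_pack=False)
    # Prefer the caller's static scale; otherwise use the scale to_nvfp4
    # derived, so dequant_nvfp4 always receives a valid scale tensor.
    if x_global_scale is None:
        x_global_scale = used_global_scale
    return dequant_nvfp4(
        data_lp,
        x_scale,
        x_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )
```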
Essential Elements of an Effective PR Description Checklist
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.
Purpose
Test Plan
Test Result
BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)