
Conversation


@yiliu30 yiliu30 commented Aug 13, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Yi4Liu added 4 commits August 8, 2025 10:02
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an "even rounding" method for MXFP4 (Mixed-Precision 4-bit Floating Point) quantization, aiming to improve numerical precision. It also includes a significant refactoring of the project's dependency management by removing explicit PyTorch and related library version constraints from various requirements.txt files and pyproject.toml, likely to allow for more flexible or system-managed installations. Additionally, several example scripts have been updated, possibly for testing the new quantization method and other model configurations.

Highlights

  • MXFP4 Even Rounding: Implemented an "even rounding" method for MXFP4 quantization, which can be enabled via the VLLM_MXFP4_EVEN_ROUNDING environment variable (an illustrative sketch follows this list).
  • Dependency Management Refactoring: Refactored dependency management by removing explicit torch, torchaudio, torchvision, and xformers version pins from pyproject.toml and various requirements/*.txt files across different hardware platforms (CPU, CUDA, ROCm, Neuron, TPU, XPU).
  • Example Script Updates: Updated example inference scripts (basic_hpu.py, start_vllm.sh) to include new model paths, adjust memory utilization settings, increase maximum sequence length, and modify lm_eval parameters.
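
For context on the first highlight, here is a minimal sketch of what an even-rounding MXFP4 fake-quantizer could look like. This is illustrative only and is not the code from this PR: the function names, the block size of 32, the shared-scale computation, and the tie-to-even handling on the E2M1 grid are all assumptions; only the VLLM_MXFP4_EVEN_ROUNDING variable name comes from the PR description.

import os

import torch

# Non-negative values representable by FP4 (E2M1); 6.0 is the largest magnitude.
FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
USE_EVEN_ROUNDING = os.getenv("VLLM_MXFP4_EVEN_ROUNDING", "0") == "1"

def _round_to_fp4_even(mag: torch.Tensor) -> torch.Tensor:
    # Round non-negative magnitudes to the nearest E2M1 value; on an exact
    # midpoint pick the neighbour whose code index is even (ties-to-even).
    d = (mag.unsqueeze(-1) - FP4_POS).abs()
    lo = d.argmin(dim=-1)  # argmin resolves exact ties toward the lower code
    hi = (lo + 1).clamp(max=FP4_POS.numel() - 1)
    tie = d.gather(-1, lo.unsqueeze(-1)) == d.gather(-1, hi.unsqueeze(-1))
    idx = torch.where(tie.squeeze(-1) & (lo % 2 == 1), hi, lo)
    return FP4_POS[idx]

def fake_quant_mxfp4(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    # Quantize-dequantize x with one shared power-of-two scale per block of 32
    # (CPU float32 tensors assumed; x.numel() must be divisible by block_size).
    blocks = x.reshape(-1, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp_min(torch.finfo(x.dtype).tiny)
    # Shared exponent in the OCP MX style: floor(log2(amax)) minus the element
    # format's largest exponent (2 for E2M1).
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)
    scaled = (blocks / scale).clamp(-6.0, 6.0)
    mag = scaled.abs()
    if USE_EVEN_ROUNDING:
        q = torch.sign(scaled) * _round_to_fp4_even(mag)
    else:
        # Baseline in this sketch: nearest grid value, ties toward the lower code.
        q = torch.sign(scaled) * FP4_POS[(mag.unsqueeze(-1) - FP4_POS).abs().argmin(dim=-1)]
    return (q * scale).reshape(x.shape)
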
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an alternative "even rounding" implementation for MXFP4 quantization, controlled by the VLLM_MXFP4_EVEN_ROUNDING environment variable. The PR also contains substantial changes to dependency management, removing pinned versions of torch and its related packages from all requirements files. This is a major change that could impact build stability and reproducibility, and its motivation should be clarified in the PR description. Additionally, several example and testing scripts have been modified with hardcoded paths and commented-out code, which appear to be remnants of local development and should be cleaned up before merging. A critical syntax error was found in the new quantization utility function.
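
For reference, vLLM typically surfaces feature flags like this through vllm/envs.py rather than ad-hoc os.environ reads. A hedged sketch of what such a registration might look like follows; the entry, its default, and the boolean parsing shown here are assumptions, and only the variable name VLLM_MXFP4_EVEN_ROUNDING comes from this PR.

import os

# vllm/envs.py keeps a mapping of variable names to lazily evaluated getters;
# a new flag would usually be added as one more entry, for example:
environment_variables = {
    # Hypothetical entry: opt into round-half-to-even for MXFP4 quantization.
    "VLLM_MXFP4_EVEN_ROUNDING":
        lambda: os.getenv("VLLM_MXFP4_EVEN_ROUNDING", "0") in ("1", "true", "True"),
}

Call sites would then check envs.VLLM_MXFP4_EVEN_ROUNDING instead of reading the environment directly, which keeps the flag discoverable and documented in one place.
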

Comment on lines +317 to +320
# echo "Stopping vLLM server"
#kill ${pid}
#echo "Script execution completed"
#sleep 10


high

The server cleanup logic (kill ${pid}) is commented out. This will leave the vLLM server process running after the script completes, which can consume resources unnecessarily. If this was disabled for debugging, please re-enable it before merging.

Suggested change
# echo "Stopping vLLM server"
#kill ${pid}
#echo "Script execution completed"
#sleep 10
echo "Stopping vLLM server"
kill ${pid}
echo "Script execution completed"
sleep 10

Comment on lines 40 to +46
model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-V2-Lite-NVFP4-autoround"
# model_path = "/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE"
model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"

model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-MXFP4-llmc"
# model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-W4A4-RTN-GLOBAL-SCALE-WW"

model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN"


medium

This block contains multiple assignments to model_path, many of which are immediately overwritten. This appears to be for local testing and should be cleaned up. Please consolidate this to a single default model_path and rely on the command-line argument --model_path to specify different models for testing.
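
One possible shape for that cleanup (a sketch only; the default path is taken from the last assignment in the diff, and the flag name --model_path follows the review text):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_path",
    default="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN",
    help="Model to load; pass a different path instead of editing the script.",
)
args = parser.parse_args()
model_path = args.model_path
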

Comment on lines 15 to +21
nvfp4_model_path=/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround/
nvfp4_model_path="/software/users/yiliu4/deepseek-ai/DeepSeek-R1-nvfp4-fix-723"
nvfp4_model_path="/software/users/yiliu4/deepseek-ai/DeepSeek-R1-nvfp4-fix-723-skip-atten"
nvfp4_model_path=/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround"


medium

There are multiple assignments to nvfp4_model_path, including a duplicate. This appears to be for local testing and should be cleaned up to avoid confusion. Please retain only the necessary model path assignments.

model_base_name=$(basename $model_path)

EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output_${task_name}_bs${batch_size}__${timestamp}"
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output__bs${batch_size}__${timestamp}"


medium

The task_name variable is no longer included in EVAL_LOG_NAME, but it is still used in the echo command on line 297 and the lm_eval command on line 303. This creates a mismatch between the log message and the actual log file name. For better traceability, consider adding task_name back to the log file name.

Suggested change
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output__bs${batch_size}__${timestamp}"
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output_${task_name}_bs${batch_size}__${timestamp}"

yiliu30 and others added 2 commits August 13, 2025 14:54
…utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
@yiliu30 yiliu30 merged commit c9c5d12 into hpu-mxfp8-moe Aug 16, 2025