Even rounding for mxfp4 #65
Conversation
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an "even rounding" method for MXFP4 (Microscaling FP4) quantization, aiming to improve numerical precision. It also refactors the project's dependency management by removing explicit PyTorch and related library version constraints from various requirements.txt files and pyproject.toml, likely to allow more flexible or system-managed installations. Additionally, several example scripts have been updated, apparently to exercise the new quantization method and other model configurations.
Highlights
- MXFP4 Even Rounding: Implemented an "even rounding" method for MXFP4 quantization, which can be enabled via the VLLM_MXFP4_EVEN_ROUNDING environment variable (see the sketch after this list).
- Dependency Management Refactoring: Removed explicit torch, torchaudio, torchvision, and xformers version pins from pyproject.toml and the various requirements/*.txt files across hardware platforms (CPU, CUDA, ROCm, Neuron, TPU, XPU).
- Example Script Updates: Updated example inference scripts (basic_hpu.py, start_vllm.sh) to include new model paths, adjust memory utilization settings, increase the maximum sequence length, and modify lm_eval parameters.
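For context on what "even rounding" means here, the sketch below illustrates round-to-nearest with ties broken toward the even quantization level, applied to the MXFP4 (E2M1) magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The helper name and the assumption that inputs are already divided by their block scale are illustrative only; the actual implementation gated by VLLM_MXFP4_EVEN_ROUNDING lives in mxfp4_emulation_utils.py and may differ in detail.

```python
import torch

# MXFP4 (E2M1) representable magnitudes per the OCP MX spec.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_to_fp4_even(x: torch.Tensor) -> torch.Tensor:
    """Map already block-scaled values onto the FP4 grid, breaking exact ties
    toward the even grid index (banker's rounding on the quantization level)."""
    grid = FP4_GRID.to(dtype=x.dtype, device=x.device)
    sign = torch.sign(x)
    mag = x.abs().clamp(max=grid[-1])
    # Index of the closest grid value at or below `mag`, and its upper neighbour.
    lo = (torch.searchsorted(grid, mag, right=True) - 1).clamp(min=0)
    hi = (lo + 1).clamp(max=grid.numel() - 1)
    d_lo = mag - grid[lo]
    d_hi = grid[hi] - mag
    # Take the upper neighbour only when it is strictly closer, or on an exact
    # tie when its index is even.
    pick_hi = (d_hi < d_lo) | ((d_hi == d_lo) & (hi % 2 == 0))
    idx = torch.where(pick_hi, hi, lo)
    return sign * grid[idx]


# Example: 0.25 (halfway between 0.0 and 0.5) rounds to 0.0, and 2.5 (halfway
# between 2.0 and 3.0) rounds to 2.0, both landing on even grid indices.
print(round_to_fp4_even(torch.tensor([0.25, 2.5, -5.1])))
```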
Code Review
This pull request introduces an alternative "even rounding" implementation for MXFP4 quantization, controlled by the VLLM_MXFP4_EVEN_ROUNDING environment variable. The PR also contains substantial changes to dependency management, removing pinned versions of torch and its related packages from all requirements files. This is a major change that could impact build stability and reproducibility, and its motivation should be clarified in the PR description. Additionally, several example and testing scripts have been modified with hardcoded paths and commented-out code, which appear to be remnants of local development and should be cleaned up before merging. A critical syntax error was found in the new quantization utility function.
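For reviewers who want to exercise the new path, a minimal way to opt in is sketched below. It assumes the flag is read from the process environment before quantization runs; the accepted values and the exact point at which it is consulted are not confirmed by this PR, and the model path is just one of the paths appearing in the example scripts.

```python
import os

# Assumed usage: enable the even-rounding MXFP4 path via the environment
# variable named in this PR before vLLM reads its configuration.
os.environ["VLLM_MXFP4_EVEN_ROUNDING"] = "1"

from vllm import LLM  # noqa: E402  (import after setting the flag)

# Model path taken from the example scripts touched in this PR.
llm = LLM(model="/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-MXFP4-llmc")
print(llm.generate("Hello, world!"))
```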
vllm/model_executor/layers/quantization/utils/mxfp4_emulation_utils.py (review thread resolved; comment outdated)
| # echo "Stopping vLLM server" | ||
| #kill ${pid} | ||
| #echo "Script execution completed" | ||
| #sleep 10 |
The server cleanup logic (kill ${pid}) is commented out, which will leave the vLLM server process running after the script completes and can consume resources unnecessarily. If this was done for debugging, please re-enable it before merging.
| # echo "Stopping vLLM server" | |
| #kill ${pid} | |
| #echo "Script execution completed" | |
| #sleep 10 | |
| echo "Stopping vLLM server" | |
| kill ${pid} | |
| echo "Script execution completed" | |
| sleep 10 |
| model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-V2-Lite-NVFP4-autoround" | ||
| # model_path = "/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE" | ||
| model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN" | ||
|
|
||
| model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-MXFP4-llmc" | ||
| # model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-W4A4-RTN-GLOBAL-SCALE-WW" | ||
|
|
||
| model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN" |
This block contains multiple assignments to model_path, many of which are immediately overwritten. This appears to be for local testing and should be cleaned up. Please consolidate this to a single default model_path and rely on the command-line argument --model_path to specify different models for testing.
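As a concrete illustration of that suggestion, a single default plus a command-line override could look like the following. The default path is simply one of the paths already in the diff, and the argument name mirrors the existing --model_path flag; this is a sketch, not the script's actual argument parsing.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_path",
    default="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-MXFP4-RTN",
    help="Checkpoint to load; override on the command line for other models.",
)
args = parser.parse_args()
model_path = args.model_path  # single source of truth instead of repeated assignments
```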
nvfp4_model_path=/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround/
nvfp4_model_path="/software/users/yiliu4/deepseek-ai/DeepSeek-R1-nvfp4-fix-723"
nvfp4_model_path="/software/users/yiliu4/deepseek-ai/DeepSeek-R1-nvfp4-fix-723-skip-atten"
nvfp4_model_path=/software/users/yiliu4/deepseek-ai/DeepSeek-R1-NVFP4-OFFLINE
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-RTN"
nvfp4_model_path="/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-R1-NVFP4-autoround"
model_base_name=$(basename $model_path)
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output_${task_name}_bs${batch_size}__${timestamp}"
EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output__bs${batch_size}__${timestamp}"
The task_name variable is no longer included in EVAL_LOG_NAME, but it is still used in the echo command on line 297 and the lm_eval command on line 303. This creates a mismatch between the log message and the actual log file name. For better traceability, consider adding task_name back to the log file name.
| EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output__bs${batch_size}__${timestamp}" | |
| EVAL_LOG_NAME="mxfp8_${model_base_name}_lm_eval_output_${task_name}_bs${batch_size}__${timestamp}" |
…utils.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Essential Elements of an Effective PR Description Checklist
Purpose
Test Plan
Test Result