Hpu mxfp8 #48
base: habana-mai-79ba403d5
Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for NVFP4 and MXFP8 quantization within vLLM, focusing on both general linear layers and Mixture of Experts (MoE) layers. It includes new quantization schemes, their integration with HPU-specific optimizations, and provides emulation capabilities for NVFP4 to ensure broader compatibility. The changes aim to improve performance and memory efficiency for models utilizing these advanced quantization formats.
Highlights
- New Quantization Schemes: Introduced support for W4A4 NVFP4 (NVIDIA FP4) and W8A8 MXFP8 (Microscaling FP8) quantization schemes, significantly expanding vLLM's capability to handle various low-precision models.
- HPU Integration and Optimization: Enhanced vLLM's compatibility with Habana Processing Units (HPUs) by integrating the new quantization schemes, adding HPU-specific environment variables (VLLM_USE_STATIC_MOE_HPU, VLLM_HPU_FORCE_CHANNEL_FP8), and providing an example HPU inference script.
- NVFP4 Emulation Support: Implemented reference NVFP4 quantization and dequantization utilities, enabling these schemes to run via software emulation even on hardware platforms that do not natively support Cutlass NVFP4 operations (a fake-quantization sketch follows this list).
- Mixture of Experts (MoE) Quantization: Extended MoE layers to support the new CompressedTensorsW4A4MoeMethod, incorporating both Marlin kernel integration and a static HPU emulation path for MoE operations.
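To make the emulation idea concrete, below is a minimal fake-quantization sketch in PyTorch. It is illustrative only and not taken from this PR: the block sizes (16 for NVFP4, 32 for MXFP8), the E2M1/E4M3 element grids, and the power-of-two shared scale follow the public format descriptions, while global scales, packed storage, and HPU kernels are omitted. It assumes a recent PyTorch build that provides torch.float8_e4m3fn.

```python
# Minimal quantize->dequantize emulation sketch (assumptions noted above).
import torch

# Representable magnitudes of FP4 E2M1, the element type used by NVFP4.
_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_nvfp4(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Round-trip `x` through blockwise FP4 with a shared per-block scale."""
    blocks = x.float().reshape(-1, block_size)
    # Per-block scale so the block maximum maps onto the largest FP4 value (6.0).
    scale = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0
    scaled = blocks / scale
    # Snap every element to the nearest representable E2M1 magnitude.
    grid = _E2M1_GRID.to(blocks.device)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return (grid[idx] * scaled.sign() * scale).reshape(x.shape).to(x.dtype)


def fake_quant_mxfp8(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Round-trip `x` through blockwise FP8 E4M3 with a power-of-two scale."""
    blocks = x.float().reshape(-1, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # E8M0-style shared scale: a power of two chosen so the block fits the
    # E4M3 range (448 is the E4M3 maximum); out-of-range values saturate.
    scale = torch.exp2(torch.floor(torch.log2(amax)) -
                       torch.floor(torch.log2(torch.tensor(448.0))))
    q = (blocks / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).float()
    return (q * scale).reshape(x.shape).to(x.dtype)


# Example: round-trip a random weight tile and inspect the worst-case error.
w = torch.randn(128, 256)
print((w - fake_quant_nvfp4(w)).abs().max(), (w - fake_quant_mxfp8(w)).abs().max())
```

Running the round trip on a random tile like this gives a quick, hardware-independent sense of the error each format introduces, which is essentially what an emulation path provides on devices without native kernels.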
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces support for new quantization formats, including MXFP8 and NVFP4, with a focus on HPU devices. The changes are extensive, adding new quantization schemes, MoE methods, and emulation utilities. A new example script for HPU inference is also included.
My review has identified a critical bug in an __init__.py file due to a missing comma, which could lead to import errors. I've also found some dead code, leftover debugging artifacts, and areas for cleanup, particularly in the new example script. Addressing these points will improve the quality and maintainability of the code.
| "CompressedTensorsW4A4Fp4" | ||
| "CompressedTensorsW8A8MXFp8" |
There is a missing comma between the two string literals in the __all__ list. This will cause Python to concatenate them into a single string: 'CompressedTensorsW4A4Fp4CompressedTensorsW8A8MXFp8'. This is a bug and will prevent CompressedTensorsW8A8MXFp8 from being correctly exported, potentially causing ImportError downstream.
| "CompressedTensorsW4A4Fp4" | |
| "CompressedTensorsW8A8MXFp8" | |
| "CompressedTensorsW4A4Fp4", | |
| "CompressedTensorsW8A8MXFp8" |
| model_path = "/models/Qwen3-32B" | ||
| model_path = "/models/DeepSeek-R1-Distill-Qwen-7B" | ||
| model_path= "/mnt/disk3/yiliu4/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic" | ||
| model_path = "/software/users/yiliu4/HF_HOME/RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor" | ||
| model_path = "/mnt/disk3/yiliu4/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor/" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.3-70B-Instruct-NVFP4-llmc" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Yi30/Llama-3.2-1B-Instruct-MXFP8-llmc" | ||
| model_path = "/software/users/yiliu4/HF_HOME/Yi30/Yi30/Llama-3.3-70B-Instruct-MXFP8-llmc" |
This example script contains multiple assignments to model_path with hardcoded, user-specific paths. This is confusing and reduces the script's reusability. It's better to have a single, well-defined default model path (e.g., a public model from Hugging Face Hub) or a clear placeholder, and let the user override it with the --model_path argument.
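A minimal sketch of that pattern (the default model ID below is a hypothetical placeholder; the --model_path and --tp flag names mirror the ones already used by this example script):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="HPU inference example")
    # One resolvable public default; local checkpoints are passed on the
    # command line instead of being hardcoded in the script.
    parser.add_argument(
        "--model_path",
        default="meta-llama/Llama-3.2-1B-Instruct",  # hypothetical placeholder
        help="Hugging Face model ID or local path to the (quantized) model.",
    )
    parser.add_argument("--tp", type=int, default=1, help="Tensor parallel size.")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Loading {args.model_path} with tensor_parallel_size={args.tp}")
```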
    # INFO 06-26 17:17:55 [llm_engine.py:439] init engine (profile, create kv cache, warmup model) took 53.52 seconds
    # Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 197.22it/s]
    # Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54973) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54971) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54975) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54969) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54972) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54970) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54974) WARNING 06-26 17:17:55 [hpu_model_runner.py:1230] Configuration: ('prompt', 4, 128, 0) was not warmed-up!
    # (VllmWorkerProcess pid=54974) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54969) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54971) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54973) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54970) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54975) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # (VllmWorkerProcess pid=54972) WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # WARNING 06-26 17:18:01 [hpu_model_runner.py:1230] Configuration: ('decode', 4, 1, 128) was not warmed-up!
    # Processed prompts: 100%|██████████████████████████████████████████████| 4/4 [09:06<00:00, 136.71s/it, est. speed input: 0.05 toks/s, output: 0.12 toks/s]
    # Generated Outputs:
    # ------------------------------------------------------------
    # Prompt: 'Hello, my name is'
    # Output: ' Tony, I am a Software Engineer at Google. I have been working on the'
    # ------------------------------------------------------------
    # Prompt: 'The president of the United States is'
    # Output: ' the head of state and head of government of the United States. The president is'
    # ------------------------------------------------------------
    # Prompt: 'The capital of France is'
    # Output: ' always a good idea, but there are many other destinations that are worth visiting as'
    # ------------------------------------------------------------
    # Prompt: 'The future of AI is'
    # Output: " here, and it's more powerful than ever. But with great power comes great"
    # ------------------------------------------------------------
    # Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1055.37it/s]
    # Processed prompts: 100%|██████████████████████████████████████████████| 4/4 [09:35<00:00, 143.78s/it, est. speed input: 0.05 toks/s, output: 0.11 toks/s]
    # Time taken for second inference: 575.12 seconds
    # INFO 06-26 17:36:37 [multiproc_worker_utils.py:139] Terminating local vLLM worker processes
    # (VllmWorkerProcess pid=54969) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54973) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54971) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54970) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54972) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54975) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # (VllmWorkerProcess pid=54974) INFO 06-26 17:36:37 [multiproc_worker_utils.py:261] Worker exiting
    # <p22> yiliu4@yiliu4-63gd-g3-l-vm:basic$ /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
    # warnings.warn('resource_tracker: There appear to be %d '
    # <p22> yiliu4@yiliu4-63gd-g3-l-vm:basic$ p basic_hpu.py --tp 8
    return self.fp8_linear.apply(input=x,
                                 weight=layer.weight,
                                 weight_scale=layer.weight_scale,
                                 out_dtype=self.out_dtype,
                                 input_scale=layer.input_scale,
                                 bias=bias)
    is_per_tensor_group_weight = (weight_quant.strategy in [
        QuantizationStrategy.TENSOR_GROUP
    ])
    self, weight_quant: BaseModel,
    input_quant: BaseModel) -> "CompressedTensorsScheme":
    # breakpoint()
    raise RuntimeError(
        f"Unsupported FusedMoe scheme: {weight_quant}, {input_quant}")

    import vllm.envs as envs
    # w_state_dict = [
    #     "w13_weight_packed:torch.Size([64, 2816, 1024])",
    #     "w13_weight_scale:torch.Size([64, 2816, 128])",
    #     "w13_weight_global_scale:torch.Size([64, 2])",
    #     "w13_input_global_scale:torch.Size([64, 2])",
    #     "w2_weight_packed:torch.Size([64, 2048, 704])",
    #     "w2_weight_scale:torch.Size([64, 2048, 88])",
    #     "w2_weight_global_scale:torch.Size([64])",
    #     "w2_input_global_scale:torch.Size([64])",
    # ]
    # w_list = [
    #     "w13_weight_packed",
    #     "w13_weight_scale",
    #     "w13_weight_global_scale",
    #     "w13_input_global_scale",
    #     "w2_weight_packed",
    #     "w2_weight_scale",
    #     "w2_weight_global_scale",
    #     "w2_input_global_scale",
    # ]
Purpose
Test Plan
Test Result