add ar ext #75
base: main
Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
* add eval code
* refine eval code
* add more note
* refine eval cmd

Signed-off-by: yiliu30 <yi4.liu@intel.com>
/gemini review
Code Review
This pull request introduces an 'auto-round' quantization extension. However, the implementation appears to be at a very early, experimental stage. It includes multiple hardcoded user-specific paths in example scripts, which makes the scripts non-portable. The core logic is riddled with FIXME and TODO comments and contains debugging code such as breakpoint() calls, indicating that the feature is incomplete and not ready for production. There are also critical issues noted, such as breaking bf16 support and incomplete tensor-parallelism handling. The integration via sitecustomize.py is also unconventional for a library. For these reasons, the PR is not ready for merging.
expert_data = param.data if full_load else param.data[expert_id]

# FIXME: @yiliu30 handle TP
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0

    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()

    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()

    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()
def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
| {"auto_round:llm_compressor"}) | ||
|
|
||
| def get_quant_method(self, layer: torch.nn.Module, prefix: str): | ||
| # FIXME: (Yi) parse the per-layer quant scheme |
if VLLM_ENABLE_AR_EXT:
    print(
        "*****************************************************************************"
    )
    print(
        f"* !!! VLLM_ENABLE_AR_EXT is set to {VLLM_ENABLE_AR_EXT}, applying auto_round_vllm_extension *"
    )
    print(
        "*****************************************************************************"
    )
    from vllm.model_executor.layers.quantization import (
        auto_round_vllm_extension as auto_round_ext,
    )

    auto_round_ext.apply()
Using sitecustomize.py to monkey-patch the library at runtime is an unconventional and potentially fragile approach. It can have unintended side effects on the Python environment for any user who has this directory in their PYTHONPATH. Additionally, printing a large banner to stdout is not ideal for a library. A more explicit and cleaner integration mechanism, such as an initialization function call in the user's code, should be considered.
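As a rough illustration of the suggested alternative, the extension could expose an explicit opt-in call that users invoke themselves. In this sketch only `auto_round_vllm_extension.apply()` and the `VLLM_ENABLE_AR_EXT` variable come from the PR; the wrapper function name and logging choices are assumptions:

```python
import logging
import os

logger = logging.getLogger(__name__)


def enable_auto_round_extension() -> None:
    """Explicitly apply the AutoRound extension instead of relying on a
    sitecustomize.py being picked up via PYTHONPATH."""
    if os.environ.get("VLLM_ENABLE_AR_EXT", "0") != "1":
        return
    from vllm.model_executor.layers.quantization import (
        auto_round_vllm_extension as auto_round_ext,
    )
    # Log once instead of printing a banner to stdout.
    logger.info("VLLM_ENABLE_AR_EXT=1: applying auto_round_vllm_extension")
    auto_round_ext.apply()
```

A user script would then call `enable_auto_round_extension()` once before constructing the engine, which keeps the patching visible in the code that depends on it.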
Signed-off-by: yiliu30 <yi4.liu@intel.com>
/gemini review
Code Review
This pull request introduces an "auto-round" quantization extension, which is a significant new feature. However, the implementation appears to be in a very early, work-in-progress state. There are several critical issues, including hardcoded user-specific paths in example scripts, multiple FIXME comments indicating incomplete or incorrect logic (especially regarding Tensor Parallelism and bf16 support), and leftover debugging code such as breakpoint() calls. These issues must be addressed before this PR can be considered for merging. The example scripts should be cleaned up to be generally usable, and all debugging artifacts and incomplete logic should be resolved.
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
This example file contains hardcoded, user-specific model paths and commented-out code. Example files should be clean and not contain personal development paths. Please remove these and use a public model or make the model path configurable via command-line arguments.
Suggested change:
-model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
-model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
-llm = LLM(
-    # model="facebook/opt-125m"
-    model=model
-)
+llm = LLM(model="facebook/opt-125m")
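If the example still needs to exercise a local AutoRound checkpoint, a hedged variant that takes the model from the command line could look like the sketch below (the argument name and default are illustrative, not part of this PR):

```python
# Illustrative offline-inference example with a configurable model path.
import argparse

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
# Defaults to a small public model; pass --model /path/to/checkpoint to override.
parser.add_argument("--model", default="facebook/opt-125m")
args = parser.parse_args()

llm = LLM(model=args.model)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)
```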
# curl http://127.0.0.1:8088/metrics

export no_proxy="localhost, 127.0.0.1, ::1"
task_name=gsm8k
batch_size=16
# LIMIT=128
timestamp=$(date +%Y%m%d_%H%M%S)
EVAL_LOG_NAME="eval_${task_name}_${timestamp}"
max_length=8192
max_gen_toks=6144

mkdir -p benchmark_logs
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
port=8088
This script contains numerous hardcoded, user-specific paths (e.g., in model_path, include_path) and a large amount of commented-out code. This appears to be a personal testing script and is not suitable as a general-purpose example. Please remove this file or clean it up significantly to make it a generic, usable example for other users.
export VLLM_LOGGING_LEVEL=DEBUG
timestamp=$(date +%Y%m%d-%H%M%S)
log_file=server.$timestamp.log
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
tp_size=4
ep_size=2

PYTHONPATH=/home/yliu7/workspace/inc/3rd-party/vllm/vllm/model_executor/layers/quantization/auto_round_vllm_extension/:$PYTHONPATH \
VLLM_ENABLE_AR_EXT=1 \
VLLM_USE_STATIC_MOE_HPU=1 \
expert_data = param.data if full_load else param.data[expert_id]

# FIXME: @yiliu30 handle TP
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0

    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()

    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()

    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()
The process_weights_after_loading method contains multiple breakpoint() calls, which are debugging artifacts. These must be removed before merging. The checks for NaN or zero tensors could be converted to warnings if they are valuable for users, but breakpoints are not acceptable in production code.
Suggested change:
-def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
-    # TODO: @yiliu30 remove it
-    def check_nan(tensor):
-        return tensor.float().sum() == 0
-    if check_nan(layer.w13_weight):
-        logger.info("all zeros self.w13_weight")
-        breakpoint()
-    if check_nan(layer.w2_weight):
-        logger.info("NAN IN self.w2_weight")
-        breakpoint()
-    if check_nan(layer.w13_bias):
-        logger.info("NAN IN self.w13_bias")
-        breakpoint()
-    if check_nan(layer.w2_bias):
-        logger.info("NAN IN self.w2_bias")
-        breakpoint()
-    if check_nan(layer.w13_weight_scale):
-        logger.info("NAN IN self.w13_weight_scale")
-        breakpoint()
-    if check_nan(layer.w2_weight_scale):
-        logger.info("NAN IN self.w2_weight_scale")
-        breakpoint()
+def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+    pass
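If the sanity checks are worth keeping at all, a minimal sketch of the warning-based variant suggested above is shown below. Note that `check_nan` as written actually detects an all-zeros tensor, not NaN; the helper below and the attribute list (taken from the quoted code) are a sketch, not part of the PR:

```python
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    def _is_all_zero(tensor) -> bool:
        return tensor is not None and tensor.float().abs().sum() == 0

    # Warn instead of dropping into a debugger.
    for name in ("w13_weight", "w2_weight", "w13_bias", "w2_bias",
                 "w13_weight_scale", "w2_weight_scale"):
        if _is_all_zero(getattr(layer, name, None)):
            logger.warning("%s is all zeros after loading", name)
```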
def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme
params_dict = dict(self.named_parameters())
loaded_params: set[str] = set()
expert_params_mapping = self.get_expert_mapping()
# breakpoint()
def from_config(cls, config: dict[str, Any]) -> AutoRoundConfig:
    ar_config = super().from_config(config)

    # FIXME: (Yi) parse the per-layer quant scheme
The FIXME comment indicates that per-layer quantization schemes are not parsed. The current implementation uses a single quantization scheme for all layers. This is a significant feature limitation that prevents applying different quantization strategies to different layers, which is often necessary for balancing performance and accuracy. This should be implemented to make the feature more flexible and useful.
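For illustration only, per-layer resolution might look something like the sketch below; the `layer_schemes` and `default_scheme` keys are hypothetical and do not reflect the PR's actual config schema:

```python
# Hypothetical per-layer scheme lookup keyed on the layer's prefix.
def resolve_layer_scheme(config: dict, prefix: str):
    for pattern, scheme in config.get("layer_schemes", {}).items():
        if pattern in prefix:  # e.g. "mlp.experts" matching an MoE layer
            return scheme
    return config.get("default_scheme")
```

`get_quant_method` could then pick the quantization implementation from the resolved scheme for each layer rather than assuming a single global one.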
input_quant = None

# FIXME: @yiliu30: temporarily only support MXFP8
if 1 or quant_config._is_mxfp8_w8a8(weight_quant, input_quant):
The condition if 1 or ... hardcodes the execution path to the MXFP8 implementation. This is likely a temporary measure for development but should be replaced with the correct logic to dynamically select the quantization implementation based on the configuration, as the FIXME comment also suggests.
Suggested change:
-if 1 or quant_config._is_mxfp8_w8a8(weight_quant, input_quant):
+if quant_config._is_mxfp8_w8a8(weight_quant, input_quant):
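Beyond deleting the `if 1`, the dispatch could also fail loudly when no supported scheme matches; the same pattern applies to the MXFP4 branch flagged further down. A sketch, in which the constructed method class is a placeholder name and only `_is_mxfp8_w8a8` comes from the PR:

```python
if quant_config._is_mxfp8_w8a8(weight_quant, input_quant):
    return AutoRoundMethodMXFP8(quant_config)  # placeholder class name
raise NotImplementedError(
    f"Unsupported AutoRound scheme: weight={weight_quant}, input={input_quant}"
)
```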
Signed-off-by: yiliu30 <yi4.liu@intel.com>
/gemini review
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Code Review
This pull request introduces an extension for AutoRound quantization. The changes are extensive, adding the core framework for the extension, quantization method implementations for linear and MoE layers, and corresponding weight loading logic.
While this is a significant feature addition, the current state of the pull request suggests it is a work in progress and not ready for merging. My review has identified several critical issues:
- The example scripts contain hardcoded, user-specific paths, making them non-reproducible.
- There are debugging artifacts left in the code, such as breakpoint() calls and if 1 or ... conditions, which will cause runtime failures or incorrect behavior.
- Several FIXME comments point to incomplete or incorrect implementations, particularly concerning Tensor Parallelism and bfloat16 support.
- Some code paths are explicitly not implemented, which will lead to NotImplementedError exceptions.

These issues must be addressed before this PR can be considered for merging. The quantization implementations also appear to be emulations, which is acceptable for initial integration and correctness verification, but it would be beneficial to clarify the plan for introducing optimized kernels for performance.
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
This example script includes hardcoded, user-specific absolute paths for the model, and the model variable is immediately reassigned. This makes the example non-reproducible and confusing for other users. Please use a model identifier from the Hugging Face Hub or a placeholder path with instructions.
Suggested change:
-model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
-model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
-llm = LLM(
-    # model="facebook/opt-125m"
-    model=model
-)
+llm = LLM(model="facebook/opt-125m")
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
This example script contains multiple hardcoded, user-specific absolute paths. This makes the script non-portable and not reproducible for other users. Please replace these with a single placeholder and add comments explaining what path is expected.
Suggested change:
-model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
-model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
-model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
-model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
-# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
-# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
-model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
-model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
-# model_path=/models/DeepSeek-V2-Lite-Chat/
+# Set model_path to the path of your model, for example:
+# model_path="/path/to/your/model"
+model_path="facebook/opt-125m"
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
This example script contains multiple hardcoded, user-specific absolute paths. This makes the script non-portable and not reproducible for other users. Please replace these with a single placeholder and add comments explaining what path is expected.
Suggested change:
-model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
-model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
-model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
-# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
-model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
-# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
-# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
-model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
-model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
+# Set model_path to the path of your model, for example:
+# model_path=/path/to/your/model
+model_path="facebook/opt-125m"
if check_nan(layer.w13_weight):
    logger.info("all zeros self.w13_weight")
    breakpoint()
# impl = AutoRoundMoEMethodMXFP8(quant_config, layer.moe_config)
# return impl
if 1 or quant_config._is_mxfp4_w4a8(weight_quant, input_quant):
The condition if 1 or ... unconditionally forces the MXFP4 implementation path. This is likely a debugging artifact and must be replaced with the correct logic to select the quantization method based on the configuration.
Suggested change:
-if 1 or quant_config._is_mxfp4_w4a8(weight_quant, input_quant):
+if quant_config._is_mxfp4_w4a8(weight_quant, input_quant):
# FIXME: @yiliu30 handle TP
# ==-----------------------------------------------------------------==
The FIXME comment indicates that Tensor Parallelism (TP) is not correctly handled for bias loading. The current implementation for w1 and w3 biases does not appear to shard the bias tensor, which will lead to incorrect behavior or errors when tp_size > 1. The bias should be sharded along the intermediate dimension, similar to the weights.
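One plausible shape for the fix, mirroring how the weight shard is selected per tensor-parallel rank (the function and variable names are assumptions, not the PR's loader code):

```python
import torch


def shard_w13_bias(loaded_bias: torch.Tensor, tp_rank: int,
                   tp_size: int) -> torch.Tensor:
    """Return this rank's slice of a w1/w3 bias, sharded on the
    intermediate (output) dimension."""
    shard_size = loaded_bias.shape[-1] // tp_size
    return loaded_bias.narrow(-1, shard_size * tp_rank, shard_size)
```

Which dimension is the intermediate one depends on the bias layout in this loader, so the `narrow` dimension above is itself an assumption.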
else:
    raise NotImplementedError(
        "process_weights_after_loading is not implemented for now."
    )
def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme
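Given the FIXME about breaking the bf16 path, one hedged option is to keep the stock loader as the default and only divert to the AutoRound-specific path for AutoRound checkpoints. In this sketch the `quant_config` attribute, the name check, and the `super()` fallback are all assumptions about the surrounding model class:

```python
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    quant_config = getattr(self, "quant_config", None)
    # Only AutoRound checkpoints take the extension's loading path;
    # plain bf16 models keep using the original loader.
    if quant_config is not None and "auto_round" in quant_config.get_name():
        return self.load_weights_ar(weights)
    return super().load_weights(weights)
```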
Signed-off-by: yiliu30 <yi4.liu@intel.com>