Conversation

@yiliu30
Owner

@yiliu30 yiliu30 commented Sep 24, 2025

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
* add eval code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine eval code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add more note

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine eval cmd

Signed-off-by: yiliu30 <yi4.liu@intel.com>

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Oct 9, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an 'auto-round' quantization extension. However, the implementation appears to be at a very early, experimental stage. It includes multiple hardcoded user-specific paths in example scripts, which makes them non-portable. The core logic is riddled with FIXME and TODO comments and contains debugging code such as breakpoint() calls, indicating that the feature is incomplete and not ready for production. There are also critical issues noted, such as broken bf16 support and incomplete tensor-parallelism handling. The integration via sitecustomize.py is also unconventional for a library. For these reasons, the PR is not ready for merging.


expert_data = param.data if full_load else param.data[expert_id]

# FIXME: @yiliu30 handle TP

critical

This FIXME indicates that tensor parallelism (TP) is not correctly handled for bias loading. This is a critical issue that can lead to incorrect behavior or crashes in a distributed environment. This must be resolved before merging.

Comment on lines 472 to 497
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0

    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()

    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()

    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()

critical

This file contains multiple breakpoint() calls within the process_weights_after_loading method (lines 479, 483, 487, 490, 493, 496). These are debugging statements and must be removed from production code as they will halt execution if triggered.

def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme

critical

The FIXME comment explicitly states that this change breaks the bf16 path. This is a critical regression and must be fixed before this code can be merged.

Comment on lines +21 to +26
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)

high

This example file has been modified to use hardcoded, user-specific model paths, and the original model facebook/opt-125m is commented out. This breaks the example for other users and makes it non-portable. Please revert these changes to use a standard, publicly accessible model.
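
One portable pattern for such an example (a minimal sketch; the argparse wiring and the facebook/opt-125m default are illustrative, not taken from this PR) is to accept the model path as a command-line argument and fall back to a small public model:

import argparse

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
# Default to a small public model so the example runs anywhere;
# users can point --model at a local quantized checkpoint instead.
parser.add_argument("--model", default="facebook/opt-125m")
args = parser.parse_args()

llm = LLM(model=args.model)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)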

Comment on lines 13 to 21
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/

high

This script contains multiple hardcoded, user-specific paths (e.g., /home/yliu7/..., /data5/yliu7/...). This makes the script non-portable and unusable for other developers. Example scripts should use relative paths, command-line arguments, or environment variables for paths.

Comment on lines +4 to +12
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8

high

This script contains multiple hardcoded, user-specific paths (e.g., /home/yliu7/..., /data5/yliu7/...). This makes the script non-portable and unusable for other developers. Example scripts should use relative paths, command-line arguments, or environment variables for paths.

{"auto_round:llm_compressor"})

def get_quant_method(self, layer: torch.nn.Module, prefix: str):
    # FIXME: (Yi) parse the per-layer quant scheme

high

This file is littered with FIXME and TODO comments (e.g., lines 25, 305, 354, 434, 517, 536, 623), indicating that the implementation is incomplete and not ready for production. These should be addressed to ensure the feature is fully functional and robust before merging.

Comment on lines +11 to +25
if VLLM_ENABLE_AR_EXT:
    print(
        "*****************************************************************************"
    )
    print(
        f"* !!! VLLM_ENABLE_AR_EXT is set to {VLLM_ENABLE_AR_EXT}, applying auto_round_vllm_extension *"
    )
    print(
        "*****************************************************************************"
    )
    from vllm.model_executor.layers.quantization import (
        auto_round_vllm_extension as auto_round_ext,
    )

    auto_round_ext.apply()

high

Using sitecustomize.py to monkey-patch the library at runtime is an unconventional and potentially fragile approach. It can have unintended side effects on the Python environment for any user who has this directory in their PYTHONPATH. Additionally, printing a large banner to stdout is not ideal for a library. A more explicit and cleaner integration mechanism, such as an initialization function call in the user's code, should be considered.
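
A more explicit alternative (a sketch only: auto_round_ext.apply() comes from this PR, but the wrapper function and logging shown here are hypothetical) is to gate the patch behind an opt-in call that the user invokes from their own code:

import os

from vllm.logger import init_logger

logger = init_logger(__name__)


def maybe_enable_auto_round_extension() -> None:
    """Apply the auto-round extension only when explicitly requested."""
    if os.environ.get("VLLM_ENABLE_AR_EXT", "0") != "1":
        return
    from vllm.model_executor.layers.quantization import (
        auto_round_vllm_extension as auto_round_ext,
    )

    logger.info("VLLM_ENABLE_AR_EXT=1: applying auto_round_vllm_extension")
    auto_round_ext.apply()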

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Oct 9, 2025

/gemini review

@yiliu30 yiliu30 marked this pull request as ready for review October 9, 2025 10:47

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an "auto-round" quantization extension, which is a significant new feature. However, the implementation appears to be in a very early, work-in-progress state. There are several critical issues, including hardcoded user-specific paths in example scripts, multiple FIXME comments indicating incomplete or incorrect logic (especially regarding Tensor Parallelism and bf16 support), and leftover debugging code such as breakpoint() calls. These issues must be addressed before this PR can be considered for merging. The example scripts should be cleaned up to be generally usable, and all debugging artifacts and incomplete logic should be resolved.

Comment on lines +21 to +26
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)

critical

This example file contains hardcoded, user-specific model paths and commented-out code. Example files should be clean and not contain personal development paths. Please remove these and use a public model or make the model path configurable via command-line arguments.

Suggested change
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
llm = LLM(model="facebook/opt-125m")

Comment on lines 1 to 22
# curl http://127.0.0.1:8088/metrics

export no_proxy="localhost, 127.0.0.1, ::1"
task_name=gsm8k
batch_size=16
# LIMIT=128
timestamp=$(date +%Y%m%d_%H%M%S)
EVAL_LOG_NAME="eval_${task_name}_${timestamp}"
max_length=8192
max_gen_toks=6144

mkdir -p benchmark_logs
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
port=8088

critical

This script contains numerous hardcoded, user-specific paths (e.g., in model_path, include_path) and a large amount of commented-out code. This appears to be a personal testing script and is not suitable as a general-purpose example. Please remove this file or clean it up significantly to make it a generic, usable example for other users.

Comment on lines 1 to 18
export VLLM_LOGGING_LEVEL=DEBUG
timestamp=$(date +%Y%m%d-%H%M%S)
log_file=server.$timestamp.log
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
tp_size=4
ep_size=2

PYTHONPATH=/home/yliu7/workspace/inc/3rd-party/vllm/vllm/model_executor/layers/quantization/auto_round_vllm_extension/:$PYTHONPATH \
VLLM_ENABLE_AR_EXT=1 \
VLLM_USE_STATIC_MOE_HPU=1 \

critical

Similar to start_client.sh, this script is filled with hardcoded, user-specific paths and commented-out experimental flags. It should be removed or generalized to serve as a proper example for other users.


expert_data = param.data if full_load else param.data[expert_id]

# FIXME: @yiliu30 handle TP

critical

The FIXME comment explicitly states that Tensor Parallelism (TP) is not handled for bias loading. This is a critical correctness issue that will likely cause incorrect behavior or crashes when tp_size > 1. The logic must be updated to correctly handle TP sharding for MoE biases.

Comment on lines +181 to +206
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0

    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()

    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()

    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()

critical

The process_weights_after_loading method contains multiple breakpoint() calls, which are debugging artifacts. These must be removed before merging. The checks for NaN or zero tensors could be converted to warnings if they are valuable for users, but breakpoints are not acceptable in production code.

Suggested change
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0
    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()
    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()
    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    pass
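
If the zero/NaN diagnostics are worth keeping, a minimal sketch of the warning-based variant (attribute names follow the PR; note that the original check_nan actually tests for an all-zero sum rather than NaN) could look like:

def _warn_if_suspicious(name: str, tensor: torch.Tensor) -> None:
    # Log and continue instead of dropping into the debugger.
    t = tensor.float()
    if torch.isnan(t).any():
        logger.warning("NaN values found in %s", name)
    elif t.abs().sum() == 0:
        logger.warning("%s is all zeros", name)


def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    for name in ("w13_weight", "w2_weight", "w13_bias",
                 "w2_bias", "w13_weight_scale", "w2_weight_scale"):
        _warn_if_suspicious(name, getattr(layer, name))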

def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme

critical

The FIXME comment states that this weight loading path breaks for bf16. This is a critical correctness issue. The implementation must be fixed to correctly support all intended data types, including bfloat16, before this can be merged.

params_dict = dict(self.named_parameters())
loaded_params: set[str] = set()
expert_params_mapping = self.get_expert_mapping()
# breakpoint()

critical

A commented-out breakpoint() call is present in the code. This is a debugging artifact and must be removed.

def from_config(cls, config: dict[str, Any]) -> AutoRoundConfig:
    ar_config = super().from_config(config)

    # FIXME: (Yi) parse the per-layer quant scheme

high

The FIXME comment indicates that per-layer quantization schemes are not parsed. The current implementation uses a single quantization scheme for all layers. This is a significant feature limitation that prevents applying different quantization strategies to different layers, which is often necessary for balancing performance and accuracy. This should be implemented to make the feature more flexible and useful.
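
A minimal sketch of what per-layer scheme selection could look like (the layer_schemes mapping and prefix matching are assumptions for illustration; the real AutoRound config layout may differ):

def _scheme_for_layer(self, prefix: str):
    # layer_schemes would map layer-name prefixes (e.g. "model.layers.0.mlp")
    # to quantization schemes parsed from the checkpoint's quantization config.
    for pattern, scheme in getattr(self, "layer_schemes", {}).items():
        if prefix.startswith(pattern):
            return scheme
    # Fall back to the single global scheme used today.
    return self.default_scheme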

input_quant = None

# FIXME: @yiliu30: temporarily only support MXFP8
if 1 or quant_config._is_mxfp8_w8a8(weight_quant, input_quant):

high

The condition if 1 or ... hardcodes the execution path to the MXFP8 implementation. This is likely a temporary measure for development but should be replaced with the correct logic to dynamically select the quantization implementation based on the configuration, as the FIXME comment also suggests.

Suggested change
if 1 or quant_config._is_mxfp8_w8a8(weight_quant, input_quant):
if quant_config._is_mxfp8_w8a8(weight_quant, input_quant):

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Oct 10, 2025

/gemini review

Signed-off-by: yiliu30 <yi4.liu@intel.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an extension for AutoRound quantization. The changes are extensive, adding the core framework for the extension, quantization method implementations for linear and MoE layers, and corresponding weight loading logic.

While this is a significant feature addition, the current state of the pull request suggests it is a work-in-progress and not ready for merging. My review has identified several critical issues:

  • The example scripts contain hardcoded, user-specific paths, making them non-reproducible.
  • There are debugging artifacts left in the code, such as breakpoint() calls and if 1 or ... conditions, which will cause runtime failures or incorrect behavior.
  • Several FIXME comments point to incomplete or incorrect implementations, particularly concerning Tensor Parallelism and bfloat16 support.
  • Some code paths are explicitly not implemented, which will lead to NotImplementedError exceptions.

These issues must be addressed before this PR can be considered for merging. The quantization implementations also appear to be emulations, which is acceptable for initial integration and correctness verification, but it would be beneficial to clarify the plan for introducing optimized kernels for performance.
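
For context on the emulation point, a minimal quantize-dequantize sketch of an MXFP8-style round trip (block size 32, shared power-of-two scale, FP8 E4M3 elements); this is a generic illustration under those assumptions, not the code from this PR:

import torch

BLOCK = 32              # MX block size along the last dimension
FP8_E4M3_MAX = 448.0    # largest normal value of torch.float8_e4m3fn
E4M3_EMAX = 8           # exponent of the largest E4M3 power of two


def mxfp8_qdq(x: torch.Tensor) -> torch.Tensor:
    """Emulated MXFP8 round trip; assumes x.numel() is a multiple of BLOCK."""
    shape = x.shape
    blocks = x.float().reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2**-126)
    # Shared power-of-two (E8M0-style) scale per block.
    scale = torch.exp2(torch.floor(torch.log2(amax)) - E4M3_EMAX)
    q = (blocks / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return (q.float() * scale).reshape(shape).to(x.dtype)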

Comment on lines +21 to +26
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)

critical

This example script includes hardcoded, user-specific absolute paths for the model, and the model variable is immediately reassigned. This makes the example non-reproducible and confusing for other users. Please use a model identifier from the Hugging Face Hub or a placeholder path with instructions.

Suggested change
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
llm = LLM(model="facebook/opt-125m")

Comment on lines 13 to 21
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/

critical

This example script contains multiple hardcoded, user-specific absolute paths. This makes the script non-portable and not reproducible for other users. Please replace these with a single placeholder and add comments explaining what path is expected.

Suggested change
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
# Set model_path to the path of your model, for example:
# model_path="/path/to/your/model"
model_path="facebook/opt-125m"

Comment on lines +4 to +12
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8

critical

This example script contains multiple hardcoded, user-specific absolute paths. This makes the script non-portable and not reproducible for other users. Please replace these with a single placeholder and add comments explaining what path is expected.

Suggested change
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# Set model_path to the path of your model, for example:
# model_path=/path/to/your/model
model_path="facebook/opt-125m"


if check_nan(layer.w13_weight):
    logger.info("all zeros self.w13_weight")
    breakpoint()

critical

This file contains multiple breakpoint() calls (lines 188, 192, 196, 199, 202, 205), which are debugging artifacts. These must be removed as they will halt program execution.


# impl = AutoRoundMoEMethodMXFP8(quant_config, layer.moe_config)
# return impl
if 1 or quant_config._is_mxfp4_w4a8(weight_quant, input_quant):

critical

The condition if 1 or ... unconditionally forces the MXFP4 implementation path. This is likely a debugging artifact and must be replaced with the correct logic to select the quantization method based on the configuration.

Suggested change
if 1 or quant_config._is_mxfp4_w4a8(weight_quant, input_quant):
if quant_config._is_mxfp4_w4a8(weight_quant, input_quant):

Comment on lines +1623 to +1624
# FIXME: @yiliu30 handle TP
# ==-----------------------------------------------------------------==

high

The FIXME comment indicates that Tensor Parallelism (TP) is not correctly handled for bias loading. The current implementation for w1 and w3 biases does not appear to shard the bias tensor, which will lead to incorrect behavior or errors when tp_size > 1. The bias should be sharded along the intermediate dimension, similar to the weights.
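
A minimal sketch of the sharding this comment asks for (parameter names and the sharding dimension are assumptions; in vLLM the FusedMoE weight loader decides the actual shard offsets):

def _load_sharded_bias(expert_bias: torch.Tensor,
                       loaded_bias: torch.Tensor,
                       tp_rank: int,
                       shard_size: int) -> None:
    # Take this rank's slice of the full bias along the intermediate dimension,
    # mirroring how the corresponding weight shard is selected.
    shard = loaded_bias.narrow(0, tp_rank * shard_size, shard_size)
    expert_bias.copy_(shard)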

Comment on lines +142 to +145
else:
    raise NotImplementedError(
        "process_weights_after_loading is not implemented for now."
    )

high

The process_weights_after_loading function raises a NotImplementedError when VLLM_ENABLE_STATIC_MOE is not enabled. This indicates an incomplete implementation for the dynamic MoE path, which will cause runtime failures.

def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme

high

The FIXME comment indicates that this new weight loading logic (load_weights_ar) is broken for bfloat16 models. This must be resolved to ensure correct behavior for all supported data types.

Signed-off-by: yiliu30 <yi4.liu@intel.com>