Conversation

@yiliu30
Owner

@yiliu30 yiliu30 commented Sep 24, 2025

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
* add eval code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine eval code

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* add more note

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* refine eval cmd

Signed-off-by: yiliu30 <yi4.liu@intel.com>

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Oct 9, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an 'auto-round' quantization extension. However, the implementation appears to be at a very early, experimental stage. It includes multiple hardcoded user-specific paths in example scripts, which makes them non-portable. The core logic is riddled with FIXME and TODO comments and contains debugging code such as breakpoint() calls, indicating that the feature is incomplete and not ready for production. There are also critical issues noted, such as broken bf16 support and incomplete tensor-parallelism handling. The integration via sitecustomize.py is also unconventional for a library. For these reasons, the PR is not ready for merging.


expert_data = param.data if full_load else param.data[expert_id]

# FIXME: @yiliu30 handle TP

critical

This FIXME indicates that tensor parallelism (TP) is not correctly handled for bias loading. This is a critical issue that can lead to incorrect behavior or crashes in a distributed environment. This must be resolved before merging.

Comment on lines 472 to 497
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0

    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()

    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()

    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()

critical

This file contains multiple breakpoint() calls within the process_weights_after_loading method (lines 479, 483, 487, 490, 493, 496). These are debugging statements and must be removed from production code as they will halt execution if triggered.

def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme

critical

The FIXME comment explicitly states that this change breaks the bf16 path. This is a critical regression and must be fixed before this code can be merged.

Comment on lines +21 to +26
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)

high

This example file has been modified to use hardcoded, user-specific model paths, and the original model facebook/opt-125m is commented out. This breaks the example for other users and makes it non-portable. Please revert these changes to use a standard, publicly accessible model.
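
One portable pattern for such an example (a minimal sketch; the argparse wiring and the facebook/opt-125m default are illustrative, not taken from this PR) is to accept the model path as a command-line argument and fall back to a small public model:

import argparse

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
# Default to a small public model so the example runs anywhere;
# users can point --model at a local quantized checkpoint instead.
parser.add_argument("--model", default="facebook/opt-125m")
args = parser.parse_args()

llm = LLM(model=args.model)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)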

Comment on lines 13 to 21
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/

high

This script contains multiple hardcoded, user-specific paths (e.g., /home/yliu7/..., /data5/yliu7/...). This makes the script non-portable and unusable for other developers. Example scripts should use relative paths, command-line arguments, or environment variables for paths.

Comment on lines +4 to +12
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8

high

This script contains multiple hardcoded, user-specific paths (e.g., /home/yliu7/..., /data5/yliu7/...). This makes the script non-portable and unusable for other developers. Example scripts should use relative paths, command-line arguments, or environment variables for paths.

{"auto_round:llm_compressor"})

def get_quant_method(self, layer: torch.nn.Module, prefix: str):
    # FIXME: (Yi) parse the per-layer quant scheme

high

This file is littered with FIXME and TODO comments (e.g., lines 25, 305, 354, 434, 517, 536, 623), indicating that the implementation is incomplete and not ready for production. These should be addressed to ensure the feature is fully functional and robust before merging.

Comment on lines +11 to +25
if VLLM_ENABLE_AR_EXT:
    print(
        "*****************************************************************************"
    )
    print(
        f"* !!! VLLM_ENABLE_AR_EXT is set to {VLLM_ENABLE_AR_EXT}, applying auto_round_vllm_extension *"
    )
    print(
        "*****************************************************************************"
    )
    from vllm.model_executor.layers.quantization import (
        auto_round_vllm_extension as auto_round_ext,
    )

    auto_round_ext.apply()

high

Using sitecustomize.py to monkey-patch the library at runtime is an unconventional and potentially fragile approach. It can have unintended side effects on the Python environment for any user who has this directory in their PYTHONPATH. Additionally, printing a large banner to stdout is not ideal for a library. A more explicit and cleaner integration mechanism, such as an initialization function call in the user's code, should be considered.
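
A more explicit alternative (a sketch only: auto_round_ext.apply() comes from this PR, but the wrapper function and logging shown here are hypothetical) is to gate the patch behind an opt-in call that the user invokes from their own code:

import os

from vllm.logger import init_logger

logger = init_logger(__name__)


def maybe_enable_auto_round_extension() -> None:
    """Apply the auto-round extension only when explicitly requested."""
    if os.environ.get("VLLM_ENABLE_AR_EXT", "0") != "1":
        return
    from vllm.model_executor.layers.quantization import (
        auto_round_vllm_extension as auto_round_ext,
    )

    logger.info("VLLM_ENABLE_AR_EXT=1: applying auto_round_vllm_extension")
    auto_round_ext.apply()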

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Oct 9, 2025

/gemini review

@yiliu30 yiliu30 marked this pull request as ready for review October 9, 2025 10:47

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an "auto-round" quantization extension, which is a significant new feature. However, the implementation appears to be in a very early, work-in-progress state. There are several critical issues, including hardcoded user-specific paths in example scripts, multiple FIXME comments indicating incomplete or incorrect logic (especially regarding Tensor Parallelism and bf16 support), and leftover debugging code such as breakpoint() calls. These issues must be addressed before this PR can be considered for merging. The example scripts should be cleaned up to be generally usable, and all debugging artifacts and incomplete logic should be resolved.

Comment on lines +21 to +26
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)

critical

This example file contains hardcoded, user-specific model paths and commented-out code. Example files should be clean and not contain personal development paths. Please remove these and use a public model or make the model path configurable via command-line arguments.

Suggested change
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
llm = LLM(model="facebook/opt-125m")

Comment on lines 1 to 22
# curl http://127.0.0.1:8088/metrics

export no_proxy="localhost, 127.0.0.1, ::1"
task_name=gsm8k
batch_size=16
# LIMIT=128
timestamp=$(date +%Y%m%d_%H%M%S)
EVAL_LOG_NAME="eval_${task_name}_${timestamp}"
max_length=8192
max_gen_toks=6144

mkdir -p benchmark_logs
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
port=8088

critical

This script contains numerous hardcoded, user-specific paths (e.g., in model_path, include_path) and a large amount of commented-out code. This appears to be a personal testing script and is not suitable as a general-purpose example. Please remove this file or clean it up significantly to make it a generic, usable example for other users.

Comment on lines 1 to 18
export VLLM_LOGGING_LEVEL=DEBUG
timestamp=$(date +%Y%m%d-%H%M%S)
log_file=server.$timestamp.log
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
tp_size=4
ep_size=2

PYTHONPATH=/home/yliu7/workspace/inc/3rd-party/vllm/vllm/model_executor/layers/quantization/auto_round_vllm_extension/:$PYTHONPATH \
VLLM_ENABLE_AR_EXT=1 \
VLLM_USE_STATIC_MOE_HPU=1 \

critical

Similar to start_client.sh, this script is filled with hardcoded, user-specific paths and commented-out experimental flags. It should be removed or generalized to serve as a proper example for other users.


expert_data = param.data if full_load else param.data[expert_id]

# FIXME: @yiliu30 handle TP

critical

The FIXME comment explicitly states that Tensor Parallelism (TP) is not handled for bias loading. This is a critical correctness issue that will likely cause incorrect behavior or crashes when tp_size > 1. The logic must be updated to correctly handle TP sharding for MoE biases.

Comment on lines +181 to +206
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0

    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()

    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()

    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()

critical

The process_weights_after_loading method contains multiple breakpoint() calls, which are debugging artifacts. These must be removed before merging. The checks for NaN or zero tensors could be converted to warnings if they are valuable for users, but breakpoints are not acceptable in production code.

Suggested change
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # TODO: @yiliu30 remove it
    def check_nan(tensor):
        return tensor.float().sum() == 0
    if check_nan(layer.w13_weight):
        logger.info("all zeros self.w13_weight")
        breakpoint()
    if check_nan(layer.w2_weight):
        logger.info("NAN IN self.w2_weight")
        breakpoint()
    if check_nan(layer.w13_bias):
        logger.info("NAN IN self.w13_bias")
        breakpoint()
    if check_nan(layer.w2_bias):
        logger.info("NAN IN self.w2_bias")
        breakpoint()
    if check_nan(layer.w13_weight_scale):
        logger.info("NAN IN self.w13_weight_scale")
        breakpoint()
    if check_nan(layer.w2_weight_scale):
        logger.info("NAN IN self.w2_weight_scale")
        breakpoint()
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    pass
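
If the zero/NaN diagnostics are worth keeping, a minimal sketch of the warning-based variant (attribute names follow the PR; note that the original check_nan actually tests for an all-zero sum rather than NaN) could look like:

def _warn_if_suspicious(name: str, tensor: torch.Tensor) -> None:
    # Log and continue instead of dropping into the debugger.
    t = tensor.float()
    if torch.isnan(t).any():
        logger.warning("NaN values found in %s", name)
    elif t.abs().sum() == 0:
        logger.warning("%s is all zeros", name)


def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    for name in ("w13_weight", "w2_weight", "w13_bias",
                 "w2_bias", "w13_weight_scale", "w2_weight_scale"):
        _warn_if_suspicious(name, getattr(layer, name))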

def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme

critical

The FIXME comment states that this weight loading path breaks for bf16. This is a critical correctness issue. The implementation must be fixed to correctly support all intended data types, including bfloat16, before this can be merged.

params_dict = dict(self.named_parameters())
loaded_params: set[str] = set()
expert_params_mapping = self.get_expert_mapping()
# breakpoint()

critical

A commented-out breakpoint() call is present in the code. This is a debugging artifact and must be removed.

def from_config(cls, config: dict[str, Any]) -> AutoRoundConfig:
    ar_config = super().from_config(config)

    # FIXME: (Yi) parse the per-layer quant scheme

high

The FIXME comment indicates that per-layer quantization schemes are not parsed. The current implementation uses a single quantization scheme for all layers. This is a significant feature limitation that prevents applying different quantization strategies to different layers, which is often necessary for balancing performance and accuracy. This should be implemented to make the feature more flexible and useful.
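
A minimal sketch of what per-layer scheme selection could look like (the layer_schemes mapping and prefix matching are assumptions for illustration; the real AutoRound config layout may differ):

def _scheme_for_layer(self, prefix: str):
    # layer_schemes would map layer-name prefixes (e.g. "model.layers.0.mlp")
    # to quantization schemes parsed from the checkpoint's quantization config.
    for pattern, scheme in getattr(self, "layer_schemes", {}).items():
        if prefix.startswith(pattern):
            return scheme
    # Fall back to the single global scheme used today.
    return self.default_scheme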

input_quant = None

# FIXME: @yiliu30: temporarily only support MXFP8
if 1 or quant_config._is_mxfp8_w8a8(weight_quant, input_quant):

high

The condition if 1 or ... hardcodes the execution path to the MXFP8 implementation. This is likely a temporary measure for development but should be replaced with the correct logic to dynamically select the quantization implementation based on the configuration, as the FIXME comment also suggests.

Suggested change
if 1 or quant_config._is_mxfp8_w8a8(weight_quant, input_quant):
if quant_config._is_mxfp8_w8a8(weight_quant, input_quant):

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Oct 10, 2025

/gemini review

Signed-off-by: yiliu30 <yi4.liu@intel.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an extension for AutoRound quantization. The changes are extensive, adding the core framework for the extension, quantization method implementations for linear and MoE layers, and corresponding weight loading logic.

While this is a significant feature addition, the current state of the pull request suggests it is a work-in-progress and not ready for merging. My review has identified several critical issues:

  • The example scripts contain hardcoded, user-specific paths, making them non-reproducible.
  • There are debugging artifacts left in the code, such as breakpoint() calls and if 1 or ... conditions, which will cause runtime failures or incorrect behavior.
  • Several FIXME comments point to incomplete or incorrect implementations, particularly concerning Tensor Parallelism and bfloat16 support.
  • Some code paths are explicitly not implemented, which will lead to NotImplementedError exceptions.

These issues must be addressed before this PR can be considered for merging. The quantization implementations also appear to be emulations, which is acceptable for initial integration and correctness verification, but it would be beneficial to clarify the plan for introducing optimized kernels for performance.
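
For context on the emulation point, a minimal quantize-dequantize sketch of an MXFP8-style round trip (block size 32, shared power-of-two scale, FP8 E4M3 elements); this is a generic illustration under those assumptions, not the code from this PR:

import torch

BLOCK = 32              # MX block size along the last dimension
FP8_E4M3_MAX = 448.0    # largest normal value of torch.float8_e4m3fn
E4M3_EMAX = 8           # exponent of the largest E4M3 power of two


def mxfp8_qdq(x: torch.Tensor) -> torch.Tensor:
    """Emulated MXFP8 round trip; assumes x.numel() is a multiple of BLOCK."""
    shape = x.shape
    blocks = x.float().reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2**-126)
    # Shared power-of-two (E8M0-style) scale per block.
    scale = torch.exp2(torch.floor(torch.log2(amax)) - E4M3_EMAX)
    q = (blocks / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return (q.float() * scale).reshape(shape).to(x.dtype)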

Comment on lines +21 to +26
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)

critical

This example script includes hardcoded, user-specific absolute paths for the model, and the model variable is immediately reassigned. This makes the example non-reproducible and confusing for other users. Please use a model identifier from the Hugging Face Hub or a placeholder path with instructions.

Suggested change
model="/data5/yliu7/HF_HOME/OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc"
model="/home/yliu7/workspace/auto-round/tmp_autoround_llama_mxfp8"
llm = LLM(
    # model="facebook/opt-125m"
    model=model
)
llm = LLM(model="facebook/opt-125m")

Comment on lines 13 to 21
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/

critical

This example script contains multiple hardcoded, user-specific absolute paths. This makes the script non-portable and not reproducible for other users. Please replace these with a single placeholder and add comments explaining what path is expected.

Suggested change
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# model_path=/models/DeepSeek-V2-Lite-Chat/
# Set model_path to the path of your model, for example:
# model_path="/path/to/your/model"
model_path="facebook/opt-125m"

Comment on lines +4 to +12
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8

critical

This example script contains multiple hardcoded, user-specific absolute paths. This makes the script non-portable and not reproducible for other users. Please replace these with a single placeholder and add comments explaining what path is expected.

Suggested change
model_path=/home/yiliu7/models/deepseek-ai/DeepSeek-R1
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
# model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
# Set model_path to the path of your model, for example:
# model_path=/path/to/your/model
model_path="facebook/opt-125m"


if check_nan(layer.w13_weight):
    logger.info("all zeros self.w13_weight")
    breakpoint()

critical

This file contains multiple breakpoint() calls (lines 188, 192, 196, 199, 202, 205), which are debugging artifacts. These must be removed as they will halt program execution.


# impl = AutoRoundMoEMethodMXFP8(quant_config, layer.moe_config)
# return impl
if 1 or quant_config._is_mxfp4_w4a8(weight_quant, input_quant):

critical

The condition if 1 or ... unconditionally forces the MXFP4 implementation path. This is likely a debugging artifact and must be replaced with the correct logic to select the quantization method based on the configuration.

Suggested change
if 1 or quant_config._is_mxfp4_w4a8(weight_quant, input_quant):
if quant_config._is_mxfp4_w4a8(weight_quant, input_quant):

Comment on lines +1623 to +1624
# FIXME: @yiliu30 handle TP
# ==-----------------------------------------------------------------==

high

The FIXME comment indicates that Tensor Parallelism (TP) is not correctly handled for bias loading. The current implementation for w1 and w3 biases does not appear to shard the bias tensor, which will lead to incorrect behavior or errors when tp_size > 1. The bias should be sharded along the intermediate dimension, similar to the weights.
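
A minimal sketch of the sharding this comment asks for (parameter names and the sharding dimension are assumptions; in vLLM the FusedMoE weight loader decides the actual shard offsets):

def _load_sharded_bias(expert_bias: torch.Tensor,
                       loaded_bias: torch.Tensor,
                       tp_rank: int,
                       shard_size: int) -> None:
    # Take this rank's slice of the full bias along the intermediate dimension,
    # mirroring how the corresponding weight shard is selected.
    shard = loaded_bias.narrow(0, tp_rank * shard_size, shard_size)
    expert_bias.copy_(shard)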

Comment on lines +142 to +145
else:
    raise NotImplementedError(
        "process_weights_after_loading is not implemented for now."
    )

high

The process_weights_after_loading function raises a NotImplementedError when VLLM_ENABLE_STATIC_MOE is not enabled. This indicates an incomplete implementation for the dynamic MoE path, which will cause runtime failures.

def load_weights_ar(
    self, weights: Iterable[tuple[str, torch.Tensor]]
) -> set[str]:
    # FIXME: @yiliu30: this break the bf16 path, fixme

high

The FIXME comment indicates that this new weight loading logic (load_weights_ar) is broken for bfloat16 models. This must be resolved to ensure correct behavior for all supported data types.

Signed-off-by: yiliu30 <yi4.liu@intel.com>