Conversation

@yiliu30 (Contributor) commented on Nov 12, 2025

User description

Signed-off-by: yiliu30 <yi4.liu@intel.com>


PR Type

Enhancement


Description

  • Added DS/QWEN quantization examples

  • Included quantization scripts for different schemes

  • Added generation script using vLLM


Diagram Walkthrough

```mermaid
flowchart LR
  A["Add quantize.py"] -- "DS/QWEN quantization" --> B["Add generate.py"]
  B -- "vLLM integration" --> C["Add run scripts"]
```

File Walkthrough

Relevant files

Enhancement (2 files)

| File | Description | Changes |
|------|-------------|---------|
| quantize.py | Added quantization script for DS/QWEN | +149/-0 |
| generate.py | Added generation script using vLLM | +73/-0 |

Additional files (5 files)

| File | Changes |
|------|---------|
| README.md | +51/-0 |
| run_eval.sh | +105/-0 |
| run_gen.sh | +80/-0 |
| run_generate.sh | +115/-0 |
| run_quant.sh | +36/-0 |

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 changed the title from Add DS/QWEN Ex.am.ples to Add DS/QWEN Examples on Nov 12, 2025
@PRAgent4INC (Collaborator)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

Trust Remote Code:
The use of trust_remote_code=True when loading models and tokenizers can expose the system to security risks if the source is not trusted. Ensure that the models and tokenizers are from a reliable source.

⚡ Recommended focus areas for review

Hardcoded Iterations

The iters parameter is hardcoded to 0 for most configurations in topologies_config. It should be configurable via a command-line argument or a similar mechanism to allow flexibility; a sketch of one approach follows the excerpt below.

    "scheme": "MXFP8",
    "fp_layers": "lm_head",
    "iters": 0,
},
"ds_mxfp4": {
    "scheme": "MXFP4",
    "fp_layers": "lm_head,self_attn",
    "iters": 0,
},
"qwen_mxfp8": {
    "scheme": "MXFP8",
    "fp_layers": "lm_head,mlp.gate",
    "iters": 0,
},
"qwen_mxfp4": {
    "scheme": "MXFP4",
    "fp_layers": "lm_head,mlp.gate,self_attn",
    "iters": 0,  # TODO: set to 200 before merge
Unused Argument

The --skip_attn argument is defined but not used in the provided code. It should either be utilized or removed to avoid confusion.

    help="Skip quantize attention layers.",
)
parser.add_argument(
Trust Remote Code

The trust_remote_code=True flag is used when loading models and tokenizers. This can pose a security risk if the source of the model or tokenizer is not trusted. Consider adding a warning or making this configurable.

```python
fp32_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
```
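A minimal sketch of making the flag an explicit opt-in instead of an unconditional True; the --trust-remote-code argument is hypothetical, and parser and model_name are assumed to exist in the surrounding script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical opt-in: remote code only runs when the user asks for it.
parser.add_argument("--trust-remote-code", action="store_true",
                    help="Allow executing code shipped with the model repository; "
                         "enable only for trusted sources.")
args = parser.parse_args()

fp32_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu",
    trust_remote_code=args.trust_remote_code,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=args.trust_remote_code,
)
```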

@PRAgent4INC (Collaborator)

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Remove unused argument

This argument is not used in the current code. Either remove it or implement its
functionality.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/quantize.py [130-133]

```diff
-    "--skip_attn",
-    action="store_true",
-    help="Skip quantize attention layers.",
```
Suggestion importance[1-10]: 8

Why: The suggestion identifies an unused argument and proposes either removal or implementation. This is important for maintaining clean and functional code.

Impact: Medium
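If the flag is kept rather than removed, a minimal sketch of one way to honor it, assuming fp_layers is the comma-separated string from topologies_config; the merge logic itself is hypothetical:

```python
# Hypothetical wiring for --skip_attn: keep attention layers in full
# precision by folding "self_attn" into the fp_layers list.
config = topologies_config[args.topology]
if args.skip_attn and "self_attn" not in config["fp_layers"].split(","):
    config["fp_layers"] += ",self_attn"
```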
Remove redundant argument

The iters argument is already defined in topologies_config. Consider removing this
argument to avoid redundancy.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/quantize.py [135-139]

```diff
-    "--iters",
-    type=int,
-    default=0,
-    help="Number of iterations for quantization.",
```
Suggestion importance[1-10]: 8

Why: The suggestion correctly identifies redundancy in the iters argument and proposes its removal, which avoids confusion and keeps the configuration in one place.

Impact: Medium
Set iters value directly

Remove the TODO comment and set the value directly if it's confirmed.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/quantize.py [30]

-"iters": 0,  # TODO: set to 200 before merge
+"iters": 200,
Suggestion importance[1-10]: 7

Why: The suggestion resolves the TODO by setting the value directly, but the value of 200 should be confirmed with the author before merging.

Impact: Medium
Use logging for messages

Consider logging the message instead of printing it for better control over output
verbosity.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/generate.py [5-8]

```diff
 try:
     from auto_round_extension.vllm_ext import apply as apply_auto_round_extension
     apply_auto_round_extension()
 except ImportError:
-    print("auto_round_extension.vllm_ext not found, proceeding without auto-round extension.")
+    import logging
+    logging.warning("auto_round_extension.vllm_ext not found, proceeding without auto-round extension.")
```
Suggestion importance[1-10]: 7

Why: Logging provides better control over output verbosity and is generally preferred over print statements in production code.

Impact: Medium
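The same idea with a named module logger rather than the root logger; the basicConfig call is an assumption about how the script would configure logging, not code from this PR:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    from auto_round_extension.vllm_ext import apply as apply_auto_round_extension
    apply_auto_round_extension()
except ImportError:
    logger.warning("auto_round_extension.vllm_ext not found, proceeding without auto-round extension.")
```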
Externalize prompts

Externalize the prompts to a configuration file or environment variable for
flexibility.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/generate.py [54-58]

```diff
-prompts = [
-    "Hello, my name is",
-    "The president of the United States is",
-    "The capital of France is",
-    "The future of AI is",
-]
+import json
+import os
+
+prompts = json.loads(os.getenv(
+    "PROMPTS",
+    '["Hello, my name is", "The president of the United States is", '
+    '"The capital of France is", "The future of AI is"]',
+))
```
Suggestion importance[1-10]: 6

Why: Externalizing prompts can make the script more flexible and easier to configure without modifying the code.

Impact: Low
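With this change the prompts can be supplied at launch without editing the script, e.g. `PROMPTS='["Prompt one", "Prompt two"]' python generate.py` (the prompt strings here are illustrative). Note that json.loads is used above instead of eval, so a malicious value in the environment variable cannot execute arbitrary code.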
Use JSON for output

Format the output using a more structured approach, such as JSON, for easier
parsing.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/generate.py [60-66]

```diff
+import json
 outputs = llm.generate(prompts, sampling_params)
 # Print the outputs.
-print("-" * 50)
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
-    print("-" * 50)
+print(json.dumps([{"prompt": output.prompt, "generated_text": output.outputs[0].text} for output in outputs], indent=4))
```
Suggestion importance[1-10]: 6

Why: Using JSON for output makes it easier to parse and integrate with other systems, but it may not be necessary for simple scripts.

Impact: Low
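A sketch of a downstream consumer, under the assumption that generate.py then writes only the JSON array to stdout; the consumer script itself is hypothetical:

```python
import json
import subprocess

# Run the generation script and parse its structured output.
result = subprocess.run(
    ["python", "generate.py"],
    capture_output=True, text=True, check=True,
)
for record in json.loads(result.stdout):
    print(record["prompt"], "->", record["generated_text"][:40])
```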
Eliminate duplicate assignments

Remove duplicate assignments to model_path.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/run_eval.sh [6-9]

```diff
-model_path="/storage/yiliu7/quantized_model_ds_mxfp4"
-model_path="/storage/yiliu7/quantized_model_ds_mxfp4"
-model_path="/storage/yiliu7/quantized_model_qwen_mxfp4"
 model_path="/storage/yiliu7/quantized_model_qwen_mxfp8"
```
Suggestion importance[1-10]: 6

Why: The suggestion correctly identifies and removes duplicate assignments to model_path, improving code clarity and maintainability.

Impact: Low
