Conversation

@xin3he (Contributor) commented Dec 9, 2025

PR Type

Bug fix, Enhancement, Documentation


Description

  • Added gpu_memory_utilization parameter to prevent OOM (see the sketch after this list)

  • Removed lm_head quantization to support vLLM inference

  • Updated README with notes on quantization and accuracy
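
For context, a minimal sketch of how this knob flows through lm_eval's vLLM backend. The 0.8 value and the MODEL_PATH/tasks/BATCH_SIZE names are taken from the diff quoted in the review below; the trimmed-down argument list is illustrative, not the full command:

    # Cap vLLM's GPU memory pool at 80% of the device so KV-cache and
    # activation growth during evaluation has headroom instead of OOMing.
    lm_eval --model vllm \
        --model_args pretrained="$MODEL_PATH",gpu_memory_utilization=0.8 \
        --tasks $tasks \
        --batch_size $BATCH_SIZE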


Diagram Walkthrough

flowchart LR
  A["Add gpu_memory_utilization"] -- "Prevent OOM" --> B["Update run_benchmark.sh"]
  C["Remove lm_head quantization"] -- "Support vLLM inference" --> D["Update run_quant.sh"]
  E["Add notes on quantization"] -- "Update README.md" --> F["Document changes"]

File Walkthrough

Relevant files

Enhancement: run_benchmark.sh (+2/-2)
Add gpu_memory_utilization parameter
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh

  • Added gpu_memory_utilization=0.8 to model_args

Bug fix: run_quant.sh (+3/-6)
Remove lm_head quantization
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_quant.sh

  • Removed --quant_lm_head from quantization commands

Documentation: README.md (+8/-0)
Update README with quantization notes
examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md

  • Added notes on quantization accuracy and lm_head support

Signed-off-by: He, Xin3 <xin3.he@intel.com>
@xin3he changed the title from "fix OOM issue and lm_head unsupport issue" to "fix llama3 OOM issue and lm_head unsupport issue" on Dec 9, 2025
@PRAgent4INC (Collaborator)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Removing --quant_lm_head changes what gets quantized: any configuration or model that relied on a quantized lm_head will now export it in full precision, so confirm no recipe still depends on that flag.

CMD="python quantize.py --model_name_or_path \"$INPUT_MODEL\" $COMMON_ARGS --dtype MXFP8 --iters 0 --export_path \"$OUTPUT_MODEL\""
echo "Executing command: $CMD"
python quantize.py \
    --model_name_or_path "$INPUT_MODEL" \
    $COMMON_ARGS \
    --dtype MXFP8 \
    --iters 0 \
    --export_path "$OUTPUT_MODEL"
;;
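
If some recipe does still need a quantized lm_head, one possible follow-up is an opt-in guard around the removed flag. This is only a sketch: ENABLE_QUANT_LM_HEAD is a hypothetical variable, not something this PR introduces.

    # Hypothetical opt-in: pass --quant_lm_head only on explicit request,
    # since vLLM cannot load checkpoints with a quantized lm_head.
    QUANT_LM_HEAD_ARG=""
    if [[ "${ENABLE_QUANT_LM_HEAD:-0}" == "1" ]]; then
        QUANT_LM_HEAD_ARG="--quant_lm_head"
    fi
    python quantize.py \
        --model_name_or_path "$INPUT_MODEL" \
        $COMMON_ARGS \
        $QUANT_LM_HEAD_ARG \
        --dtype MXFP8 \
        --iters 0 \
        --export_path "$OUTPUT_MODEL"
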
Hardcoded Value

The gpu_memory_utilization value is hardcoded to 0.8. This might not be suitable for all environments and could lead to suboptimal performance or OOM issues in different setups.

local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=0.8,data_parallel_size=1 --tasks $tasks --batch_size $BATCH_SIZE"
echo "Executing command: $cmd"

lm_eval --model vllm \
    --model_args pretrained="$MODEL_PATH",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=0.8,data_parallel_size=1 \
    --tasks $tasks \
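
A low-risk way to address the hardcoding is to read the value from the environment with 0.8 as the fallback. A sketch only: GPU_MEMORY_UTILIZATION is a hypothetical variable name, not part of this PR.

    # Hypothetical: let callers tune the cap per environment, keeping the
    # PR's 0.8 as the default.
    GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.8}"
    lm_eval --model vllm \
        --model_args pretrained="$MODEL_PATH",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1 \
        --tasks $tasks \
        --batch_size $BATCH_SIZE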

@PRAgent4INC (Collaborator)

PR Code Suggestions ✨

Signed-off-by: He, Xin3 <xin3.he@intel.com>
