
Conversation

@xin3he (Contributor) commented Dec 23, 2025

PR Type

Enhancement


Description

  • Added static_kv_dtype argument for FP8 quantization

  • Updated run_benchmark.sh to support FP8 KV cache

  • Modified run_quant.sh to include static_kv_dtype

  • Updated README with instructions for enabling FP8 KV cache
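
The changes above can be sketched end to end in shell. The `is_valid_kv_dtype` helper below is hypothetical, added only to mirror the argparse `choices` for `--static_kv_dtype`; the `run_quant.sh`/`run_benchmark.sh` invocations are illustrative, not the scripts' exact interfaces.

```shell
#!/bin/bash
# Sketch of the quantize-then-benchmark flow with an FP8 KV cache.

# Hypothetical helper mirroring the argparse choices for --static_kv_dtype.
is_valid_kv_dtype() {
    case "$1" in
        fp8|float8_e4m3fn) return 0 ;;
        *) return 1 ;;
    esac
}

KV_CACHE_DTYPE="fp8"
if is_valid_kv_dtype "$KV_CACHE_DTYPE"; then
    # Illustrative invocations; the -kv option comes from this PR:
    # bash run_quant.sh -kv "$KV_CACHE_DTYPE"
    # bash run_benchmark.sh -kv "$KV_CACHE_DTYPE"
    echo "quantizing with static_kv_dtype=$KV_CACHE_DTYPE"
fi
```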


Diagram Walkthrough

flowchart LR
  A["Add static_kv_dtype argument"] -- "to quantize.py" --> B["Update run_benchmark.sh"]
  B -- "Support FP8 KV cache" --> C["Modify run_quant.sh"]
  C -- "Include static_kv_dtype" --> D["Update README"]
  D -- "Instructions for FP8 KV cache" --> E["PR Complete"]

File Walkthrough

Relevant files
Enhancement
quantize.py
Add static_kv_dtype argument                                                         

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/quantize.py

  • Added static_kv_dtype argument to parser
  • Passed static_kv_dtype to load_recipe_results
+8/-0     
run_benchmark.sh
Update run_benchmark.sh for FP8 KV cache                                 

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh

  • Added -kv option to handle KV cache dtype
  • Configured environment variables for FP8 KV cache
  • Updated lm_eval command to include kv_cache_dtype
+14/-2   
run_quant.sh
Modify run_quant.sh for FP8 KV cache                                         

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_quant.sh

  • Added -kv option to handle KV cache dtype
  • Conditionally added static_kv_dtype to COMMON_ARGS
+10/-1   
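
The `-kv` parsing and the conditional `COMMON_ARGS` append described above can be sketched as follows. This is a minimal sketch under assumptions: the function name and the `--model llama3` placeholder are invented for illustration; only `-kv`, `KV_CACHE_DTYPE`, and `--static_kv_dtype` come from the PR.

```shell
#!/bin/bash
# Sketch: parse a -kv option and forward it as --static_kv_dtype
# only when an explicit dtype was requested.

build_common_args() {
    local kv_dtype="auto"
    local common_args="--model llama3"   # placeholder for the real args
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -kv) kv_dtype="$2"; shift 2 ;;
            *)   shift ;;
        esac
    done
    # Leave COMMON_ARGS untouched when the dtype is the "auto" default.
    if [[ "$kv_dtype" != "auto" ]]; then
        common_args="$common_args --static_kv_dtype $kv_dtype"
    fi
    echo "$common_args"
}

build_common_args -kv fp8   # prints: --model llama3 --static_kv_dtype fp8
```

Keeping the append conditional preserves the existing behavior for users who never pass `-kv`.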
Documentation
README.md
Update README with FP8 KV cache instructions                         

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md

  • Added note about enabling FP8 KV cache
+2/-0     

Signed-off-by: He, Xin3 <xin3.he@intel.com>
@PRAgent4INC (Collaborator)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Default Value

The default value for static_kv_dtype is None. Confirm this is intentional and that omitting the argument leaves existing behavior unchanged.

parser.add_argument(
    "--static_kv_dtype",
    default=None,
    type=str,
    choices=["fp8", "float8_e4m3fn"],
    help="Data type for statically quantizing key and value.",
)
Environment Variables

The script sets environment variables conditionally based on KV_CACHE_DTYPE. Verify that these settings are correct and do not conflict with other configurations or system settings.

if [[ "$KV_CACHE_DTYPE" == "fp8" ]]; then
    export VLLM_FLASHINFER_DISABLE_Q_QUANTIZATION=0
    export VLLM_ATTENTION_BACKEND="FLASHINFER"
    echo "Using FP8 for KV cache"
fi

@PRAgent4INC (Collaborator)

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General

Suggestion: Handle KV_CACHE_DTYPE default

Ensure that KV_CACHE_DTYPE is properly handled when it is set to auto. Consider
setting a default value or handling it explicitly.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh [90]

-local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1,max_model_len=8192 --tasks $tasks --batch_size $BATCH_SIZE $extra_args"
+local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1,max_model_len=8192,kv_cache_dtype=${KV_CACHE_DTYPE:-auto} --tasks $tasks --batch_size $BATCH_SIZE $extra_args"
Suggestion importance[1-10]: 5


Why: The suggestion proposes handling the KV_CACHE_DTYPE default value, which is a minor improvement for robustness. However, the existing code already sets KV_CACHE_DTYPE="auto" by default, so this change is not strictly necessary.

Impact: Low
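
The suggested `kv_cache_dtype=${KV_CACHE_DTYPE:-auto}` relies on standard POSIX default-value parameter expansion; a minimal sketch of that mechanism:

```shell
#!/bin/bash
# ${VAR:-default} expands to "default" when VAR is unset or empty,
# otherwise to VAR's value.

unset KV_CACHE_DTYPE
echo "kv_cache_dtype=${KV_CACHE_DTYPE:-auto}"   # prints kv_cache_dtype=auto

KV_CACHE_DTYPE="fp8"
echo "kv_cache_dtype=${KV_CACHE_DTYPE:-auto}"   # prints kv_cache_dtype=fp8
```

This makes the command robust even if the script's own `KV_CACHE_DTYPE="auto"` default were ever removed.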

