
Conversation

@xin3he (Contributor) commented Dec 23, 2025

PR Type

Enhancement


Description

  • Added static_kv_dtype argument for FP8 quantization

  • Updated run_benchmark.sh to support FP8 KV cache

  • Modified run_quant.sh to include static_kv_dtype

  • Updated README with instructions for enabling FP8 KV cache
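
The changes above can be sketched end to end in shell. The `is_valid_kv_dtype` helper below is hypothetical, added only to mirror the argparse `choices` for `--static_kv_dtype`; the `run_quant.sh`/`run_benchmark.sh` invocations are illustrative, not the scripts' exact interfaces.

```shell
#!/bin/bash
# Sketch of the quantize-then-benchmark flow with an FP8 KV cache.

# Hypothetical helper mirroring the argparse choices for --static_kv_dtype.
is_valid_kv_dtype() {
    case "$1" in
        fp8|float8_e4m3fn) return 0 ;;
        *) return 1 ;;
    esac
}

KV_CACHE_DTYPE="fp8"
if is_valid_kv_dtype "$KV_CACHE_DTYPE"; then
    # Illustrative invocations; the -kv option comes from this PR:
    # bash run_quant.sh -kv "$KV_CACHE_DTYPE"
    # bash run_benchmark.sh -kv "$KV_CACHE_DTYPE"
    echo "quantizing with static_kv_dtype=$KV_CACHE_DTYPE"
fi
```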


Diagram Walkthrough

flowchart LR
  A["Add static_kv_dtype argument"] -- "to quantize.py" --> B["Update run_benchmark.sh"]
  B -- "Support FP8 KV cache" --> C["Modify run_quant.sh"]
  C -- "Include static_kv_dtype" --> D["Update README"]
  D -- "Instructions for FP8 KV cache" --> E["PR Complete"]

File Walkthrough

Relevant files
Enhancement
quantize.py
Add static_kv_dtype argument                                                         

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/quantize.py

  • Added static_kv_dtype argument to parser
  • Passed static_kv_dtype to load_recipe_results
+8/-0     
run_benchmark.sh
Update run_benchmark.sh for FP8 KV cache                                 

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh

  • Added -kv option to handle KV cache dtype
  • Configured environment variables for FP8 KV cache
  • Updated lm_eval command to include kv_cache_dtype
+14/-2   
run_quant.sh
Modify run_quant.sh for FP8 KV cache                                         

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_quant.sh

  • Added -kv option to handle KV cache dtype
  • Conditionally added static_kv_dtype to COMMON_ARGS
+10/-1   
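
The `-kv` parsing and the conditional `COMMON_ARGS` append described above can be sketched as follows. This is a minimal sketch under assumptions: the function name and the `--model llama3` placeholder are invented for illustration; only `-kv`, `KV_CACHE_DTYPE`, and `--static_kv_dtype` come from the PR.

```shell
#!/bin/bash
# Sketch: parse a -kv option and forward it as --static_kv_dtype
# only when an explicit dtype was requested.

build_common_args() {
    local kv_dtype="auto"
    local common_args="--model llama3"   # placeholder for the real args
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -kv) kv_dtype="$2"; shift 2 ;;
            *)   shift ;;
        esac
    done
    # Leave COMMON_ARGS untouched when the dtype is the "auto" default.
    if [[ "$kv_dtype" != "auto" ]]; then
        common_args="$common_args --static_kv_dtype $kv_dtype"
    fi
    echo "$common_args"
}

build_common_args -kv fp8   # prints: --model llama3 --static_kv_dtype fp8
```

Keeping the append conditional preserves the existing behavior for users who never pass `-kv`.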
Documentation
README.md
Update README with FP8 KV cache instructions                         

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md

  • Added note about enabling FP8 KV cache
+2/-0     

Signed-off-by: He, Xin3 <xin3.he@intel.com>
@PRAgent4INC (Collaborator)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Default Value

The default value for static_kv_dtype is None. Confirm this is intentional and that omitting the argument leaves existing behavior unchanged.

parser.add_argument(
    "--static_kv_dtype",
    default=None,
    type=str,
    choices=["fp8", "float8_e4m3fn"],
    help="Data type for statically quantizing key and value.",
)
Environment Variables

The script sets environment variables conditionally based on KV_CACHE_DTYPE. Verify that these settings are correct and do not conflict with other configurations or system settings.

if [[ "$KV_CACHE_DTYPE" == "fp8" ]]; then
    export VLLM_FLASHINFER_DISABLE_Q_QUANTIZATION=0
    export VLLM_ATTENTION_BACKEND="FLASHINFER"
    echo "Using FP8 for KV cache"
fi

@PRAgent4INC (Collaborator)

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General

Suggestion: Handle KV_CACHE_DTYPE default

Ensure that KV_CACHE_DTYPE is properly handled when it is set to auto. Consider
setting a default value or handling it explicitly.

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh [90]

-local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1,max_model_len=8192 --tasks $tasks --batch_size $BATCH_SIZE $extra_args"
+local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,data_parallel_size=1,max_model_len=8192,kv_cache_dtype=${KV_CACHE_DTYPE:-auto} --tasks $tasks --batch_size $BATCH_SIZE $extra_args"
Suggestion importance[1-10]: 5


Why: The suggestion proposes handling the KV_CACHE_DTYPE default value, which is a minor improvement for robustness. However, the existing code already sets KV_CACHE_DTYPE="auto" by default, so this change is not strictly necessary.

Impact: Low
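
The suggested `kv_cache_dtype=${KV_CACHE_DTYPE:-auto}` relies on standard POSIX default-value parameter expansion; a minimal sketch of that mechanism:

```shell
#!/bin/bash
# ${VAR:-default} expands to "default" when VAR is unset or empty,
# otherwise to VAR's value.

unset KV_CACHE_DTYPE
echo "kv_cache_dtype=${KV_CACHE_DTYPE:-auto}"   # prints kv_cache_dtype=auto

KV_CACHE_DTYPE="fp8"
echo "kv_cache_dtype=${KV_CACHE_DTYPE:-auto}"   # prints kv_cache_dtype=fp8
```

This makes the command robust even if the script's own `KV_CACHE_DTYPE="auto"` default were ever removed.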

