
[Bug]: gemma3 shows degraded accuracy in vLLM v0.8.4 #17689

Open

@llsj14

Your current environment

  • vLLM version: v0.8.4
  • docker image: nvcr.io/nvidia/pytorch:23.10-py3
  • install script
pip install -r ./requirements/build.txt
pip install -r ./requirements/common.txt
pip install -r ./requirements/cuda.txt
pip install flash_attn==2.7.4.post1
export VLLM_COMMIT=dc1b4a6f1300003ae27f033afbdff5e2683721ce
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install -e .
pip install -U pynvml
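
To confirm which build the editable install picked up (a minimal sanity check, not part of the original report; it only assumes the standard vllm package attribute):

python -c "import vllm; print(vllm.__version__)"  # should report 0.8.4 for the precompiled wheel above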
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.19.93-1.nbp.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-dali-cuda120==1.30.0
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] nvidia-pyindex==1.0.9
[pip3] onnx==1.14.0
[pip3] pynvml==12.0.0
[pip3] pytorch-quantization==2.1.2
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torch-tensorrt==0.0.0
[pip3] torchaudio==2.6.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.4
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.2.5.6
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.19.3
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.2.2.009
PYTORCH_VERSION=2.1.0a0+32f93b1
PYTORCH_BUILD_NUMBER=0
CUDNN_VERSION=8.9.5.29
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=71422337
CUDA_DRIVER_VERSION=535.104.05
PYTORCH_BUILD_VERSION=2.1.0a0+32f93b1
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=23.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

I'm testing open-source models with vLLM v0.8.4 and lm-evaluation-harness, comparing the vLLM results against Hugging Face (transformers) results. Under this version, the Gemma 3 models show severely degraded accuracy, while Qwen 2.5 matches the Hugging Face numbers closely.
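For reference, the evaluations were driven through the lm-evaluation-harness CLI. The commands below are a minimal sketch of that kind of run, not the exact invocations from the report; the tensor_parallel_size, dtype, and batch-size values in --model_args are assumptions and may differ from what was actually used.

# vLLM backend of lm-evaluation-harness (tensor_parallel_size is assumed)
lm_eval --model vllm \
  --model_args pretrained=google/gemma-3-27b-it,tensor_parallel_size=4,dtype=auto \
  --tasks hellaswag,lambada_openai \
  --batch_size auto

# Hugging Face (transformers) backend for comparison; batch_size 4 matches the
# gen_kwargs line reported in the HF run below
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-27b-it,dtype=auto \
  --tasks hellaswag,lambada_openai \
  --batch_size 4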

Qwen2.5-72B-Instruct

vLLM result

| Tasks          | Version | Filter | n-shot | Metric     | Value  |   | Stderr  |
|----------------|---------|--------|--------|------------|--------|---|---------|
| hellaswag      | 1       | none   | 0      | acc        | 0.7042 | ± | 0.0046  |
|                |         | none   | 0      | acc_norm   | 0.8741 | ± | 0.0033  |
| lambada_openai | 1       | none   | 0      | acc        | 0.7613 | ± | 0.0059  |
|                |         | none   | 0      | perplexity | 2.7697 | ± | 0.0560  |

HF result

gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4

| Tasks          | Version | Filter | n-shot | Metric     | Value  |   | Stderr  |
|----------------|---------|--------|--------|------------|--------|---|---------|
| hellaswag      | 1       | none   | 0      | acc        | 0.7031 | ± | 0.0046  |
|                |         | none   | 0      | acc_norm   | 0.8735 | ± | 0.0033  |
| lambada_openai | 1       | none   | 0      | acc        | 0.7518 | ± | 0.0060  |
|                |         | none   | 0      | perplexity | 2.7677 | ± | 0.0559  |

gemma-3-27b-it

vLLM result

| Tasks          | Version | Filter | n-shot | Metric     | Value    |   | Stderr  |
|----------------|---------|--------|--------|------------|----------|---|---------|
| hellaswag      | 1       | none   | 0      | acc        | 0.3863   | ± | 0.0049  |
|                |         | none   | 0      | acc_norm   | 0.4862   | ± | 0.0050  |
| lambada_openai | 1       | none   | 0      | acc        | 0.2251   | ± | 0.0058  |
|                |         | none   | 0      | perplexity | 956.6927 | ± | 77.7910 |

HF result

| Tasks          | Version | Filter | n-shot | Metric     | Value  |   | Stderr  |
|----------------|---------|--------|--------|------------|--------|---|---------|
| hellaswag      | 1       | none   | 0      | acc        | 0.6508 | ± | 0.0048  |
|                |         | none   | 0      | acc_norm   | 0.8405 | ± | 0.0037  |
| lambada_openai | 1       | none   | 0      | acc        | 0.6942 | ± | 0.0064  |
|                |         | none   | 0      | perplexity | 3.7705 | ± | 0.1152  |

gemma-3-12b-it

vLLM result

| Tasks          | Version | Filter | n-shot | Metric     | Value     |   | Stderr   |
|----------------|---------|--------|--------|------------|-----------|---|----------|
| hellaswag      | 1       | none   | 0      | acc        | 0.3351    | ± | 0.0047   |
|                |         | none   | 0      | acc_norm   | 0.3999    | ± | 0.0049   |
| lambada_openai | 1       | none   | 0      | acc        | 0.2424    | ± | 0.0060   |
|                |         | none   | 0      | perplexity | 7298.6355 | ± | 885.6837 |

HF result

| Tasks          | Version | Filter | n-shot | Metric     | Value  |   | Stderr  |
|----------------|---------|--------|--------|------------|--------|---|---------|
| hellaswag      | 1       | none   | 0      | acc        | 0.6285 | ± | 0.0048  |
|                |         | none   | 0      | acc_norm   | 0.8187 | ± | 0.0038  |
| lambada_openai | 1       | none   | 0      | acc        | 0.6775 | ± | 0.0065  |
|                |         | none   | 0      | perplexity | 4.1643 | ± | 0.1289  |

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug (Something isn't working)