Description
Your current environment
- vLLM version: v0.8.4
- docker image: nvcr.io/nvidia/pytorch:23.10-py3
- install script
pip install -r ./requirements/build.txt
pip install -r ./requirements/common.txt
pip install -r ./requirements/cuda.txt
pip install flash_attn==2.7.4.post1
export VLLM_COMMIT=dc1b4a6f1300003ae27f033afbdff5e2683721ce
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install -e .
pip install -U pynvml
Environment info
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.19.93-1.nbp.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-dali-cuda120==1.30.0
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] nvidia-pyindex==1.0.9
[pip3] onnx==1.14.0
[pip3] pynvml==12.0.0
[pip3] pytorch-quantization==2.1.2
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torch-tensorrt==0.0.0
[pip3] torchaudio==2.6.0
[pip3] torchdata==0.7.0a0
[pip3] torchtext==0.16.0a0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.4
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.2.5.6
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.19.3
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.2.2.009
PYTORCH_VERSION=2.1.0a0+32f93b1
PYTORCH_BUILD_NUMBER=0
CUDNN_VERSION=8.9.5.29
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=71422337
CUDA_DRIVER_VERSION=535.104.05
PYTORCH_BUILD_VERSION=2.1.0a0+32f93b1
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=23.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
I'm testing open-source models with vLLM v0.8.4 and lm-evaluation-harness, comparing the vLLM results against those from Hugging Face (HF). With this version, the Gemma 3 models score far lower under vLLM than under HF. Qwen 2.5 shows no such gap.
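The exact commands are not included in this report. As a rough sketch only, a comparison like this can be driven through the harness's Python API (`lm_eval.simple_evaluate`). The Hub ID `google/gemma-3-27b-it`, the helper names `build_eval_kwargs` and `run_eval`, and the vLLM batch size `"auto"` are assumptions for illustration; the HF batch size of 4 matches the `batch_size: 4` line in the HF output.

```python
# Hypothetical reproduction sketch; not the exact commands used for this report.
TASKS = ["hellaswag", "lambada_openai"]


def build_eval_kwargs(backend, pretrained, batch_size):
    """Build keyword arguments for lm_eval.simple_evaluate for one backend."""
    return {
        "model": backend,  # "vllm" or "hf" (model names registered in lm-evaluation-harness)
        "model_args": f"pretrained={pretrained}",
        "tasks": list(TASKS),
        "batch_size": batch_size,
    }


def run_eval(kwargs):
    """Run the harness and return its per-task results dictionary."""
    # Imported lazily so build_eval_kwargs can be used without the harness installed.
    import lm_eval

    return lm_eval.simple_evaluate(**kwargs)["results"]


# Example usage (needs GPUs, model access, and both backends installed):
# vllm_results = run_eval(build_eval_kwargs("vllm", "google/gemma-3-27b-it", "auto"))
# hf_results = run_eval(build_eval_kwargs("hf", "google/gemma-3-27b-it", 4))
```

Running both backends with the same task list and few-shot setting keeps the comparison like-for-like.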
Qwen2.5-72B-Instruct
vLLM result
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.7042 | ± | 0.0046 |
| | | none | 0 | acc_norm | ↑ | 0.8741 | ± | 0.0033 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.7613 | ± | 0.0059 |
| | | none | 0 | perplexity | ↓ | 2.7697 | ± | 0.0560 |
HF result
gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.7031 | ± | 0.0046 |
| | | none | 0 | acc_norm | ↑ | 0.8735 | ± | 0.0033 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.7518 | ± | 0.0060 |
| | | none | 0 | perplexity | ↓ | 2.7677 | ± | 0.0559 |
gemma-3-27b-it
vLLM result
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.3863 | ± | 0.0049 |
| | | none | 0 | acc_norm | ↑ | 0.4862 | ± | 0.0050 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.2251 | ± | 0.0058 |
| | | none | 0 | perplexity | ↓ | 956.6927 | ± | 77.7910 |
HF result
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.6508 | ± | 0.0048 |
| | | none | 0 | acc_norm | ↑ | 0.8405 | ± | 0.0037 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.6942 | ± | 0.0064 |
| | | none | 0 | perplexity | ↓ | 3.7705 | ± | 0.1152 |
gemma-3-12b-it
vLLM result
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.3351 | ± | 0.0047 |
| | | none | 0 | acc_norm | ↑ | 0.3999 | ± | 0.0049 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.2424 | ± | 0.0060 |
| | | none | 0 | perplexity | ↓ | 7298.6355 | ± | 885.6837 |
HF result
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc | ↑ | 0.6285 | ± | 0.0048 |
| | | none | 0 | acc_norm | ↑ | 0.8187 | ± | 0.0038 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.6775 | ± | 0.0065 |
| | | none | 0 | perplexity | ↓ | 4.1643 | ± | 0.1289 |
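To put the gaps above on a common scale, here is a small sketch that divides each vLLM minus HF difference by the combined standard error of the two runs. The numbers are copied from the tables in this report. The function names and the threshold of 3 are illustrative assumptions. Both backends score the same examples, so the two errors are not truly independent, which makes this a rough yardstick and not a formal test.

```python
import math

# (model, task, metric) -> ((vLLM value, vLLM stderr), (HF value, HF stderr)),
# copied from the result tables in this report.
RESULTS = {
    ("Qwen2.5-72B-Instruct", "hellaswag", "acc"): ((0.7042, 0.0046), (0.7031, 0.0046)),
    ("Qwen2.5-72B-Instruct", "hellaswag", "acc_norm"): ((0.8741, 0.0033), (0.8735, 0.0033)),
    ("Qwen2.5-72B-Instruct", "lambada_openai", "acc"): ((0.7613, 0.0059), (0.7518, 0.0060)),
    ("Qwen2.5-72B-Instruct", "lambada_openai", "perplexity"): ((2.7697, 0.0560), (2.7677, 0.0559)),
    ("gemma-3-27b-it", "hellaswag", "acc"): ((0.3863, 0.0049), (0.6508, 0.0048)),
    ("gemma-3-27b-it", "hellaswag", "acc_norm"): ((0.4862, 0.0050), (0.8405, 0.0037)),
    ("gemma-3-27b-it", "lambada_openai", "acc"): ((0.2251, 0.0058), (0.6942, 0.0064)),
    ("gemma-3-27b-it", "lambada_openai", "perplexity"): ((956.6927, 77.7910), (3.7705, 0.1152)),
    ("gemma-3-12b-it", "hellaswag", "acc"): ((0.3351, 0.0047), (0.6285, 0.0048)),
    ("gemma-3-12b-it", "hellaswag", "acc_norm"): ((0.3999, 0.0049), (0.8187, 0.0038)),
    ("gemma-3-12b-it", "lambada_openai", "acc"): ((0.2424, 0.0060), (0.6775, 0.0065)),
    ("gemma-3-12b-it", "lambada_openai", "perplexity"): ((7298.6355, 885.6837), (4.1643, 0.1289)),
}


def gap_in_stderrs(vllm, hf):
    """Return (vLLM - HF) divided by the combined standard error of both runs."""
    (v, v_se), (h, h_se) = vllm, hf
    return (v - h) / math.sqrt(v_se**2 + h_se**2)


def flag_gaps(results, threshold=3.0):
    """List the (model, task, metric) keys whose gap exceeds the threshold (assumed: 3)."""
    return [
        key
        for key, (vllm, hf) in results.items()
        if abs(gap_in_stderrs(vllm, hf)) > threshold
    ]


if __name__ == "__main__":
    for key in flag_gaps(RESULTS):
        print(key, round(gap_in_stderrs(*RESULTS[key]), 1))
```

With these numbers, every Gemma 3 metric exceeds the threshold and no Qwen 2.5 metric does, which matches the pattern described above.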