### Your current environment

(The output of `python collect_env.py` was not included.)
### 🐛 Describe the bug

#### Description
- In vLLM 0.7 and earlier, using a high temperature (10) with a random input string always returns `max_tokens` tokens, i.e. random output of the requested length (see the sketch below for why this is expected).
- With a temperature of 0, the same versions return something similar to "It seems like you've entered a string of characters that doesn't appear to be a meaningful word, phrase, or question."
- With the Docker images 0.8.0 or 0.8.1, no matter the temperature, it always answers something like "It seems like you've entered a string of characters that doesn't appear to be a meaningful word, phrase, or question."
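For reference, here is a toy illustration (made-up logits, nothing taken from vLLM internals) of why a high temperature should give random output: sampling divides the logits by the temperature before the softmax, so a temperature of 10 flattens the next-token distribution toward uniform.

```python
import math

# Toy logits, invented for illustration only
logits = [5.0, 2.0, 0.5]

def softmax_with_temperature(zs, temperature):
    # Divide logits by the temperature, then apply softmax
    exps = [math.exp(z / temperature) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_with_temperature(logits, 1.0))   # peaked: ~[0.94, 0.05, 0.01]
print(softmax_with_temperature(logits, 10.0))  # flat:   ~[0.42, 0.31, 0.27]
```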
#### Details

I tried multiple models and the temperature seems to be ignored for all of them.
#### Reproduction
Starting a Docker container with:

```bash
docker run --gpus all \
  --entrypoint bash \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  -p 8000:8000 \
  -it \
  vllm/vllm-openai:v0.7.3
```
and running

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
```

inside the container on the server side, then running the following on the client side:
```python
import random
import string

from openai import OpenAI

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model=model_name,
    max_tokens=1000,
    temperature=10,
    messages=[
        {"role": "system", "content": "You are Qwen."},
        {
            "role": "user",
            # Random 10-character alphanumeric string as a nonsense prompt
            "content": "".join(random.choices(string.ascii_letters + string.digits, k=10)),
        },
    ],
)
print(response.choices[0].message.content)
```
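To make the comparison explicit, here is a minimal sketch (the temperature values, `max_tokens`, and the 120-character truncation are arbitrary choices of mine) that sends the same nonsense prompt at several temperatures. With correct sampling, temperature 10 should give near-random tokens; if the bug reproduces, every temperature yields essentially the same "meaningless string" style reply.

```python
import random
import string

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# One nonsense prompt, asked at several temperatures
prompt = "".join(random.choices(string.ascii_letters + string.digits, k=10))

for temp in (0.0, 1.0, 10.0):
    response = client.chat.completions.create(
        model=model_name,
        max_tokens=100,
        temperature=temp,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"temperature={temp}: {response.choices[0].message.content[:120]!r}")
```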
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.