Description
Your current environment
Using the latest version of vLLM on 2 L4 GPUs.
How would you like to use vllm
I was trying to use vLLM to deploy the `meta-llama/Meta-Llama-3-8B-Instruct` model with the OpenAI-compatible server and the latest Docker image. When I did, generation would not stop for a long time when `max_tokens=None`. I saw that the model generates the `<|eot_id|>` token, which apparently is its EOS token, but in its `tokenizer_config.json` and other configs the EOS token is `<|end_of_text|>`.
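For context, the mismatch can be checked directly with the Hugging Face tokenizer. This is only a small verification sketch (it assumes `transformers` is installed and that you have access to the gated Llama 3 repo), not part of the fix itself:

```python
from transformers import AutoTokenizer

# Requires access to the gated meta-llama repo on the Hugging Face Hub
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(tok.eos_token)                            # <|end_of_text|>, per tokenizer_config.json
print(tok.convert_tokens_to_ids("<|eot_id|>"))  # 128009, the token the model actually emits
```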
I can fix this by setting the `eos_token` field in `tokenizer_config.json` to `<|eot_id|>`, or by passing `stop_token_ids` in my request:
```python
from openai import OpenAI

# Client pointed at the vLLM OpenAI-compatible server (default local address)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user",
               "content": "Write a function for fibonacci sequence. Use LRUCache"}],
    max_tokens=700,
    stream=False,
    extra_body={"stop_token_ids": [128009]},  # 128009 == <|eot_id|>
)
```
I wanted to ask what the optimal way to solve this problem is.
There is an existing discussion/PR in their repo that updates `generation_config.json`, but unless I clone the repo myself, it seems vLLM does not load the `generation_config.json` file. I also tried with this revision, but it still did not stop generating after `<|eot_id|>`. I tried with this revision as well, and it did not stop generating either.
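As an aside, one way to sidestep the config files entirely is to pass the stop token id through vLLM's sampling parameters. Below is a minimal offline sketch under my assumptions (the `revision` value is only a placeholder, and the chat template is not applied for brevity); it is not meant as the definitive fix:

```python
from vllm import LLM, SamplingParams

# Load a specific Hugging Face revision; "main" is only a placeholder here
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", revision="main")

# Stop explicitly on <|eot_id|> (id 128009) without touching any config file
params = SamplingParams(max_tokens=700, stop_token_ids=[128009])

# Plain prompt for brevity; a real chat request should apply the chat template
outputs = llm.generate(["Write a function for fibonacci sequence. Use LRUCache"], params)
print(outputs[0].outputs[0].text)
```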
tldr; the Llama-3-8B-Instruct model does not stop generation because of its `eos_token`.
- Updating `generation_config.json` does not work.
- Updating `config.json` also does not work.
- Updating `tokenizer_config.json` works, but it overwrites the existing `eos_token` (see the sketch after this list). Is this problematic, or is there a more elegant way to solve this?
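For completeness, this is roughly what the `tokenizer_config.json` workaround looks like when applied to a locally downloaded copy of the model (the path below is hypothetical):

```python
import json
from pathlib import Path

# Hypothetical path to a locally cloned copy of the model repo
cfg_path = Path("Meta-Llama-3-8B-Instruct/tokenizer_config.json")

cfg = json.loads(cfg_path.read_text())
cfg["eos_token"] = "<|eot_id|>"  # replaces the original <|end_of_text|>
cfg_path.write_text(json.dumps(cfg, indent=2))
```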
May I ask what the optimal way to solve this issue is?