Hello,
I have been trying vLLM 0.1.3 with the LLaMA 2 model:

```
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-70b-chat-hf --port 6011 --tensor-parallel-size 8 --tokenizer hf-internal-testing/llama-tokenizer
```
I noticed the server hangs and stops processing further requests after receiving a prompt that is too long. The last output is:

```
WARNING 08-15 11:44:47 scheduler.py:130] Input prompt (2715 tokens) is too long and exceeds limit of 2560
```
I see there is a related fix (#273), but I am not sure why this still happens.
Is there a way to change this behavior and simply discard such requests instead?
Thanks!
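
In the meantime, one possible workaround is to screen prompts on the client side so that over-length requests never reach the server. Below is a minimal sketch, assuming the same tokenizer, the 2560-token limit from the warning above, and the OpenAI-compatible `/v1/completions` endpoint on port 6011; the helper name and the limit constant are illustrative, not part of vLLM itself.

```python
# Client-side length check before calling the vLLM OpenAI-compatible server.
# Assumptions: the server started with the command above is listening on
# localhost:6011, and the effective prompt limit is 2560 tokens.
import requests
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 2560  # limit reported in the scheduler warning (assumed)
API_URL = "http://localhost:6011/v1/completions"

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")


def send_if_short_enough(prompt: str, max_tokens: int = 256):
    """Discard over-length prompts client-side instead of sending them."""
    n_tokens = len(tokenizer.encode(prompt))
    if n_tokens > MAX_PROMPT_TOKENS:
        # Skip the request entirely so the server never sees it.
        print(f"Discarding prompt: {n_tokens} tokens exceeds {MAX_PROMPT_TOKENS}")
        return None
    payload = {
        "model": "meta-llama/Llama-2-70b-chat-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return requests.post(API_URL, json=payload).json()
```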