
Server hanging after prompt exceeds limit with LLaMA 2 models #765

Closed
@dexterju27

Description

Hello,

I have been trying vLLM 0.1.3 with a LLaMA 2 model:

python  -m vllm.entrypoints.openai.api_server  --model meta-llama/Llama-2-70b-chat-hf --port 6011 --tensor-parallel-size 8 --tokenizer hf-internal-testing/llama-tokenizer

I noticed that the server hangs and stops processing further requests after it receives a prompt that is too long. The last output is:

WARNING 08-15 11:44:47 scheduler.py:130] Input prompt (2715 tokens) is too long and exceeds limit of 2560

I see there is a related fix (#273), but I am not sure why this still happens.

Is there a way to change this behavior and just discard this request instead?
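As a workaround on my side, I am currently guarding against this on the client before sending anything. The sketch below is just what I am doing for now, not a fix: it assumes the OpenAI-compatible /v1/completions endpoint the server exposes, counts prompt tokens with the same tokenizer passed via --tokenizer, and uses the 2560 limit reported in the warning above (the real budget may also need to leave room for max_tokens).

```python
# Client-side guard: count prompt tokens and skip requests that would
# exceed the server's limit instead of letting the server hang on them.
import requests
from transformers import AutoTokenizer

API_URL = "http://localhost:6011/v1/completions"  # port from the launch command above
PROMPT_LIMIT = 2560                               # limit reported in the scheduler warning

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def complete(prompt: str, max_tokens: int = 256):
    num_tokens = len(tokenizer.encode(prompt))
    if num_tokens > PROMPT_LIMIT:
        # Discard the request client-side rather than sending it to the server.
        print(f"Skipping prompt: {num_tokens} tokens exceeds limit of {PROMPT_LIMIT}")
        return None
    resp = requests.post(API_URL, json={
        "model": "meta-llama/Llama-2-70b-chat-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
    })
    resp.raise_for_status()
    return resp.json()
```

This avoids triggering the hang, but ideally the server itself would reject such a request and keep serving.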

Thanks!
