Add option to completion API to truncate prompt tokens #3144

Merged (13 commits) on Apr 5, 2024

Conversation

@tdoublep (Member) commented Mar 1, 2024

Hey vLLM team - thanks for the awesome project.

This PR adds an additional option to the OpenAI completion API that allows one to truncate the number of input tokens for an individual request. We find this feature extremely useful for benchmarking and performance evaluation.

Without this option, if we want precise control over the number of input tokens, we need to implement tokenization on the client side (e.g., in our load-test environment), which introduces a bunch of dependencies. In that sense, we can live without this feature, but it is super convenient to be able to do truncation on the server side.

I have tried to keep the changes to a minimum, but if there is interest I have also implemented this for the AsyncLLMEngine and LLMEngine.
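For illustration, a client request using the new option against vLLM's OpenAI-compatible server might look roughly like the sketch below. The server URL and model name are placeholders; only the truncate_prompt_tokens field is what this PR adds.

```python
# Minimal sketch, assuming a vLLM OpenAI-compatible server is running at the
# placeholder URL below. The server truncates the prompt to its last 128
# tokens, so the client never needs to tokenize anything itself.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # placeholder endpoint
    json={
        "model": "my-model",                 # placeholder model name
        "prompt": "a very long prompt " * 500,
        "max_tokens": 32,
        "truncate_prompt_tokens": 128,       # option added by this PR
    },
)
print(response.json()["choices"][0]["text"])
```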

@tdoublep changed the title from "Add option to completions API to truncate input tokens for each request" to "Add option to completion API to truncate input tokens for each request" on Mar 1, 2024
@njhill (Member) left a comment

Thanks @tdoublep! I'm wondering whether, to align with existing vLLM parlance, it might be better to rename this to truncate_prompt_tokens?

Two review threads on vllm/entrypoints/openai/serving_engine.py (outdated, resolved)
tdoublep and others added 3 commits March 1, 2024 19:52
tokenizer_kwargs: more efficient allocation

Co-authored-by: Nick Hill <nickhill@us.ibm.com>
@tdoublep changed the title from "Add option to completion API to truncate input tokens for each request" to "Add option to completion API to truncate prompt tokens" on Mar 1, 2024
njhill added a commit to njhill/vllm that referenced this pull request Mar 19, 2024
vllm-project#2879 added support for using ray to offload tokenization from the asyncio event loop.

This PR extends that to support using a thread pool instead of ray, and makes that the default, with the default pool size determined based on the number of available CPU cores and the tensor parallel size.

The main thing to note is that separate tokenizer instances are used per thread. This is because officially the HF tokenizers are not thread-safe. In practice I think they are unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144).

Also includes some type hint additions to related parts of the code.

This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
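To illustrate the per-thread-tokenizer idea described above, here is a minimal sketch (not the actual vLLM code; the helper names and pool size are made up) that gives each worker thread its own HF tokenizer via thread-local storage:

```python
# Sketch only: one tokenizer instance per worker thread, since HF "fast"
# tokenizers are not officially thread-safe once padding/truncation is used.
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

from transformers import AutoTokenizer

_local = threading.local()

def _get_tokenizer(model_name: str):
    # Lazily create a tokenizer the first time each thread needs one.
    if not hasattr(_local, "tokenizer"):
        tok = AutoTokenizer.from_pretrained(model_name)
        tok.truncation_side = "left"  # keep the end of the prompt, as in this PR
        _local.tokenizer = tok
    return _local.tokenizer

def encode(prompt: str, model_name: str,
           truncate_prompt_tokens: Optional[int] = None) -> List[int]:
    tokenizer = _get_tokenizer(model_name)
    if truncate_prompt_tokens is not None:
        return tokenizer(prompt, truncation=True,
                         max_length=truncate_prompt_tokens).input_ids
    return tokenizer(prompt).input_ids

# Offload tokenization to a small thread pool, keeping the asyncio loop free.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(encode, "some long prompt", "gpt2", 8) for _ in range(4)]
    print([f.result() for f in futures])
```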
@simon-mo (Collaborator) left a comment

LGTM! I believe there are merge conflicts since documentation was added to protocol.py.

@@ -183,6 +183,7 @@ class CompletionRequest(BaseModel):
guided_json: Optional[Union[str, dict, BaseModel]] = None
guided_regex: Optional[str] = None
guided_choice: Optional[List[str]] = None
truncate_prompt_tokens: Optional[int] = None
A collaborator commented on the diff above:

Nit: can you make this a constrained integer? https://docs.pydantic.dev/2.3/api/types/#pydantic.types.conint
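
For context, the constrained-integer version suggested here would look roughly like the following sketch (illustrative; not necessarily the exact code that landed in protocol.py):

```python
# Sketch: constrain truncate_prompt_tokens to a positive integer so that
# invalid values (0, negatives) are rejected at request-validation time.
from typing import Optional

from pydantic import BaseModel, conint

class CompletionRequest(BaseModel):
    # ...other fields omitted...
    truncate_prompt_tokens: Optional[conint(ge=1)] = None
```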

@tdoublep (Member, Author) replied:
done

@tdoublep (Member, Author) commented:

CI errors do not look related to these changes

@njhill mentioned this pull request Mar 27, 2024
@diego898 commented:

Apologies @tdoublep - it seems I submitted a simple comment as a review. That was not my intention.

I think this is great and would be very excited to have this merged, as we have also had to implement tokenization client-side to prevent vLLM rejecting requests that are too long.

My previous comment is about the truncation side: for various reasons/formats we'd want to trim from either the left or the right, and since it is already a parameter that has to get set, it may make sense to add both together.

Thanks!

@simon-mo (Collaborator) commented:

> My previous comment is about the truncation side: for various reasons/formats we'd want to trim from either the left or the right, and since it is already a parameter that has to get set, it may make sense to add both together.

If this is a valid use case, maybe let's support it in this PR as well?

@diego898 commented:

For reference, see the HF tokenizer docs on truncation_side. This PR hard-codes it to "left".
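
To make the terminology concrete: left truncation keeps the newest (rightmost) tokens and drops the oldest, while right truncation keeps the beginning of the sequence. A tiny sketch with arbitrary token IDs:

```python
# Illustration only: truncating a token sequence to 4 tokens on either side.
token_ids = [101, 7592, 1010, 2088, 999, 102]  # made-up token IDs
limit = 4

left_truncated = token_ids[-limit:]   # keep the last 4 tokens (what this PR hard-codes)
right_truncated = token_ids[:limit]   # keep the first 4 tokens
print(left_truncated)   # [1010, 2088, 999, 102]
print(right_truncated)  # [101, 7592, 1010, 2088]
```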

@tdoublep (Member, Author) commented Apr 3, 2024

@diego898 @simon-mo I made the truncation_side configurable via a new option in the ModelConfig. Please take a look and let me know what you think.

@njhill (Member) commented Apr 3, 2024

> My previous comment is about the truncation side: for various reasons/formats we'd want to trim from either the left or the right, and since it is already a parameter that has to get set, it may make sense to add both together.

@diego898 @simon-mo could you give an example of a case where it would make sense to truncate on the right for autoregressive text generation? I can't think of one but could be a failure of imagination.

I don't think the fact that HF tokenizers support this is in itself a good reason to support it here; they are much more general and are used in many other contexts, such as data prep and training, and with different kinds of models.

@diego898 commented Apr 3, 2024

@njhill - great point. I guess I was thinking it would depend on how someone structures their context window. For example, for RAG:

|--[----chat history----]-[System prompt]--[----------------Context docs------------------]-[current question]-|

You might want to trim only from the left. But if instead you did:

|[System prompt]--[----------------Context docs------------------]-[current question]---[----chat history----]-|

You might want to trim from the right? Or

|[System prompt]-[----chat history----]---[current question]--[----------------Context docs------------------]|

again from the right?

I'm not really suggesting any of these layouts are better or worse; several wouldn't even make sense. I'm just relaying what I was thinking when suggesting to make it configurable.

But if vLLM decides left-side truncation is the norm/default, that would also be fine, and end users can structure their context windows accordingly!

@njhill (Member) commented Apr 4, 2024

Thanks @diego898. TBH, I don't think it would make sense to use this truncate option at all in conjunction with a system prompt. Some other form of truncation would need to be used, i.e. when applying the chat template, to retain the system prompt and then exclude the beginning of the user prompt or conversation.

And I don't think that right-truncation would be sensible either for any of the examples you gave. In a chat the last thing you would want to exclude is the most recent messages, and if you truncated in the middle of some arbitrary context docs, the LM would just attempt to continue writing that particular doc.

I would still vote for keeping this left-truncated only. If a concrete need arose in the future, it would still be easy to add an option to configure the side.
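
As a concrete illustration of "some other form of truncation" at chat-template time (purely hypothetical; not something this PR implements, and the function name is made up), one could keep the system message and drop the oldest turns until the prompt fits a token budget:

```python
# Hypothetical sketch: retain the system prompt, drop the oldest chat turns
# until the conversation fits within max_prompt_tokens.
from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}

def fit_chat_to_budget(messages: List[Message],
                       count_tokens: Callable[[List[Message]], int],
                       max_prompt_tokens: int) -> List[Message]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and count_tokens(system + turns) > max_prompt_tokens:
        turns.pop(0)  # drop the oldest non-system message first
    return system + turns
```

Here count_tokens would be whatever tokenizer-based counting the caller already has available.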

@tdoublep (Member, Author) commented Apr 4, 2024

@simon-mo Do you agree with @njhill that it makes sense to hard-code truncation_side to left? If so, I will revert this branch to the earlier commit.

@diego898 commented Apr 4, 2024

I apologize for the confusion my request caused! I 100% agree with you @njhill - that is in fact what we do - truncate the chat history and not either side.

I stretched my brain to try and describe a situation where right truncation may make sense, but didn't convince even myself!

I confess, my initial request was based solely on the fact that HF has it configurable.

I apologize @tdoublep for the extra work this caused you!

@simon-mo (Collaborator) commented Apr 4, 2024

I trust @njhill to decide and merge.

@tdoublep (Member, Author) commented Apr 5, 2024

@njhill I've reverted back to the version with the truncation side fixed to left, and resolved some minor conflicts with changes on the main branch. I think it is ready to merge.

@tdoublep (Member, Author) commented Apr 5, 2024

IDK why ruff is failing - the formatting checks are passing for me locally:

$ bash format.sh 
vLLM yapf: Done
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done

@njhill (Member) left a comment

Thanks for the contribution @tdoublep!

@njhill merged commit 1d7c940 into vllm-project:main on Apr 5, 2024 (35 checks passed)
@satpalsr commented Apr 9, 2024

Why not also add it to the Chat Completions API?

@m-harmonic commented:

@tdoublep Thanks for your earlier work. Just to check, am I seeing correctly that truncate_prompt_tokens was added to SamplingParams but has no effect except when using vLLM's OpenAI-compatible server? Is there any plan to add support in the normal vLLM (async) engine? Thanks

@tdoublep (Member, Author) commented:

@m-harmonic There was a PR that tried to address this: #4598

Maybe we can check with @yecohn regarding the status. I'm happy to help finish it if required.

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Labels: none yet
Projects: none yet
6 participants