Asynchronous tokenization #2879
Conversation
@Yard1 did you consider using a
@njhill I agree that a thread-based solution should work in principle for the most popular models - it would be good to confirm that, though. Would you be interested in trying it out using the API here?
@Yard1 sure!
@njhill @cadedaniel updated, ptal
LGTM, one design comment and some nits
vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py
Small nits, looks great!
vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py
some nits around code style
if not lora_request or not self.enable_lora:
    return self.tokenizer
Readability-wise, it would be helpful to move these up into the corresponding encode function.
Those are not private methods; I think it makes sense to keep this logic here as it's relevant.
if not lora_request or not self.enable_lora:
    return self.tokenizer
same
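To make the pattern under discussion concrete, here is a minimal sketch of keeping the LoRA fallback inside the tokenizer lookup rather than in encode() itself; the class and attribute names are illustrative and not the exact vLLM implementation:

from typing import Dict, List


class TokenizerGroupSketch:
    """Hypothetical tokenizer group mirroring the shape of the code under review."""

    def __init__(self, tokenizer, enable_lora: bool):
        self.tokenizer = tokenizer
        self.enable_lora = enable_lora
        # Illustrative cache: lora_int_id -> adapter-specific tokenizer.
        self.lora_tokenizers: Dict[int, object] = {}

    def get_lora_tokenizer(self, lora_request=None):
        # Fall back to the base tokenizer when LoRA is not involved.
        if not lora_request or not self.enable_lora:
            return self.tokenizer
        # Otherwise return the adapter-specific tokenizer (or the base one
        # if none has been loaded for this adapter yet).
        return self.lora_tokenizers.get(lora_request.lora_int_id,
                                        self.tokenizer)

    def encode(self, prompt: str, lora_request=None) -> List[int]:
        # encode() stays thin; the LoRA branching lives in the lookup above.
        return self.get_lora_tokenizer(lora_request).encode(prompt)

The suggestion above would instead hoist the check to the top of encode(); keeping it in the lookup means every caller of get_lora_tokenizer gets the same fallback behaviour.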
@abstractmethod
def get_lora_tokenizer(
        self,
        lora_request: Optional[LoRARequest]) -> "PreTrainedTokenizer":
    """Get a tokenizer for a LoRA request."""
    pass

@abstractmethod
async def get_lora_tokenizer_async(
        self,
        lora_request: Optional[LoRARequest]) -> "PreTrainedTokenizer":
    """Get a tokenizer for a LoRA request."""
    pass
Are these methods called externally at all? If not, I would not put them in the base class.
they are called in LLMEngine
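For illustration, the two variants would be used along these lines (the helper functions and the tokenizer_group/lora_request parameters here are assumed; only the method names come from the diff above):

def _get_prompt_token_ids(tokenizer_group, prompt, lora_request=None):
    # Synchronous path, e.g. from LLMEngine.
    tokenizer = tokenizer_group.get_lora_tokenizer(lora_request)
    return tokenizer.encode(prompt)

async def _get_prompt_token_ids_async(tokenizer_group, prompt, lora_request=None):
    # Asynchronous path, e.g. from AsyncLLMEngine, where the event loop must not block.
    tokenizer = await tokenizer_group.get_lora_tokenizer_async(lora_request)
    return tokenizer.encode(prompt)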
If you can address the code style, that would be great. Automerge is enabled; once the tests pass (they should if you merge main), it will be merged.
LGTM
Very cool! Any benchmark for the improvement?
vllm-project#2879 added support for using Ray to offload tokenization from the asyncio event loop. This PR extends that to support using a thread pool instead of Ray, and makes that the default, with the default pool size determined by the number of available CPU cores and the tensor parallel size.

The main thing to note is that separate tokenizer instances are used per thread, because the HF tokenizers are not officially thread-safe. In practice I think they are, unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144).

Also includes some type hint additions to related parts of the code. This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
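As a rough sketch of the thread-pool idea described above (the class and method names here are made up for illustration, not the actual vLLM code), the two key points are offloading encode to an executor so the event loop stays free, and giving each worker thread its own tokenizer instance since HF tokenizers are not officially thread-safe:

import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

from transformers import AutoTokenizer


class ThreadedTokenizerSketch:

    def __init__(self, model_name: str, num_workers: int = 4):
        self._model_name = model_name
        self._local = threading.local()  # per-thread tokenizer storage
        self._executor = ThreadPoolExecutor(max_workers=num_workers)

    def _encode_sync(self, prompt: str) -> List[int]:
        # Lazily create a separate tokenizer for each worker thread.
        if not hasattr(self._local, "tokenizer"):
            self._local.tokenizer = AutoTokenizer.from_pretrained(
                self._model_name)
        return self._local.tokenizer.encode(prompt)

    async def encode_async(self, prompt: str) -> List[int]:
        # run_in_executor keeps the asyncio event loop responsive while a
        # worker thread tokenizes the prompt.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self._encode_sync,
                                          prompt)

A server coroutine would then call await pool.encode_async(prompt) without stalling other requests.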
Currently, vLLM tokenizes incoming requests synchronously inside the engine. This has a detrimental effect on serving with AsyncLLMEngine, as tokenization of long prompts will block the event loop, causing both token generation and request handling to slow down, especially in high-QPS scenarios.
This PR introduces an optional Ray-based TokenizerGroupPool that maintains a pool of Ray actors that perform the tokenization. Since tokenization now runs in separate processes, the event loop is not blocked (the Ray futures can simply be awaited). This removes the bottleneck described above.
Note that detokenization is not changed, as the serialization/deserialization overheads would be too great there. In the case of tokenization, they are negligible.
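To make the actor-pool idea concrete, here is a simplified sketch; the class names and the round-robin scheduling are illustrative rather than the actual TokenizerGroupPool implementation. The essential property is that tokenization happens in separate Ray actor processes and the returned object refs are awaitable, so the asyncio event loop is never blocked:

from itertools import cycle
from typing import List

import ray
from transformers import AutoTokenizer


@ray.remote
class TokenizerActor:
    """Runs in its own process and holds one tokenizer instance."""

    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(self, prompt: str) -> List[int]:
        return self.tokenizer.encode(prompt)


class RayTokenizerPoolSketch:

    def __init__(self, model_name: str, pool_size: int = 2):
        # Assumes ray.init() has already been called. Requests are spread
        # over a fixed pool of actors in round-robin order.
        self._actors = cycle(
            [TokenizerActor.remote(model_name) for _ in range(pool_size)])

    async def encode_async(self, prompt: str) -> List[int]:
        # Ray object refs can be awaited directly, so the event loop stays
        # free while the actor process tokenizes the prompt.
        return await next(self._actors).encode.remote(prompt)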