Asynchronous tokenization #2879
Conversation
@Yard1 did you consider using a
@njhill I agree that a thread-based solution should work in principle for the most popular models - it would be good to confirm that, though. Would you be interested in trying it out using the API here?
@Yard1 sure!
@njhill @cadedaniel updated, ptal
LGTM, one design comment and some nits
vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py
Small nits, looks great!
vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py
some nits around code style
if not lora_request or not self.enable_lora:
    return self.tokenizer
Readability-wise, it would be helpful to move these up into the corresponding encode function.
Those are not private methods; I think it makes sense to keep this logic here as it's relevant.
if not lora_request or not self.enable_lora:
    return self.tokenizer
same
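To make the pattern under discussion concrete, here is a minimal sketch of keeping the LoRA fallback inside the tokenizer lookup rather than in encode() itself; the class and attribute names are illustrative and not the exact vLLM implementation:

from typing import Dict, List


class TokenizerGroupSketch:
    """Hypothetical tokenizer group mirroring the shape of the code under review."""

    def __init__(self, tokenizer, enable_lora: bool):
        self.tokenizer = tokenizer
        self.enable_lora = enable_lora
        # Illustrative cache: lora_int_id -> adapter-specific tokenizer.
        self.lora_tokenizers: Dict[int, object] = {}

    def get_lora_tokenizer(self, lora_request=None):
        # Fall back to the base tokenizer when LoRA is not involved.
        if not lora_request or not self.enable_lora:
            return self.tokenizer
        # Otherwise return the adapter-specific tokenizer (or the base one
        # if none has been loaded for this adapter yet).
        return self.lora_tokenizers.get(lora_request.lora_int_id,
                                        self.tokenizer)

    def encode(self, prompt: str, lora_request=None) -> List[int]:
        # encode() stays thin; the LoRA branching lives in the lookup above.
        return self.get_lora_tokenizer(lora_request).encode(prompt)

The suggestion above would instead hoist the check to the top of encode(); keeping it in the lookup means every caller of get_lora_tokenizer gets the same fallback behaviour.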
@abstractmethod
def get_lora_tokenizer(
        self,
        lora_request: Optional[LoRARequest]) -> "PreTrainedTokenizer":
    """Get a tokenizer for a LoRA request."""
    pass

@abstractmethod
async def get_lora_tokenizer_async(
        self,
        lora_request: Optional[LoRARequest]) -> "PreTrainedTokenizer":
    """Get a tokenizer for a LoRA request."""
    pass
Are these methods called externally at all? If not, I would not put them in the base class.
they are called in LLMEngine
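For illustration, the two variants would be used along these lines (the helper functions and the tokenizer_group/lora_request parameters here are assumed; only the method names come from the diff above):

def _get_prompt_token_ids(tokenizer_group, prompt, lora_request=None):
    # Synchronous path, e.g. from LLMEngine.
    tokenizer = tokenizer_group.get_lora_tokenizer(lora_request)
    return tokenizer.encode(prompt)

async def _get_prompt_token_ids_async(tokenizer_group, prompt, lora_request=None):
    # Asynchronous path, e.g. from AsyncLLMEngine, where the event loop must not block.
    tokenizer = await tokenizer_group.get_lora_tokenizer_async(lora_request)
    return tokenizer.encode(prompt)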
If you can address the code style, that would be great. Automerge is enabled; once the tests pass (they should if you merge main), it will be merged.
LGTM
Very cool! Any benchmark for the improvement?
vllm-project#2879 added support for using Ray to offload tokenization from the asyncio event loop. This PR extends that to support using a thread pool instead of Ray, and makes that the default, with the default pool size determined by the number of available CPU cores and the tensor parallel size.

The main thing to note is that separate tokenizer instances are used per thread, because the HF tokenizers are not officially thread-safe. In practice I think they are, unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144).

Also includes some type hint additions to related parts of the code. This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
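As a rough sketch of the thread-pool idea described above (the class and method names here are made up for illustration, not the actual vLLM code), the two key points are offloading encode to an executor so the event loop stays free, and giving each worker thread its own tokenizer instance since HF tokenizers are not officially thread-safe:

import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

from transformers import AutoTokenizer


class ThreadedTokenizerSketch:

    def __init__(self, model_name: str, num_workers: int = 4):
        self._model_name = model_name
        self._local = threading.local()  # per-thread tokenizer storage
        self._executor = ThreadPoolExecutor(max_workers=num_workers)

    def _encode_sync(self, prompt: str) -> List[int]:
        # Lazily create a separate tokenizer for each worker thread.
        if not hasattr(self._local, "tokenizer"):
            self._local.tokenizer = AutoTokenizer.from_pretrained(
                self._model_name)
        return self._local.tokenizer.encode(prompt)

    async def encode_async(self, prompt: str) -> List[int]:
        # run_in_executor keeps the asyncio event loop responsive while a
        # worker thread tokenizes the prompt.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self._encode_sync,
                                          prompt)

A server coroutine would then call await pool.encode_async(prompt) without stalling other requests.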
Currently, vLLM tokenizes incoming requests synchronously inside the engine. This has a detrimental effect on serving with AsyncLLMEngine, as tokenization of long prompts will block the event loop, causing both token generation and request handling to slow down, especially in high-QPS scenarios.
This PR introduces an optional Ray-based TokenizerGroupPool that maintains a pool of Ray actors that perform the tokenization. Since tokenization now runs in separate processes, the event loop is not blocked (the Ray futures can simply be awaited). This removes the bottleneck described above.
Note that detokenization is not changed, as the serialization/deserialization overheads would be too great there. In the case of tokenization, they are negligible.
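To make the actor-pool idea concrete, here is a simplified sketch; the class names and the round-robin scheduling are illustrative rather than the actual TokenizerGroupPool implementation. The essential property is that tokenization happens in separate Ray actor processes and the returned object refs are awaitable, so the asyncio event loop is never blocked:

from itertools import cycle
from typing import List

import ray
from transformers import AutoTokenizer


@ray.remote
class TokenizerActor:
    """Runs in its own process and holds one tokenizer instance."""

    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(self, prompt: str) -> List[int]:
        return self.tokenizer.encode(prompt)


class RayTokenizerPoolSketch:

    def __init__(self, model_name: str, pool_size: int = 2):
        # Assumes ray.init() has already been called. Requests are spread
        # over a fixed pool of actors in round-robin order.
        self._actors = cycle(
            [TokenizerActor.remote(model_name) for _ in range(pool_size)])

    async def encode_async(self, prompt: str) -> List[int]:
        # Ray object refs can be awaited directly, so the event loop stays
        # free while the actor process tokenizes the prompt.
        return await next(self._actors).encode.remote(prompt)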