Description
PyO3 releases the GIL, which is great. Most LLM inference servers have many cores (>200) but are bottlenecked by the GIL.
Also, most servers are async, and Python's thread-based parallelism isn't great by nature.
Most tokenization code ends up looking something like this:
import asyncio
from typing import List

def _encode(self, prompt: str) -> List[int]:
    """Encode using the Rust tokenizer directly, while releasing the GIL."""
    return self.tokenizer.encode(prompt, add_special_tokens=True)

async def encode_prompt(self, prompt: str) -> List[int]:
    if len(prompt) > 2_000:
        # Offload to a thread so the Rust encode doesn't block the event loop.
        loop = asyncio.get_running_loop()
        tokenized = await loop.run_in_executor(self._threadpool, self._encode, prompt)
    else:
        tokenized = self._encode(prompt)
    return tokenized[1:]
Proposal: Add pyo3-async-runtimes as an async runtime option for encode/decode.
It's potentially worth it for every operation that takes >1 ms, or for every encode step.
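A minimal sketch of what this could look like on the Rust side, assuming pyo3-async-runtimes with its tokio runtime; the PyTokenizer wrapper and the encode_async name are illustrative, not the existing binding:

    // Sketch only: an awaitable encode exposed via pyo3-async-runtimes + tokio.
    // `PyTokenizer` and `encode_async` are hypothetical names, not the current API.
    use std::sync::Arc;

    use pyo3::exceptions::PyRuntimeError;
    use pyo3::prelude::*;
    use pyo3_async_runtimes::tokio::future_into_py;
    use tokenizers::Tokenizer;

    #[pyclass]
    struct PyTokenizer {
        inner: Arc<Tokenizer>,
    }

    #[pymethods]
    impl PyTokenizer {
        /// Returns an asyncio awaitable; the encode runs on tokio's blocking pool,
        /// so neither the GIL nor the Python event loop is held while it works.
        fn encode_async<'py>(&self, py: Python<'py>, prompt: String) -> PyResult<Bound<'py, PyAny>> {
            let tokenizer = Arc::clone(&self.inner);
            future_into_py(py, async move {
                let encoding = tokio::task::spawn_blocking(move || tokenizer.encode(prompt, true))
                    .await
                    .map_err(|e| PyRuntimeError::new_err(e.to_string()))?
                    .map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
                Ok(encoding.get_ids().to_vec())
            })
        }
    }

From Python this would then just be ids = await tokenizer.encode_async(prompt), with no threadpool indirection as in the snippet above.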
Similar async vs sync usage:
https://github.com/basetenlabs/truss/blob/0816876a474b0c4910eaa3f869ed4c685f7a7570/baseten-performance-client/src/lib.rs#L659C1-L760C20
Also, sglang, for example, uses this pattern and an async encode could be plugged in there directly: https://github.com/sgl-project/sglang/blob/777688b8929c877e4e28c2eac208d776abe4c3af/python/sglang/srt/managers/tokenizer_manager.py#L454