
PyO3 async bindings for encode/decode in Rust #1797

@michaelfeil

Description


PyO3 releases the GIL, which is great. Most LLM inference servers run on many cores (>200) but are blocked by the GIL.
Also, most servers are async, and by nature Python thread-based parallelism isn't great.

Most tokenization code looks something like this:

    def _encode(self, prompt: str) -> List[int]:
        """Encode using the Rust tokenizer directly, while releasing the GIL."""
        return self.tokenizer.encode(prompt, add_special_tokens=True)

    async def encode_prompt(self, prompt: str) -> List[int]:
        if len(prompt) > 2_000:
            # offload to a thread to avoid blocking the event loop
            loop = asyncio.get_running_loop()
            tokenized = await loop.run_in_executor(self._threadpool, self._encode, prompt)
        else:
            tokenized = self._encode(prompt)
        return tokenized[1:]

Proposal: add pyo3-async-runtimes as an async runtime option for encode/decode.
It's potentially worth it for every operation that takes >1 ms, or for every encode step.
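For context, the thread-pool workaround above can be made self-contained roughly like this. `blocking_encode` is a hypothetical stand-in for a GIL-releasing Rust-side call (a real `tokenizers.Tokenizer.encode` releases the GIL, which is what makes thread offloading useful at all); the sleep is only simulating >1 ms of work:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_encode(prompt: str) -> list:
    # Stand-in for a GIL-releasing Rust tokenizer call; the sleep
    # simulates >1 ms of work that would otherwise block the loop.
    time.sleep(0.01)
    return [ord(c) for c in prompt]

async def encode_many(prompts: list) -> list:
    # Offload each encode to a worker thread so the event loop
    # keeps serving other coroutines while tokenization runs.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [loop.run_in_executor(pool, blocking_encode, p) for p in prompts]
        return await asyncio.gather(*futures)

results = asyncio.run(encode_many(["hello", "world"]))
print(results[0][:2])  # [104, 101]
```

This works, but it costs a thread hop per request; a native async binding would let the event loop await the Rust future directly.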

Similar async vs sync usage:
https://github.com/basetenlabs/truss/blob/0816876a474b0c4910eaa3f869ed4c685f7a7570/baseten-performance-client/src/lib.rs#L659C1-L760C20
Also, sglang for example uses this primitive and could be plugged in directly there: https://github.com/sgl-project/sglang/blob/777688b8929c877e4e28c2eac208d776abe4c3af/python/sglang/srt/managers/tokenizer_manager.py#L454
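From the Python side, the proposed binding could look something like this sketch. The `encode_async` name and the stub `Tokenizer` class are hypothetical, not an existing API; a real implementation would await a Rust future via pyo3-async-runtimes instead of hopping through a Python thread:

```python
import asyncio

class Tokenizer:
    """Stub standing in for the Rust-backed tokenizer; only the API shape matters."""

    def encode(self, prompt: str, add_special_tokens: bool = True) -> list:
        # Stand-in for the existing synchronous Rust encode.
        return [ord(c) for c in prompt]

    async def encode_async(self, prompt: str, add_special_tokens: bool = True) -> list:
        # Hypothetical: a real binding would return an awaitable backed by a
        # Rust future; here a worker thread merely simulates that behavior.
        return await asyncio.to_thread(self.encode, prompt, add_special_tokens)

async def main() -> list:
    tok = Tokenizer()
    # The event loop is never blocked, even for large prompts.
    return await tok.encode_async("hi")

print(asyncio.run(main()))  # [104, 105]
```

The win over the `run_in_executor` pattern is that callers get a plain awaitable with no thread-pool management on the Python side.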
