Description
PyO3 releases the GIL, which is great. Most LLM inference servers have many cores (>200) but are bottlenecked by the GIL.
Also, most servers are async, and Python's thread-based parallelism isn't great by nature.
Most tokenization code ends up looking something like this:
import asyncio
from typing import List

def _encode(self, prompt: str) -> List[int]:
    """Encode using the Rust tokenizer directly, while releasing the GIL."""
    return self.tokenizer.encode(prompt, add_special_tokens=True)

async def encode_prompt(self, prompt: str) -> List[int]:
    if len(prompt) > 2_000:
        # Offload to a thread so the Rust encode doesn't block the event loop.
        loop = asyncio.get_running_loop()
        tokenized = await loop.run_in_executor(self._threadpool, self._encode, prompt)
    else:
        tokenized = self._encode(prompt)
    return tokenized[1:]
Proposal: Add pyo3-async-runtimes as an async runtime option for encode/decode.
It's potentially worth it for every operation that takes >1 ms, or for every encode step.
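A minimal sketch of what this could look like on the Rust side, assuming pyo3-async-runtimes with its tokio runtime; the PyTokenizer wrapper and the encode_async name are illustrative, not the existing binding:

    // Sketch only: an awaitable encode exposed via pyo3-async-runtimes + tokio.
    // `PyTokenizer` and `encode_async` are hypothetical names, not the current API.
    use std::sync::Arc;

    use pyo3::exceptions::PyRuntimeError;
    use pyo3::prelude::*;
    use pyo3_async_runtimes::tokio::future_into_py;
    use tokenizers::Tokenizer;

    #[pyclass]
    struct PyTokenizer {
        inner: Arc<Tokenizer>,
    }

    #[pymethods]
    impl PyTokenizer {
        /// Returns an asyncio awaitable; the encode runs on tokio's blocking pool,
        /// so neither the GIL nor the Python event loop is held while it works.
        fn encode_async<'py>(&self, py: Python<'py>, prompt: String) -> PyResult<Bound<'py, PyAny>> {
            let tokenizer = Arc::clone(&self.inner);
            future_into_py(py, async move {
                let encoding = tokio::task::spawn_blocking(move || tokenizer.encode(prompt, true))
                    .await
                    .map_err(|e| PyRuntimeError::new_err(e.to_string()))?
                    .map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
                Ok(encoding.get_ids().to_vec())
            })
        }
    }

From Python this would then just be ids = await tokenizer.encode_async(prompt), with no threadpool indirection as in the snippet above.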
Similar async vs sync usage:
https://github.com/basetenlabs/truss/blob/0816876a474b0c4910eaa3f869ed4c685f7a7570/baseten-performance-client/src/lib.rs#L659C1-L760C20
Also, sglang, for example, uses this pattern and an async encode could be plugged in there directly: https://github.com/sgl-project/sglang/blob/777688b8929c877e4e28c2eac208d776abe4c3af/python/sglang/srt/managers/tokenizer_manager.py#L454