Tokenizer: Add native async bindings, via pyo3-async-runtimes. #1843
Conversation
Thanks for the PR, will have a look!
ArthurZucker
left a comment
Okay! I had to ask for help from @McPatate as I am not super super familiar with all this!
- This is def something we want to address: indeed if you have a big batch, or just one very long request, `tokenizers` will block the Python thread, which can be non-optimal
- Let's just add `async_encode` as well, to also showcase a good example of how we can do this in a non-batch manner, for example
- Can you detail the test a little bit with `long_batch` that would have longer text?
Otherwise happy to merge
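A minimal sketch of what that could look like, assuming the PR exposes methods named `encode_async` and `encode_batch_async` (the exact names are not confirmed in this thread; treat them as placeholders):

```python
# Sketch only: `encode_async` / `encode_batch_async` are assumed method names.
import asyncio
from tokenizers import Tokenizer

async def main():
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    # Non-batch case: a single very long request, awaited without blocking
    # the event loop.
    long_text = "lorem ipsum " * 50_000
    encoding = await tokenizer.encode_async(long_text)  # hypothetical
    print(len(encoding.ids))

    # long_batch case: many longer texts, encoded as one batch.
    long_batch = ["lorem ipsum " * 10_000 for _ in range(16)]
    encodings = await tokenizer.encode_batch_async(long_batch)  # hypothetical
    print(sum(len(e.ids) for e in encodings))

asyncio.run(main())
```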
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@michaelfeil I took the liberty to commit my updates as I want to release today!
thanks @michaelfeil
This PR adds native async bindings via pyo3-async-runtimes (https://github.com/PyO3/pyo3-async-runtimes), which gives access to:
Why is this relevant:
This is mostly relevant for online inference engines (vLLM, SGLang, TRT-LLM, ...) that have only one Python thread.
A common scenario is that a few users request very long inputs (e.g. 160k tokens), which typically take >0.5s to process.
A partial solution is to use the `encode_batch()` PyO3 API, which releases the GIL. But since the operation is still blocking, a single task would still starve the asyncio Python runtime, which has only one thread. Relief would come from using e.g. Ray workers or thread pools.
Quote: vLLM docs: https://docs.vllm.ai/en/v0.8.3/serving/openai_compatible_server.html
The Ray dependency is much heavier than e.g. PyO3. For a small project of mine (github.com/michaelfeil/infinity), it seems like overkill.
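For illustration, a minimal sketch of the blocking problem and the usual thread-pool workaround in an asyncio server (tokenizer choice and handler names are only for the example):

```python
# Sketch of the status quo: encode_batch releases the GIL, but still blocks
# the (single-threaded) asyncio event loop until it returns.
import asyncio
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

async def handle_request_blocking(texts: list[str]):
    # No await point: a ~0.5s encode of a 160k-token input stalls every other
    # coroutine (other requests, health checks) for that whole time.
    return tokenizer.encode_batch(texts)

async def handle_request_threadpool(texts: list[str]):
    # Workaround without native async bindings: offload the blocking call to
    # a thread pool (or a Ray worker), paying the extra dispatch overhead.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, tokenizer.encode_batch, texts)
```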
Summary:
How is it implemented
I use the same approach that has been tested with PyO3 here: https://github.com/basetenlabs/truss/tree/main/baseten-performance-client/python_bindings.
The pyo3-async runtime requires a Rust runtime to be initialized on a non-main thread. This is done via LazyInit.
https://github.com/PyO3/pyo3-async-runtimes
Performance:
Performance is okay-ish: probably better than the current pools, but worse than the sync bindings. If you have many threads available and can contend for the GIL without running the GPU in the same PID, it's probably faster.
Below is an example of what you need to do in async inference engines:
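(A sketch of that pattern; `encode_batch_async` is an assumed name for the method this PR adds, and the exact name may differ.)

```python
# Sketch of an async endpoint handler using the native async binding;
# `encode_batch_async` is an assumed method name.
import asyncio
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

async def tokenize(texts: list[str]) -> list[list[int]]:
    # The await yields back to the event loop while the Rust-side runtime
    # does the work, so concurrent requests keep being served.
    encodings = await tokenizer.encode_batch_async(texts)  # hypothetical
    return [e.ids for e in encodings]

async def main():
    ids = await tokenize(["hello world"] * 8)
    print(len(ids), len(ids[0]))

asyncio.run(main())
```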
#1797 cc @ArthurZucker