Tokenizer: Add native async bindings, via pyo3-async-runtimes. #1843
Conversation
Thanks for the PR, will have a look!
ArthurZucker
left a comment
Okay! I had to ask for help from @McPatate as I am not super super familiar with all this!
- This is def something we want to address: indeed if you have a big batch, or just one very long request, `tokenizers` will block the Python thread, which can be non-optimal
- Let's just add `async_encode` as well, to also showcase a good example of how we can do this in a non-batch manner, for example
- Can you detail the test a little bit with `long_batch` that would have longer text?
Otherwise happy to merge
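A minimal sketch of what that could look like, assuming the PR exposes methods named `encode_async` and `encode_batch_async` (the exact names are not confirmed in this thread; treat them as placeholders):

```python
# Sketch only: `encode_async` / `encode_batch_async` are assumed method names.
import asyncio
from tokenizers import Tokenizer

async def main():
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    # Non-batch case: a single very long request, awaited without blocking
    # the event loop.
    long_text = "lorem ipsum " * 50_000
    encoding = await tokenizer.encode_async(long_text)  # hypothetical
    print(len(encoding.ids))

    # long_batch case: many longer texts, encoded as one batch.
    long_batch = ["lorem ipsum " * 10_000 for _ in range(16)]
    encodings = await tokenizer.encode_batch_async(long_batch)  # hypothetical
    print(sum(len(e.ids) for e in encodings))

asyncio.run(main())
```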
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@michaelfeil I took the liberty to commit my updates as I want to release today!
thanks @michaelfeil
This PR adds native async bindings via pyo3-async-runtimes (https://github.com/PyO3/pyo3-async-runtimes), which gives access to:
Why is this relevant:
This is mostly relevant for online inference engines (vLLM, SGLang, TRT-LLM, ...) that have only one Python thread.
A common scenario is that a few users request very long inputs (e.g. 160k tokens), which typically take >0.5s to process.
A partial solution is to use the `encode_batch()` PyO3 API, which releases the GIL. But since the operation is still blocking, a single task would still starve the asyncio Python runtime, which has only one thread. Relief would come from using e.g. Ray workers or thread pools.
Quote: vLLM docs: https://docs.vllm.ai/en/v0.8.3/serving/openai_compatible_server.html
The Ray dependency is much heavier than e.g. PyO3. For a small project of mine (github.com/michaelfeil/infinity), it seems like overkill.
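For illustration, a minimal sketch of the blocking problem and the usual thread-pool workaround in an asyncio server (tokenizer choice and handler names are only for the example):

```python
# Sketch of the status quo: encode_batch releases the GIL, but still blocks
# the (single-threaded) asyncio event loop until it returns.
import asyncio
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

async def handle_request_blocking(texts: list[str]):
    # No await point: a ~0.5s encode of a 160k-token input stalls every other
    # coroutine (other requests, health checks) for that whole time.
    return tokenizer.encode_batch(texts)

async def handle_request_threadpool(texts: list[str]):
    # Workaround without native async bindings: offload the blocking call to
    # a thread pool (or a Ray worker), paying the extra dispatch overhead.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, tokenizer.encode_batch, texts)
```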
Summary:
How is it implemented
I use the same approach that has been tested with PyO3 here: https://github.com/basetenlabs/truss/tree/main/baseten-performance-client/python_bindings.
The pyo3-async runtime requires a Rust runtime to be initialized on a non-main thread. This is done via LazyInit.
https://github.com/PyO3/pyo3-async-runtimes
Performance:
Performance is okay-ish: probably better than the current pools, but worse than the sync bindings. If you have many threads available and can contend for the GIL without running the GPU in the same PID, it's probably faster.
Below is an example of what you need to do in async inference engines:
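(A sketch of that pattern; `encode_batch_async` is an assumed name for the method this PR adds, and the exact name may differ.)

```python
# Sketch of an async endpoint handler using the native async binding;
# `encode_batch_async` is an assumed method name.
import asyncio
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

async def tokenize(texts: list[str]) -> list[list[int]]:
    # The await yields back to the event loop while the Rust-side runtime
    # does the work, so concurrent requests keep being served.
    encodings = await tokenizer.encode_batch_async(texts)  # hypothetical
    return [e.ids for e in encodings]

async def main():
    ids = await tokenize(["hello world"] * 8)
    print(len(ids), len(ids[0]))

asyncio.run(main())
```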
#1797 cc @ArthurZucker