Conversation

@michaelfeil (Contributor) commented Aug 10, 2025

This PR adds native async support via pyo3-async-runtimes bindings (https://github.com/PyO3/pyo3-async-runtimes):

pip install tokenizers

gives access to:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("gpt2") 

out = await tokenizer.async_encode("Hey how are you?")

Why is this relevant:

This is mostly relevant for online inference servers (vLLM, SGLang, TRT-LLM, ...) that run a single Python thread. A common scenario is that a few users send very long inputs (e.g. 160k tokens), which typically take >0.5s to tokenize.

One option is the encode_batch() pyo3 API, which releases the GIL. Since the call itself is still blocking, a single such task will still starve the asyncio runtime, which has only one thread. Relief today comes from e.g. ray workers or thread pools.

Quote: vLLM docs: https://docs.vllm.ai/en/v0.8.3/serving/openai_compatible_server.html

--tokenizer-pool-size
Size of tokenizer pool to use for asynchronous tokenization. If 0, will use synchronous tokenization.

Default: 0

--tokenizer-pool-type
Type of tokenizer pool to use for asynchronous tokenization. Ignored if tokenizer_pool_size is 0.

Default: “ray”

--tokenizer-pool-extra-config
Extra config for tokenizer pool. This should be a JSON string that will be parsed into a dictionary. Ignored if tokenizer_pool_size is 0.

The ray dependency is much heavier than pyo3. For a small project of mine (github.com/michaelfeil/infinity), it seems like overkill.

Summary:

  • benefits: async training
  • benefits: inference server
  • not interesting for: pretraining, dataset preprocessing, torch dataloader etc.

How is it implemented:

I use the same approach that has already been tested with pyo3 here: https://github.com/basetenlabs/truss/tree/main/baseten-performance-client/python_bindings.
The pyo3-async runtime requires a Rust runtime initialized on a non-main thread; this is done via lazy initialization.
https://github.com/PyO3/pyo3-async-runtimes

Performance:

Performance is okay-ish: probably better than the current tokenizer pools, but worse than the sync bindings. If you have many threads available and can contend for the GIL without running the GPU in the same process, it's probably faster.

Below is an example of what you currently need to do in async inference engines:

await loop.run_in_executor(
    executor,
    lambda: self.tokenizer.encode_batch_fast(large_batch)
)
The benchmark comparing the thread-pool workaround against the new async bindings:

# Runs inside an async function; asyncio, concurrent.futures, and time are imported,
# and self.tokenizer / large_batch are defined elsewhere in the benchmark script.
results_sync, results_async = [], []
try:
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=2048)
    loop = asyncio.get_running_loop()

    async def encode_sync_with_executor(_):
        # Use the pre-initialized executor
        return await loop.run_in_executor(
            executor,
            lambda: self.tokenizer.encode_batch_fast(large_batch)
        )

    async def encode_to_thread_sync(_):
        return await asyncio.to_thread(
            self.tokenizer.encode_batch_fast, large_batch
        )

    async def encode_async(_):
        return await self.tokenizer.async_encode_batch_fast(large_batch)

    # Warm up both paths once
    await asyncio.gather(*[encode_sync_with_executor(i) for i in range(2048)])
    await asyncio.gather(*[encode_async(i) for i in range(2048)])

    for n_tasks in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]:
        # Measure sync performance with the pre-initialized executor
        # Warm up
        await asyncio.gather(*[encode_sync_with_executor(i) for i in range(10)])
        time.sleep(0.03)
        # Actual measurement
        start = time.perf_counter()
        await asyncio.gather(*[encode_sync_with_executor(i) for i in range(n_tasks)])
        sync_time = time.perf_counter() - start

        # Measure async performance
        # Warm up
        await asyncio.gather(*[encode_async(i) for i in range(10)])

        # Actual measurement
        time.sleep(0.03)
        start = time.perf_counter()
        await asyncio.gather(*[encode_async(i) for i in range(n_tasks)])
        async_time = time.perf_counter() - start

        # Log times
        print(f"sync vs async processing times: {sync_time:.4f}s vs {async_time:.4f}s for {n_tasks} tasks")
        results_sync.append(sync_time)
        results_async.append(async_time)
finally:
    # Make sure we shut down the executor properly
    executor.shutdown(wait=False)

Output:
sync vs async processing times: 0.0003s vs 0.0004s for 1 tasks
sync vs async processing times: 0.0004s vs 0.0005s for 2 tasks
sync vs async processing times: 0.0004s vs 0.0007s for 4 tasks
sync vs async processing times: 0.0014s vs 0.0011s for 8 tasks
sync vs async processing times: 0.0028s vs 0.0022s for 16 tasks
sync vs async processing times: 0.0057s vs 0.0040s for 32 tasks
sync vs async processing times: 0.0097s vs 0.0065s for 64 tasks
sync vs async processing times: 0.0172s vs 0.0127s for 128 tasks
sync vs async processing times: 0.0324s vs 0.0293s for 256 tasks
sync vs async processing times: 0.0528s vs 0.0487s for 512 tasks
sync vs async processing times: 0.1549s vs 0.0960s for 1024 tasks
sync vs async processing times: 0.3044s vs 0.1846s for 2048 tasks

#1797 cc @ArthurZucker

@ArthurZucker (Collaborator) commented:

Thanks for the PR, will have a look!

@ArthurZucker (Collaborator) left a comment:

Okay! I had to ask for help from @McPatate as I am not super familiar with all this!

  1. This is definitely something we want to address: indeed, if you have a big batch, or just one very long request, tokenizers will block the Python thread, which can be non-optimal.
  2. Let's also add async_encode, to showcase a good example of how we can do this in a non-batch manner.
  3. Can you detail the test a little, with a long_batch that would have longer text?

Otherwise happy to merge

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) commented:

@michaelfeil I took the liberty to commit my updates as I want to release today!

@ArthurZucker force-pushed the mf/add-async-tokenizer-bindings branch from 38045b6 to c4eb850 on August 29, 2025 09:02
@ArthurZucker merged commit bd1149c into huggingface:main on Aug 29, 2025
27 checks passed
@ArthurZucker (Collaborator) commented:

thanks @michaelfeil
