feat: add vLLM remote tokenizer with engine integration #1328

ae86zhizhi · 2025-07-27T19:39:41Z

feature: Add vLLM Remote Tokenizer with Engine Integration

Summary

This PR introduces support for vLLM remote tokenizers that can leverage the tokenization capabilities directly from vLLM engine instances. This feature enables more
accurate tokenization by using the same tokenizer as the serving engine, ensuring consistency between token counting and actual model processing.

Motivation

Tokenizer Consistency: Using the same tokenizer as the vLLM engine ensures accurate token counting for routing decisions
Model-Specific Tokenization: Different models may use different tokenizers - this feature automatically uses the correct tokenizer for each model
Dynamic Scaling: Remote tokenizers can scale with the vLLM instances, eliminating the need to maintain separate tokenizer deployments

What's Changed

Added TokenizerPool for managing model-specific remote tokenizers with health checking and connection pooling
Integrated vLLM HTTP tokenization endpoints into the prefix cache routing algorithm
Added configuration options for remote tokenizer behavior
Implemented fallback mechanism to maintain backward compatibility
Added utility functions LoadEnvDuration and LoadEnvBool for configuration parsing

How to Enable

To enable the vLLM remote tokenizer feature, set the following environment variable:

AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER=true

Configuration Options

| Environment Variable                    | Default        | Description                                                            |
|-----------------------------------------|----------------|------------------------------------------------------------------------|
| AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER     | false          | Enable/disable vLLM remote tokenizer feature                           |
| AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE | http://%s:8000 | URL template for vLLM tokenizer endpoints (%s is replaced with pod IP) |
| AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD    | 30s            | How often to check tokenizer health                                    |
| AIBRIX_TOKENIZER_TTL                    | 5m             | Time-to-live for tokenizer connections                                 |
| AIBRIX_MAX_TOKENIZERS_PER_POOL          | 100            | Maximum number of tokenizers to maintain in the pool                   |
| AIBRIX_TOKENIZER_REQUEST_TIMEOUT        | 10s            | Timeout for tokenization requests                                      |

Usage Example

1. Deploy with the feature enabled using Kustomize:
kubectl apply -k config/features/vllm-remote-tokenizer

2. Or manually set the environment variable in your gateway deployment:
env:
  - name: AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER
    value: "true"
  - name: AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE
    value: "http://%s:8000"  # Adjust if your vLLM uses a different port

Backward Compatibility

- When AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER=false (default), the system uses the original local tokenizer
- Existing deployments will continue to work without any changes
- The feature gracefully falls back to local tokenization if remote tokenizers are unavailable

Testing

- Tested with vLLM 0.5.x and 0.6.x
- Verified fallback behavior when vLLM endpoints are unavailable
- Load tested with multiple concurrent requests across different models
- Confirmed proper health checking and connection recycling

gemini-code-assist

Summary of Changes

Hello @ae86zhizhi, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AIBrix gateway's tokenization capabilities by integrating a remote tokenizer system that leverages vLLM engine instances. The primary goal is to ensure tokenization consistency with the serving models, which is crucial for accurate routing decisions and model-specific processing. This feature introduces a robust TokenizerPool for managing remote connections, complete with health checks and dynamic discovery of vLLM endpoints, while maintaining backward compatibility through a fallback mechanism.

Highlights

Remote Tokenizer Integration: I've implemented support for vLLM remote tokenizers, enabling the system to use the same tokenizer as the serving engine for improved consistency and accuracy in token counting, which is crucial for routing decisions.
Dynamic Tokenizer Management: I've introduced a TokenizerPool to dynamically manage model-specific remote tokenizers. This includes connection pooling, active health checking, and automatic cleanup of stale connections to ensure efficient resource utilization.
Configurable Behavior: I've added new environment variables that allow for fine-grained control over the remote tokenizer feature, such as enabling/disabling it, specifying the vLLM endpoint template, configuring health check periods, and setting request timeouts.
Robust Fallback Mechanism: To ensure backward compatibility and system resilience, I've implemented a graceful fallback to local tokenizers if remote tokenizers are disabled, unavailable, or become unhealthy, preventing service disruption.
Kubernetes Deployment Support: I've provided Kustomize configurations to simplify the deployment and enablement of the remote tokenizer feature within a Kubernetes environment, making it easy to integrate into existing setups.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a vLLM remote tokenizer feature, which is a great addition for ensuring tokenizer consistency. The implementation includes a TokenizerPool for managing remote tokenizers, complete with configuration, health checking, and metrics. The code is well-structured, but I've identified a few critical and high-severity issues related to concurrency and potential deadlocks that should be addressed. Specifically, there's a data race in GetTokenizer and a risk of long-held locks in cleanup and shutdown routines. I've also suggested an efficiency improvement to avoid unnecessary data copying.

pkg/plugins/gateway/algorithms/tokenizer_pool.go

pkg/plugins/gateway/algorithms/prefix_cache.go

pkg/plugins/gateway/algorithms/tokenizer_pool.go

Jeffwan · 2025-07-28T23:13:16Z

Tested with vLLM 0.5.x and 0.6.x

the version is too low. v0.10.0 is out.. but I believe the tokenization interface doesn't have any change in latest version but it's always good to test with latest version

Jeffwan · 2025-07-28T23:14:49Z

Seems the feature is protected by the environment flag. It should be safe to merge the code after comment have been addressed. We can leave the test to later phase.

Jeffwan · 2025-07-28T23:16:36Z

AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE | http://%s:8000

this is a little bit tricky, it's hard to specify an additional model endpoint. Technically, cache can detect all the pods and it can randomly pick up an endpoint and get response. the /tokenizer endpoint is exact same as /chat/completion or /completion, I do think this design part needs to be refactored.

Due to limited time, we can merge it first and then make the changes

pkg/plugins/gateway/algorithms/prefix_cache.go

Jeffwan · 2025-07-28T23:22:06Z

pkg/plugins/gateway/algorithms/prefix_cache.go

-		klog.Error("fail to get cache store in prefix cache router")
-		return nil, err
+	// Initialize TokenizerPool for vLLM remote tokenizer support
+	poolConfig := TokenizerPoolConfig{


actually, we should leverage cache information to populate the same model pod endpoint and orchestrate the request

Jeffwan · 2025-07-28T23:23:53Z

pkg/plugins/gateway/algorithms/prefix_cache.go

+	// Initialize TokenizerPool for vLLM remote tokenizer support
+	poolConfig := TokenizerPoolConfig{
+		EnableVLLMRemote:     enableVLLMRemoteTokenizer,
+		EndpointTemplate:     vllmTokenizerEndpointTemplate,


we probably need to add model -> engine mapping from model.aibrix.ai/engine: "vllm" in the cache so consumer can easily distinguish engine

/cc @happyandslow @varungup90

ae86zhizhi · 2025-07-31T01:12:39Z

PR Summary: Remote vLLM Tokenizer Integration Review Fixes

Completed Review Comments

Default Value Inconsistency (Internal review)

Changed AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER default from true to false to align with production-readiness expectations
Updated documentation and ensured consistency across codebase

Race Condition Fix (@DwyaneShi: "what if the tokenizer is cleaned up in the between?")

Replaced read lock with write lock to prevent concurrent tokenizer creation
Added TODO comment for future optimization using double-checked locking pattern

Lock Duration Optimization (@DwyaneShi: "holding a lock for such a long time is not suggested")

Implemented double-checked locking pattern to move 5-second health checks outside of lock
Reduced lock contention from seconds to microseconds, significantly improving performance

time.Now() Optimization (@DwyaneShi: "time.Now() could be time consuming on some env")

Cached time.Now() results when setting multiple timestamp fields to ensure consistency
Added benchmark tests showing ~50% performance improvement

Label Constants (@DwyaneShi: "better to define constants for 'aibrix.ai/model'")

Created centralized pkg/apis/constants/labels.go package for all label definitions
Implemented backward compatibility helpers supporting both old (aibrix.ai/*) and new (model.aibrix.ai/*) formats
Replaced hardcoded strings throughout the codebase

Deferred for Future PRs

vLLM Version Compatibility (@Jeffwan: "the version is too low. v0.10.0 is out")

Will expand test matrix to include multiple vLLM versions for compatibility testing
Important for ensuring the tokenizer works across different vLLM releases

Endpoint Template Design (@Jeffwan: "this is a little bit tricky")

Technical debt to make endpoint configuration more flexible and maintainable
Will support custom ports and protocols beyond the current hardcoded format

Cache-based Endpoint Orchestration (Enhancement opportunity)

Will utilize the existing cache system for smarter endpoint selection and load balancing
Improves performance by avoiding redundant service discovery

Model-to-Engine Mapping (Enhancement opportunity)

Will implement configuration for mapping specific models to preferred inference engines
Provides flexibility for heterogeneous deployments with multiple engine types

All completed fixes have been thoroughly tested with unit tests and integration tests passing. The code follows project conventions and maintains backward compatibility where needed.

pkg/apis/constants/labels.go

Add support for using vLLM's remote tokenizer endpoint to enable tokenization without loading models in gateway plugins. This feature allows the gateway to delegate tokenization to vLLM engine instances, reducing memory usage and improving scalability. ## Key Features - Integrate vLLM's /tokenize endpoint for remote tokenization - Implement TokenizerPool for managing per-model tokenizer connections - Support health checking and automatic failover to local tokenizer - Add caching and connection pooling for performance - Support both vLLM and other inference engines through pod label detection ## Implementation Details - New remote tokenizer client with retry logic and timeout handling - TokenizerPool with concurrent access support and automatic cleanup - Health monitoring with 5-second timeout for tokenizer endpoints - Fallback to local character tokenizer when remote unavailable - Prometheus metrics for monitoring tokenizer pool status ## Configuration - AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER: Feature flag (default: false) - AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE: Endpoint format (default: "http://%s:8000") - AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD: Health check interval (default: 30s) - AIBRIX_TOKENIZER_TTL: Unused tokenizer cleanup time (default: 5m) - AIBRIX_MAX_TOKENIZERS_PER_POOL: Pool size limit (default: 100) ## Review Feedback Addressed - Changed default to disabled for production safety - Fixed race conditions in concurrent access - Optimized lock contention with double-checked locking - Added comprehensive test coverage including benchmarks - Created centralized constants package for Kubernetes labels Tested with vLLM v0.4.0 and includes backward compatibility support. Co-authored-by: DwyaneShi <dyshi@microsoft.com> Co-authored-by: Jeffwan <jeffwan@amazon.com> Signed-off-by: ae86zhizhi <550149470@qq.com>

autopear · 2025-07-31T03:21:16Z

pkg/apis/constants/labels.go

@@ -0,0 +1,54 @@
+/*
+Copyright 2024 The Aibrix Team.


All new files' copyright should be 2025

autopear · 2025-07-31T03:24:23Z

pkg/cache/cache_metrics.go

@@ -21,6 +21,7 @@ import (

 	prometheusv1 "github.com/prometheus/client_golang/api/prometheus/v1"
 	dto "github.com/prometheus/client_model/go"
+	"github.com/vllm-project/aibrix/pkg/apis/constants"


这个没必要放在 apis/constants 里面吧，直接 pkg/defines/lablels.go

autopear · 2025-07-31T03:27:33Z

pkg/plugins/gateway/algorithms/prefix_cache.go

@@ -49,41 +58,60 @@ func init() {

 type prefixCacheRouter struct {
 	cache              cache.Cache
-	tokenizer          tokenizer.Tokenizer
+	tokenizer          tokenizer.Tokenizer // Fallback tokenizer for backward compatibility


A better name would be Default tokenizer instead of Fallback

autopear · 2025-07-31T03:30:11Z

pkg/plugins/gateway/algorithms/tokenizer_pool.go

+
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/promauto"
+	"github.com/vllm-project/aibrix/pkg/apis/constants"


Is github.com/vllm-project/aibrix/pkg required here? What if someone forks this repo and use it internally without accessing to github?

autopear · 2025-07-31T03:32:35Z

pkg/plugins/gateway/algorithms/tokenizer_pool.go

+	// Acquire write lock directly to avoid race condition
+	// TODO: Consider implementing reference counting or double-checked locking
+	// to improve concurrency performance while maintaining thread safety
+	p.mu.Lock()


use defer p.mu.Unlock() instead of explicit p.mu.Unlock() whenever possible

Jeffwan · 2025-07-31T04:43:23Z

@ae86zhizhi please address @autopear's PR in follow up PR. Due to urgent release timeline, I will merge this one first

Signed-off-by: Qizhong Mao <qizhong.mao@bytedance.com>

…-project#1328)" This reverts commit b0eebc1.

…-project#1328)" This reverts commit b0eebc1. Signed-off-by: Qizhong Mao <qizhong.mao@bytedance.com>

gemini-code-assist bot reviewed Jul 27, 2025

View reviewed changes

ae86zhizhi force-pushed the feature/remote-vllm-tokenizer-with-engine-integration branch from cc8e23f to ce594ec Compare July 27, 2025 19:41

gemini-code-assist bot reviewed Jul 27, 2025

View reviewed changes

ae86zhizhi force-pushed the feature/remote-vllm-tokenizer-with-engine-integration branch from ce594ec to 1325266 Compare July 28, 2025 02:36