feat: add vLLM remote tokenizer with engine integration #1328
Conversation
Summary of Changes
Hello @ae86zhizhi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the AIBrix gateway's tokenization capabilities by integrating a remote tokenizer system that leverages vLLM engine instances. The primary goal is to ensure tokenization consistency with the serving models, which is crucial for accurate routing decisions and model-specific processing. This feature introduces a robust TokenizerPool for managing remote connections, complete with health checks and dynamic discovery of vLLM endpoints, while maintaining backward compatibility through a fallback mechanism.
Highlights
- Remote Tokenizer Integration: I've implemented support for vLLM remote tokenizers, enabling the system to use the same tokenizer as the serving engine for improved consistency and accuracy in token counting, which is crucial for routing decisions.
- Dynamic Tokenizer Management: I've introduced a TokenizerPool to dynamically manage model-specific remote tokenizers. This includes connection pooling, active health checking, and automatic cleanup of stale connections to ensure efficient resource utilization.
- Configurable Behavior: I've added new environment variables that allow for fine-grained control over the remote tokenizer feature, such as enabling/disabling it, specifying the vLLM endpoint template, configuring health check periods, and setting request timeouts.
- Robust Fallback Mechanism: To ensure backward compatibility and system resilience, I've implemented a graceful fallback to local tokenizers if remote tokenizers are disabled, unavailable, or become unhealthy, preventing service disruption.
- Kubernetes Deployment Support: I've provided Kustomize configurations to simplify the deployment and enablement of the remote tokenizer feature within a Kubernetes environment, making it easy to integrate into existing setups.
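For illustration, the remote-tokenization call the highlights describe amounts to an HTTP POST against the engine's /tokenize route. A minimal sketch in Go — the request/response field names follow vLLM's OpenAI-compatible server and are an assumption here, not code from this PR:

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// tokenizeRequest/tokenizeResponse mirror vLLM's /tokenize payload;
// the exact field set is assumed for illustration.
type tokenizeRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

type tokenizeResponse struct {
	Count  int   `json:"count"`
	Tokens []int `json:"tokens"`
}

// remoteTokenize posts the prompt to a vLLM pod and returns its token IDs.
func remoteTokenize(ctx context.Context, endpoint, model, prompt string) ([]int, error) {
	body, err := json.Marshal(tokenizeRequest{Model: model, Prompt: prompt})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint+"/tokenize", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("tokenize failed: %s", resp.Status)
	}

	var out tokenizeResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Tokens, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	tokens, err := remoteTokenize(ctx, "http://10.0.0.12:8000", "example-model", "hello world")
	fmt.Println(tokens, err)
}
```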
Force-pushed from cc8e23f to ce594ec
Code Review
This pull request introduces a vLLM remote tokenizer feature, which is a great addition for ensuring tokenizer consistency. The implementation includes a TokenizerPool for managing remote tokenizers, complete with configuration, health checking, and metrics. The code is well-structured, but I've identified a few critical and high-severity issues related to concurrency and potential deadlocks that should be addressed. Specifically, there's a data race in GetTokenizer and a risk of long-held locks in cleanup and shutdown routines. I've also suggested an efficiency improvement to avoid unnecessary data copying.
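One common fix for the GetTokenizer data race mentioned above is double-checked locking: a read-locked fast path, then a re-check after taking the write lock so two goroutines don't both create a tokenizer. A sketch of the pattern (types and the create callback are placeholders, not this PR's implementation):

```go
package tokenizer

import "sync"

// Tokenizer is a minimal stand-in for the project's tokenizer interface.
type Tokenizer interface {
	TokenizeInputText(text string) ([]byte, error)
}

type TokenizerPool struct {
	mu         sync.RWMutex
	tokenizers map[string]Tokenizer
}

// GetTokenizer returns the cached tokenizer for a model, creating it at most once.
func (p *TokenizerPool) GetTokenizer(model string, create func() Tokenizer) Tokenizer {
	// Fast path: read lock only.
	p.mu.RLock()
	t, ok := p.tokenizers[model]
	p.mu.RUnlock()
	if ok {
		return t
	}

	// Slow path: take the write lock and re-check, since another goroutine
	// may have created the entry between the two lock acquisitions.
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.tokenizers[model]; ok {
		return t
	}
	t = create()
	p.tokenizers[model] = t
	return t
}
```

With this shape, steady-state lookups only take the read lock, which keeps contention low on the hot path.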
Force-pushed from ce594ec to 1325266
Force-pushed from f958235 to 7c2b3b1
The version is too low; v0.10.0 is out. I believe the tokenization interface hasn't changed in the latest version, but it's always good to test with the latest version.
It seems the feature is protected by the environment flag, so it should be safe to merge the code after the comments have been addressed. We can leave the testing to a later phase.
This is a little bit tricky; it's hard to specify an additional model endpoint. Technically, the cache can detect all the pods, and it can randomly pick an endpoint and get a response. Due to limited time, we can merge it first and then make the changes.
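A rough sketch of the idea in that comment — ask the cache for the model's pods, pick a ready one at random, and apply the endpoint template; the cache helper and Pod type here are hypothetical stand-ins:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// Pod is a simplified stand-in for what the cache tracks per endpoint.
type Pod struct {
	IP    string
	Ready bool
}

// listPodsForModel is a placeholder for the cache lookup described in the review.
func listPodsForModel(model string) []Pod {
	return []Pod{{IP: "10.0.0.12", Ready: true}, {IP: "10.0.0.13", Ready: true}}
}

// endpointForModel picks a random ready pod and formats the endpoint template,
// e.g. "http://%s:8000" -> "http://10.0.0.12:8000".
func endpointForModel(model, template string) (string, error) {
	var ready []Pod
	for _, p := range listPodsForModel(model) {
		if p.Ready {
			ready = append(ready, p)
		}
	}
	if len(ready) == 0 {
		return "", errors.New("no ready pods for model " + model)
	}
	pod := ready[rand.Intn(len(ready))]
	return fmt.Sprintf(template, pod.IP), nil
}

func main() {
	ep, err := endpointForModel("example-model", "http://%s:8000")
	fmt.Println(ep, err)
}
```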
klog.Error("fail to get cache store in prefix cache router")
return nil, err
// Initialize TokenizerPool for vLLM remote tokenizer support
poolConfig := TokenizerPoolConfig{
Actually, we should leverage the cache information to populate the same model's pod endpoints and orchestrate the request.
// Initialize TokenizerPool for vLLM remote tokenizer support
poolConfig := TokenizerPoolConfig{
    EnableVLLMRemote: enableVLLMRemoteTokenizer,
    EndpointTemplate: vllmTokenizerEndpointTemplate,
We probably need to add a model -> engine mapping from model.aibrix.ai/engine: "vllm" in the cache so consumers can easily distinguish the engine.
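A small sketch of that model-to-engine mapping, keyed off the pod label quoted in the comment; the helper and the default value are assumptions for illustration:

```go
package main

import "fmt"

// engineLabelKey is the label key quoted in the review comment.
const engineLabelKey = "model.aibrix.ai/engine"

// engineForPod decides which engine serves a pod from its labels,
// assuming "vllm" when the label is absent (an illustrative default).
func engineForPod(labels map[string]string) string {
	if engine, ok := labels[engineLabelKey]; ok {
		return engine
	}
	return "vllm"
}

func main() {
	labels := map[string]string{engineLabelKey: "vllm"}
	// Only vLLM-served pods are safe targets for the remote /tokenize route.
	fmt.Println(engineForPod(labels) == "vllm")
}
```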
Force-pushed from 4760341 to 1e2d733
PR Summary: Remote vLLM Tokenizer Integration Review Fixes
Completed Review Comments
- Default Value Inconsistency (Internal review)
- Race Condition Fix (@DwyaneShi: "what if the tokenizer is cleaned up in the between?")
- Lock Duration Optimization (@DwyaneShi: "holding a lock for such a long time is not suggested")
- time.Now() Optimization (@DwyaneShi: "time.Now() could be time consuming on some env")
- Label Constants (@DwyaneShi: "better to define constants for 'aibrix.ai/model'")
Deferred for Future PRs
- vLLM Version Compatibility (@Jeffwan: "the version is too low. v0.10.0 is out")
- Endpoint Template Design (@Jeffwan: "this is a little bit tricky")
- Cache-based Endpoint Orchestration (enhancement opportunity)
- Model-to-Engine Mapping (enhancement opportunity)
All completed fixes have been thoroughly tested, with unit tests and integration tests passing. The code follows project conventions and maintains backward compatibility where needed.
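For the lock-duration item, one common shape is to snapshot stale entries under a read lock and take the write lock only for the deletions; a sketch of that pattern under assumed field names, not the PR's exact code:

```go
package tokenizer

import (
	"sync"
	"time"
)

// poolEntry pairs a tokenizer slot with its last-used timestamp; fields are illustrative.
type poolEntry struct {
	lastUsed time.Time
}

type pool struct {
	mu      sync.RWMutex
	entries map[string]*poolEntry
	ttl     time.Duration
}

// cleanupStale removes entries unused for longer than ttl while keeping
// the write lock held only for the deletions themselves.
func (p *pool) cleanupStale(now time.Time) {
	// Phase 1: collect candidates under the cheaper read lock.
	p.mu.RLock()
	var stale []string
	for model, e := range p.entries {
		if now.Sub(e.lastUsed) > p.ttl {
			stale = append(stale, model)
		}
	}
	p.mu.RUnlock()

	if len(stale) == 0 {
		return
	}

	// Phase 2: short write-locked section; re-check before deleting in case
	// an entry was touched again between the two phases.
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, model := range stale {
		if e, ok := p.entries[model]; ok && now.Sub(e.lastUsed) > p.ttl {
			delete(p.entries, model)
		}
	}
}
```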
Force-pushed from d08b7bf to 84f8e95
Force-pushed from 84f8e95 to b0accc5
Add support for using vLLM's remote tokenizer endpoint to enable tokenization without loading models in gateway plugins. This feature allows the gateway to delegate tokenization to vLLM engine instances, reducing memory usage and improving scalability.

## Key Features
- Integrate vLLM's /tokenize endpoint for remote tokenization
- Implement TokenizerPool for managing per-model tokenizer connections
- Support health checking and automatic failover to local tokenizer
- Add caching and connection pooling for performance
- Support both vLLM and other inference engines through pod label detection

## Implementation Details
- New remote tokenizer client with retry logic and timeout handling
- TokenizerPool with concurrent access support and automatic cleanup
- Health monitoring with 5-second timeout for tokenizer endpoints
- Fallback to local character tokenizer when remote unavailable
- Prometheus metrics for monitoring tokenizer pool status

## Configuration
- AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER: Feature flag (default: false)
- AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE: Endpoint format (default: "http://%s:8000")
- AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD: Health check interval (default: 30s)
- AIBRIX_TOKENIZER_TTL: Unused tokenizer cleanup time (default: 5m)
- AIBRIX_MAX_TOKENIZERS_PER_POOL: Pool size limit (default: 100)

## Review Feedback Addressed
- Changed default to disabled for production safety
- Fixed race conditions in concurrent access
- Optimized lock contention with double-checked locking
- Added comprehensive test coverage including benchmarks
- Created centralized constants package for Kubernetes labels

Tested with vLLM v0.4.0 and includes backward compatibility support.

Co-authored-by: DwyaneShi <dyshi@microsoft.com>
Co-authored-by: Jeffwan <jeffwan@amazon.com>
Signed-off-by: ae86zhizhi <550149470@qq.com>
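The fallback behaviour described above — prefer the remote vLLM-backed tokenizer while it is healthy, otherwise drop back to the local one — can be sketched roughly as follows; the interface and method names are placeholders rather than the PR's actual API:

```go
package tokenizer

// Tokenizer is a minimal stand-in for the project's tokenizer interface.
type Tokenizer interface {
	TokenizeInputText(text string) ([]byte, error)
}

// fallbackTokenizer tries the remote vLLM-backed tokenizer first and falls
// back to the local tokenizer when the remote is unhealthy or the call fails.
type fallbackTokenizer struct {
	remote  Tokenizer
	local   Tokenizer
	healthy func() bool // e.g. fed by the pool's periodic health check
}

func (f *fallbackTokenizer) TokenizeInputText(text string) ([]byte, error) {
	if f.remote != nil && f.healthy() {
		if tokens, err := f.remote.TokenizeInputText(text); err == nil {
			return tokens, nil
		}
		// Remote failed mid-request: degrade gracefully instead of erroring out.
	}
	return f.local.TokenizeInputText(text)
}
```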
Force-pushed from b0accc5 to c203159
@@ -0,0 +1,54 @@
/*
Copyright 2024 The Aibrix Team.
All new files' copyright should be 2025
@@ -21,6 +21,7 @@ import (
prometheusv1 "github.com/prometheus/client_golang/api/prometheus/v1"
dto "github.com/prometheus/client_model/go"
"github.com/vllm-project/aibrix/pkg/apis/constants"
This doesn't need to live in apis/constants; just put it in pkg/defines/labels.go directly.
@@ -49,41 +58,60 @@ func init() {
type prefixCacheRouter struct {
    cache     cache.Cache
    tokenizer tokenizer.Tokenizer
    tokenizer tokenizer.Tokenizer // Fallback tokenizer for backward compatibility
A better name would be Default tokenizer instead of Fallback.
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/vllm-project/aibrix/pkg/apis/constants"
Is github.com/vllm-project/aibrix/pkg required here? What if someone forks this repo and uses it internally without access to GitHub?
// Acquire write lock directly to avoid race condition
// TODO: Consider implementing reference counting or double-checked locking
// to improve concurrency performance while maintaining thread safety
p.mu.Lock()
Use defer p.mu.Unlock() instead of an explicit p.mu.Unlock() whenever possible.
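The suggestion in concrete form, as a sketch rather than the PR's code:

```go
package tokenizer

import "sync"

type closer interface{ Close() error }

type tokenizerPool struct {
	mu         sync.Mutex
	tokenizers map[string]closer
}

// shutdown defers the unlock right after acquiring the lock so every return
// path (including panics) releases it, instead of calling Unlock explicitly.
func (p *tokenizerPool) shutdown() {
	p.mu.Lock()
	defer p.mu.Unlock()

	for model, t := range p.tokenizers {
		_ = t.Close() // hypothetical per-tokenizer cleanup hook
		delete(p.tokenizers, model)
	}
}
```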
@ae86zhizhi please address @autopear's review in a follow-up PR. Due to the urgent release timeline, I will merge this one first.
…-project#1328)" This reverts commit b0eebc1. Signed-off-by: Qizhong Mao <qizhong.mao@bytedance.com>
feature: Add vLLM Remote Tokenizer with Engine Integration
Summary
This PR introduces support for vLLM remote tokenizers that can leverage the tokenization capabilities directly from vLLM engine instances. This feature enables more accurate tokenization by using the same tokenizer as the serving engine, ensuring consistency between token counting and actual model processing.
Motivation
What's Changed
- TokenizerPool for managing model-specific remote tokenizers with health checking and connection pooling
- LoadEnvDuration and LoadEnvBool for configuration parsing
How to Enable
To enable the vLLM remote tokenizer feature, set the environment variable AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER=true in the gateway plugin's environment (the flag defaults to false).
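For reference, a sketch of how that configuration could be read on the gateway side; LoadEnvBool and LoadEnvDuration are named in the PR's change list, but their signatures and parsing rules here are assumptions:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// LoadEnvBool and LoadEnvDuration mirror the helpers mentioned in the PR's
// change list; the signatures here are illustrative assumptions.
func LoadEnvBool(key string, def bool) bool {
	if v, ok := os.LookupEnv(key); ok {
		if b, err := strconv.ParseBool(v); err == nil {
			return b
		}
	}
	return def
}

func LoadEnvDuration(key string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	// Defaults follow the values listed in the commit message above.
	enabled := LoadEnvBool("AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER", false)
	healthCheck := LoadEnvDuration("AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD", 30*time.Second)
	ttl := LoadEnvDuration("AIBRIX_TOKENIZER_TTL", 5*time.Minute)
	fmt.Println(enabled, healthCheck, ttl)
}
```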