
Commit b0eebc1

feat: add vLLM remote tokenizer with engine integration (#1328)
Add support for using vLLM's remote tokenizer endpoint to enable tokenization without loading models in gateway plugins. This feature allows the gateway to delegate tokenization to vLLM engine instances, reducing memory usage and improving scalability.

## Key Features

- Integrate vLLM's /tokenize endpoint for remote tokenization
- Implement TokenizerPool for managing per-model tokenizer connections
- Support health checking and automatic failover to the local tokenizer
- Add caching and connection pooling for performance
- Support both vLLM and other inference engines through pod label detection

## Implementation Details

- New remote tokenizer client with retry logic and timeout handling
- TokenizerPool with concurrent access support and automatic cleanup
- Health monitoring with a 5-second timeout for tokenizer endpoints
- Fallback to the local character tokenizer when the remote is unavailable
- Prometheus metrics for monitoring tokenizer pool status

## Configuration

- AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER: Feature flag (default: false)
- AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE: Endpoint format (default: "http://%s:8000")
- AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD: Health check interval (default: 30s)
- AIBRIX_TOKENIZER_TTL: Unused tokenizer cleanup time (default: 5m)
- AIBRIX_MAX_TOKENIZERS_PER_POOL: Pool size limit (default: 100)

## Review Feedback Addressed

- Changed the default to disabled for production safety
- Fixed race conditions in concurrent access
- Reduced lock contention with double-checked locking
- Added comprehensive test coverage, including benchmarks
- Created a centralized constants package for Kubernetes labels

Tested with vLLM v0.4.0; includes backward-compatibility support.

Signed-off-by: ae86zhizhi <550149470@qq.com>
1 parent 8400d0a commit b0eebc1

File tree

9 files changed: +1471 −21 lines
Lines changed: 68 additions & 0 deletions

# vLLM Remote Tokenizer Feature

This feature enables model-aware remote tokenizer support for vLLM inference engines in the AIBrix gateway.

## Quick Start

Enable the vLLM remote tokenizer with one command:

```bash
kubectl apply -k config/features/vllm-remote-tokenizer/
```

## Configuration

The following environment variables are configured:

| Variable | Default | Description |
|----------|---------|-------------|
| AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER | false | Enable remote tokenizer feature |
| AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE | http://%s:8000 | URL template for vLLM endpoints |
| AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD | 30s | Health check interval |
| AIBRIX_TOKENIZER_TTL | 5m | Tokenizer cache TTL |
| AIBRIX_MAX_TOKENIZERS_PER_POOL | 100 | Maximum tokenizers per pool |
| AIBRIX_TOKENIZER_REQUEST_TIMEOUT | 10s | Request timeout |

## Customization

To use custom values, copy this directory and modify `gateway-plugins-env-patch.yaml`:

```bash
cp -r config/features/vllm-remote-tokenizer/ config/features/my-vllm-config/
# Edit config/features/my-vllm-config/gateway-plugins-env-patch.yaml
kubectl apply -k config/features/my-vllm-config/
```

## Enable the Feature

To enable the vLLM remote tokenizer after installation:

```bash
kubectl set env deployment/gateway-plugins -n aibrix-system AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER=true
```

Or use a custom Kustomize overlay with the environment variable set to `true`.

## Disable

To disable, set the environment variable to `false`:

```bash
kubectl set env deployment/gateway-plugins -n aibrix-system AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER=false
```

## Verification

Check whether the feature is enabled:

```bash
kubectl get deployment gateway-plugins -n aibrix-system -o json | \
  jq '.spec.template.spec.containers[0].env[] | select(.name | startswith("AIBRIX_ENABLE_VLLM"))'
```

Check metrics:

```bash
kubectl port-forward -n aibrix-system svc/gateway-plugins 8080:8080
curl http://localhost:8080/metrics | grep aibrix_tokenizer_pool
```
Lines changed: 23 additions & 0 deletions

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway-plugins
  namespace: system
spec:
  template:
    spec:
      containers:
        - name: gateway-plugin
          env:
            - name: AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER
              value: "false"
            - name: AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE
              value: "http://%s:8000"
            - name: AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD
              value: "30s"
            - name: AIBRIX_TOKENIZER_TTL
              value: "5m"
            - name: AIBRIX_MAX_TOKENIZERS_PER_POOL
              value: "100"
            - name: AIBRIX_TOKENIZER_REQUEST_TIMEOUT
              value: "10s"
```
Lines changed: 16 additions & 0 deletions

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: aibrix-system

# This overlay enables vLLM remote tokenizer support
# Apply with: kubectl apply -k config/features/vllm-remote-tokenizer/

resources:
  - ../../gateway/gateway-plugin

patches:
  - path: gateway-plugins-env-patch.yaml
    target:
      kind: Deployment
      name: gateway-plugins
```

pkg/apis/constants/labels.go

Lines changed: 54 additions & 0 deletions

```go
/*
Copyright 2024 The Aibrix Team.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package constants

// Label keys used by the Aibrix system.
// The format `resource.aibrix.ai/attribute` is the standard.

const (
	// ModelNameLabel is the label for identifying the model name.
	// Example: "model.aibrix.ai/name": "deepseek-llm-7b-chat"
	ModelNameLabel = "model.aibrix.ai/name"

	// ModelEngineLabel is the label for identifying the inference engine.
	// Example: "model.aibrix.ai/engine": "vllm"
	ModelEngineLabel = "model.aibrix.ai/engine"

	// ModelMetricPortLabel is the label for specifying the metrics port.
	// Example: "model.aibrix.ai/metric-port": "8000"
	ModelMetricPortLabel = "model.aibrix.ai/metric-port"

	// ModelPortLabel is the label for specifying the service port.
	// Example: "model.aibrix.ai/port": "8080"
	ModelPortLabel = "model.aibrix.ai/port"
)

// GetModelName retrieves the model name from pod labels.
func GetModelName(labels map[string]string) string {
	if model, ok := labels[ModelNameLabel]; ok {
		return model
	}
	return ""
}

// GetInferenceEngine retrieves the inference engine from pod labels.
func GetInferenceEngine(labels map[string]string) string {
	if engine, ok := labels[ModelEngineLabel]; ok {
		return engine
	}
	return ""
}
```
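Callers pass pod labels straight to these helpers, which return the empty string when a label is absent rather than erroring. A self-contained usage sketch (the constant and lookup are reproduced inline so the snippet runs standalone, mirroring `constants.GetModelName`):

```go
package main

import "fmt"

// modelNameLabel mirrors constants.ModelNameLabel from the new package.
const modelNameLabel = "model.aibrix.ai/name"

// getModelName mirrors constants.GetModelName: it returns the empty
// string when the label is missing, letting callers choose a default.
func getModelName(labels map[string]string) string {
	if model, ok := labels[modelNameLabel]; ok {
		return model
	}
	return ""
}

func main() {
	podLabels := map[string]string{
		"model.aibrix.ai/name":   "deepseek-llm-7b-chat",
		"model.aibrix.ai/engine": "vllm",
	}
	fmt.Println(getModelName(podLabels))           // deepseek-llm-7b-chat
	fmt.Println(getModelName(map[string]string{})) // "" — label missing
}
```

Returning `""` instead of an error keeps call sites simple; the cache code below falls back to a default engine value when the lookup comes back empty.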

pkg/cache/cache_metrics.go

Lines changed: 5 additions & 6 deletions

```diff
@@ -21,6 +21,7 @@ import (

 	prometheusv1 "github.com/prometheus/client_golang/api/prometheus/v1"
 	dto "github.com/prometheus/client_model/go"
+	"github.com/vllm-project/aibrix/pkg/apis/constants"
 	"github.com/vllm-project/aibrix/pkg/metrics"
 	"github.com/vllm-project/aibrix/pkg/utils"
 	"k8s.io/klog/v2"
@@ -29,10 +30,8 @@ import (
 const (
 	// When the engine's HTTP proxy is separated from the engine itself,
 	// the request port and metrics port may differ, so a dedicated metrics port is required.
-	MetricPortLabel = "model.aibrix.ai/metric-port"
-	engineLabel     = "model.aibrix.ai/engine"
-	portLabel       = "model.aibrix.ai/port"
-	modelLabel      = "model.aibrix.ai/name"
+	// Note: Using MetricPortLabel for backward compatibility, but it's the same as constants.ModelMetricPortLabel
+	MetricPortLabel = constants.ModelMetricPortLabel
 	defaultMetricPort                   = 8000
 	defaultEngineLabelValue             = "vllm"
 	defaultPodMetricRefreshIntervalInMS = 50
@@ -337,7 +336,7 @@ func (c *Store) fetchMetrics(pod *Pod, allMetrics map[string]*dto.MetricFamily,
 		klog.V(4).Infof("Cannot find labelMetricName %v in collected metrics names", labelMetricName)
 		return nil, false
 	}
-	engineType, err := getPodLabel(pod, engineLabel)
+	engineType, err := getPodLabel(pod, constants.ModelEngineLabel)
 	if engineType == "" {
 		klog.V(4).Infof(err.Error())
 		engineType = defaultEngineLabelValue
@@ -363,7 +362,7 @@ func (c *Store) updatePodRecord(pod *Pod, modelName string, metricName string, s
 	} else if scope == metrics.PodModelMetricScope {
 		var err error
 		if modelName == "" {
-			modelName, err = getPodLabel(pod, modelLabel)
+			modelName, err = getPodLabel(pod, constants.ModelNameLabel)
 			if err != nil {
 				return fmt.Errorf("modelName should not be empty for scope %v", scope)
 			}
```

pkg/plugins/gateway/algorithms/prefix_cache.go

Lines changed: 53 additions & 15 deletions

```diff
@@ -20,6 +20,7 @@ import (
 	"math"
 	"math/rand"
 	"sort"
+	"time"

 	"github.com/vllm-project/aibrix/pkg/cache"
 	"github.com/vllm-project/aibrix/pkg/types"
@@ -41,6 +42,14 @@ var (
 	tokenizerType                          = utils.LoadEnv("AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE", "character")
 	podRunningRequestImbalanceAbsCount int = utils.LoadEnvInt("AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT", defaultPodRunningRequestImbalanceAbsCount)
 	standardDeviationFactor            int = utils.LoadEnvInt("AIBRIX_PREFIX_CACHE_STANDARD_DEVIATION_FACTOR", defaultStandardDeviationFactor)
+
+	// vLLM Remote Tokenizer configuration
+	enableVLLMRemoteTokenizer     = utils.LoadEnvBool("AIBRIX_ENABLE_VLLM_REMOTE_TOKENIZER", false)
+	vllmTokenizerEndpointTemplate = utils.LoadEnv("AIBRIX_VLLM_TOKENIZER_ENDPOINT_TEMPLATE", "http://%s:8000")
+	tokenizerHealthCheckPeriod    = utils.LoadEnvDuration("AIBRIX_TOKENIZER_HEALTH_CHECK_PERIOD", 30*time.Second)
+	tokenizerTTL                  = utils.LoadEnvDuration("AIBRIX_TOKENIZER_TTL", 5*time.Minute)
+	maxTokenizersPerPool          = utils.LoadEnvInt("AIBRIX_MAX_TOKENIZERS_PER_POOL", 100)
+	tokenizerRequestTimeout       = utils.LoadEnvDuration("AIBRIX_TOKENIZER_REQUEST_TIMEOUT", 10*time.Second)
 )

 func init() {
@@ -49,41 +58,60 @@ func init() {

 type prefixCacheRouter struct {
 	cache              cache.Cache
-	tokenizer          tokenizer.Tokenizer
+	tokenizer          tokenizer.Tokenizer // Fallback tokenizer for backward compatibility
+	tokenizerPool      *TokenizerPool      // Model-aware tokenizer pool
 	prefixCacheIndexer *prefixcacheindexer.PrefixHashTable
 }

 func NewPrefixCacheRouter() (types.Router, error) {
-	// Create tokenizer based on type
+	c, err := cache.Get()
+	if err != nil {
+		klog.Error("fail to get cache store in prefix cache router")
+		return nil, err
+	}
+
+	// Create fallback tokenizer based on type
 	// Supported tokenizers: ["character", "tiktoken"]
 	// Default: "character" for any unrecognized type
-	// TODO: Add support for "remote" and "vllm" tokenizer types in a future PR.
-	// This will require proper configuration handling for remote endpoints.
-	var tokenizerObj tokenizer.Tokenizer
+	var fallbackTokenizer tokenizer.Tokenizer
 	if tokenizerType == "tiktoken" {
-		tokenizerObj = tokenizer.NewTiktokenTokenizer()
+		fallbackTokenizer = tokenizer.NewTiktokenTokenizer()
 	} else {
 		// Default to character tokenizer for backward compatibility
 		if tokenizerType != "character" {
 			klog.InfoS("unrecognized tokenizer type, defaulting to character", "type", tokenizerType)
 		}
-		tokenizerObj = tokenizer.NewCharacterTokenizer()
+		fallbackTokenizer = tokenizer.NewCharacterTokenizer()
 	}

-	c, err := cache.Get()
-	if err != nil {
-		klog.Error("fail to get cache store in prefix cache router")
-		return nil, err
+	// Initialize TokenizerPool for vLLM remote tokenizer support
+	poolConfig := TokenizerPoolConfig{
+		EnableVLLMRemote:     enableVLLMRemoteTokenizer,
+		EndpointTemplate:     vllmTokenizerEndpointTemplate,
+		HealthCheckPeriod:    tokenizerHealthCheckPeriod,
+		TokenizerTTL:         tokenizerTTL,
+		MaxTokenizersPerPool: maxTokenizersPerPool,
+		FallbackTokenizer:    fallbackTokenizer,
+		Timeout:              tokenizerRequestTimeout,
+		ModelServiceMap:      make(map[string]string), // Can be populated from config later
 	}

+	pool := NewTokenizerPool(poolConfig, c)
+
 	klog.InfoS("prefix_cache_configurations",
 		"tokenizer_type", tokenizerType,
 		"pod_running_request_imbalance_abs_count", podRunningRequestImbalanceAbsCount,
-		"matched_pods_running_requests_standard_deviation_factor", standardDeviationFactor)
+		"matched_pods_running_requests_standard_deviation_factor", standardDeviationFactor,
+		"enable_vllm_remote_tokenizer", enableVLLMRemoteTokenizer,
+		"vllm_tokenizer_endpoint_template", vllmTokenizerEndpointTemplate,
+		"tokenizer_health_check_period", tokenizerHealthCheckPeriod,
+		"tokenizer_ttl", tokenizerTTL,
+		"max_tokenizers_per_pool", maxTokenizersPerPool)

 	return prefixCacheRouter{
 		cache:              c,
-		tokenizer:          tokenizerObj,
+		tokenizer:          fallbackTokenizer, // Keep for backward compatibility
+		tokenizerPool:      pool,
 		prefixCacheIndexer: prefixcacheindexer.NewPrefixHashTable(),
 	}, nil
 }
@@ -93,12 +121,22 @@ func (p prefixCacheRouter) Route(ctx *types.RoutingContext, readyPodList types.P
 	var matchedPods map[string]int
 	var targetPod *v1.Pod

-	tokens, err := p.tokenizer.TokenizeInputText(ctx.Message)
+	readyPods := readyPodList.All()
+
+	// Get tokenizer - use pool only if vLLM remote tokenizer is enabled
+	var tokenizerObj tokenizer.Tokenizer
+	if enableVLLMRemoteTokenizer {
+		tokenizerObj = p.tokenizerPool.GetTokenizer(ctx.Model, readyPods)
+	} else {
+		// Use the original tokenizer for backward compatibility
+		tokenizerObj = p.tokenizer
+	}
+
+	tokens, err := tokenizerObj.TokenizeInputText(ctx.Message)
 	if err != nil {
 		return "", err
 	}

-	readyPods := readyPodList.All()
 	readyPodsMap := map[string]struct{}{}
 	for _, pod := range readyPods {
 		readyPodsMap[pod.Name] = struct{}{}
```