
Conversation

@varungup90 (Collaborator) commented Apr 2, 2025

Pull Request Description

  • Use the running-request count tracked within the gateway plugin. The previous approach of fetching active requests from the inference engine was prone to request bursts.
  • Add load awareness to both the pre-prefix-match and post-prefix-match stages: pre-match uses the absolute difference between the max and min running-request counts, post-match uses the standard deviation to restrict overloading of the prefix-matched pods, and least-request routing is used as the final fallback (see the sketch below).
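A minimal Go sketch of these two checks, for illustration only (function names, the map-based inputs, and the threshold handling are invented here; this is not the merged aibrix code):

```go
package main

import "math"

// imbalancedAbs sketches the pre-prefix-match check: the gap between the
// most and least loaded pods exceeds an absolute running-request count.
func imbalancedAbs(running map[string]int, absThreshold int) bool {
	if len(running) == 0 {
		return false
	}
	min, max := math.MaxInt, math.MinInt
	for _, n := range running {
		if n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max-min > absThreshold
}

// overloaded sketches the post-prefix-match check: a prefix-matched pod is
// skipped when its running-request count exceeds mean + k*stddev across all
// pods (k=1 initially; k=2 in the "2*std_dev load factor" run below).
func overloaded(running map[string]int, pod string, k float64) bool {
	if len(running) == 0 {
		return false
	}
	var sum, sumSq float64
	for _, n := range running {
		sum += float64(n)
		sumSq += float64(n) * float64(n)
	}
	count := float64(len(running))
	mean := sum / count
	variance := sumSq/count - mean*mean
	if variance < 0 { // guard against float rounding
		variance = 0
	}
	return float64(running[pod]) > mean+k*math.Sqrt(variance)
}
```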

Env

  • Model: llama3-8B
  • Cluster: L20*8
  • Input dataset: python3 benchmark/generator/generate_realistic_prefix_share_workload.py

Output: prefix_cache is ~45% faster compared to random routing.

Output summaries and detailed PDF reports

random routing
random.pdf

least-request
least-request.pdf

prefix_cache
prefix_cache.pdf

Blips occur when the prefix-matched pods are overloaded and fallback to a non-prefix-cache pod happens.

prefix_cache with 2*std_dev load factor
prefix_cache_load_factor_2.pdf

If we use the second standard deviation, i.e., pack more requests onto pods that already have the prefix cached, then the number of blips reduces as expected and overall P50 and P99 TTFT improve by ~10% compared to the first standard deviation.

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@varungup90 varungup90 requested review from zhangjyr and Jeffwan April 2, 2025 05:33
@varungup90 varungup90 force-pushed the prefix-cache-improvements branch from 9cf6406 to 0f4e773 on April 2, 2025 19:45
@Jeffwan Jeffwan requested a review from Copilot April 2, 2025 21:53
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR improves the performance of prefix cache routing by tracking active requests through the gateway plugin and adding load-aware routing based on pod request metrics. Key changes include:

  • Refactoring the prefix cache indexer with a new hash-based tokenizer and improved prefix matching logic.
  • Implementing load imbalance detection using mean and standard deviation of pod request counts.
  • Adjusting test cases, configuration parameters, and related utilities to support these new routing features.

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 1 comment.

Summary per file:

  • pkg/plugins/gateway/algorithms/tokenizer/characters.go: Removed in-file tests and refactored tokenizer initialization.
  • pkg/plugins/gateway/algorithms/router_test.go: Updated test contexts and model parameters for routing tests.
  • pkg/plugins/gateway/algorithms/prefixcacheindexer/{hash.go, tree_test.go, hash_test.go}: Modified indexing logic, prefix-matching functions, and test stubs for E2E testing.
  • pkg/plugins/gateway/algorithms/prefix_cache_test.go: Enhanced E2E tests with additional routing scenarios and logging for diagnostics.
  • pkg/plugins/gateway/algorithms/prefix_cache_and_load.go: Fixed import paths to align with the new module structure.
  • pkg/plugins/gateway/algorithms/prefix_cache.go: Introduced load-aware routing using mean and standard deviation; refactored fallback logic.
  • pkg/plugins/gateway/algorithms/least_request.go: Updated to use realtime metrics in the least-request routing strategy.
  • pkg/cache/{cache_metrics.go, cache_init.go}: Adjusted pod metric updates and test cache initialization to include model context.
  • config/{overlays/release/default_patch.yaml, gateway-plugin.yaml}: Revised environment variable settings for prefix cache block size.
  • benchmarks/client/client.py: Captured and logged request IDs from response headers for better traceability.
Comments suppressed due to low confidence (1)

pkg/plugins/gateway/algorithms/prefix_cache.go:58

  • The use of fmt.Println for logging in production code could clutter standard output; consider using the logging library (e.g., klog) consistently for debug and info logging instead.
fmt.Println(input)
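As a concrete illustration of that suggestion, the debug print could become a leveled structured log call (the verbosity level and message here are assumptions, not the merged code):

```go
// Sketch: route debug output through klog instead of stdout.
klog.V(4).InfoS("prefix cache routing input", "input", input)
```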

"k8s.io/klog/v2"
)

const (
	defaultPrefixCacheBlockNumber = 200000
-	defaultPrefixCacheBlockSize   = 16
+	defaultPrefixCacheBlockSize   = 4
Copilot AI commented Apr 2, 2025

The constant defaultPrefixCacheBlockSize is hardcoded to 4, but configuration overlays set AIBRIX_PREFIX_CACHE_BLOCK_SIZE to other values (e.g., 128 and 32). Consider reading this value from an environment variable so that the runtime configuration is consistent with the expectations in deployment.

Suggested change:

-defaultPrefixCacheBlockSize = 4
+defaultPrefixCacheBlockSize = getDefaultPrefixCacheBlockSize()

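A hedged sketch of what such an env-driven default could look like (os.Getenv is used for self-containment, whereas this PR's code uses a utils.LoadEnv helper; the fallback constant name is invented):

```go
package main

import (
	"os"
	"strconv"
)

// fallbackPrefixCacheBlockSize mirrors the hardcoded default (4) in this diff.
const fallbackPrefixCacheBlockSize = 4

// getDefaultPrefixCacheBlockSize resolves the block size from the
// AIBRIX_PREFIX_CACHE_BLOCK_SIZE environment variable so deployments that
// override it (e.g., 128 or 32 in the overlays) stay consistent at runtime.
func getDefaultPrefixCacheBlockSize() int {
	if value := os.Getenv("AIBRIX_PREFIX_CACHE_BLOCK_SIZE"); value != "" {
		if size, err := strconv.Atoi(value); err == nil && size > 0 {
			return size
		}
	}
	return fallbackPrefixCacheBlockSize
}
```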

@Jeffwan (Collaborator) commented Apr 2, 2025

Seems most of the requests hit the cache through the prefix-cache; there are some requests in the middle that result in higher TTFT. Not sure if there's room to keep improving it.

@varungup90 (Collaborator, Author) replied:

Yeah, I am debugging that part. One scenario I came across is as follows.

T1: Request R1 arrives. Prefix_Match_Pods: [P1: 97%, P2: 97%], Active_Request_Count: [P1: 2, P2: 2, P3: 0].
For R1, post prefix-match, the load is considered imbalanced because the P1 and P2 request counts exceed mean + std_dev (across all pods), so fallback to least-request routing happens. Selected target_pod = P3.

Before R1 completes, R2 and R3 arrive with Prefix_Match_Pods: [P3: 90%] and Active_Request_Count: [P1: 2, P2: 2, P3: 1 (or 2 for R3)].
For R2 and R3, P3 is selected. Behavior is as expected for this scenario. I am checking all scenarios to see potential improvements.

func getPrefixCacheMatchThresholdPercent() int {
value := utils.LoadEnv("AIBRIX_PREFIX_CACHE_MATCH_THRESHOLD_PERCENT", "")
func getPodRequestImbalanceAbsCount() int {
value := utils.LoadEnv("AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT", "")
Collaborator: Let's update the docs for the newly added envs.
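To seed that docs update, here are plausible one-line descriptions, inferred from the variable names and the PR description (the wording is an assumption, not the shipped docs):

  • AIBRIX_PREFIX_CACHE_MATCH_THRESHOLD_PERCENT: minimum prefix-match percentage a pod must reach before it is treated as a prefix-cache hit.
  • AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT: absolute max-min gap in per-pod running-request counts above which the pre-match check considers the load imbalanced.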

Collaborator: Different GPUs have different compute capacities, so using absolute numbers doesn't look that scalable. How can we generate this number?

varungup90 (Collaborator, Author): We can leverage the GPU optimizer to generate recommendations for the absolute numbers that define imbalance.

@Jeffwan (Collaborator) Apr 7, 2025: We do not need to block this PR, but this is something we have to consider in the future. Going that way would introduce a dependency on the gpu-optimizer, which makes the solution a little heavy. /cc @zhangjyr

@varungup90 (Collaborator, Author) commented Apr 3, 2025


@Jeffwan I tried the second standard deviation to load-balance across the prefix-matched pods: TTFT improves by another ~10% and the number of blips reduces. Initially my concern was that packing more requests onto the same pod could have an adverse effect, but since these extra packed requests already have their KV cache present, overall performance is not negatively affected. Updated the PR description with datapoints and results.

@varungup90 varungup90 force-pushed the prefix-cache-improvements branch from 32d931a to 76d0551 on April 7, 2025 21:26
@Jeffwan (Collaborator) left a comment:

/lgtm excellent!

@varungup90 varungup90 merged commit d546795 into vllm-project:main Apr 8, 2025
11 checks passed
@Jeffwan Jeffwan deleted the prefix-cache-improvements branch April 8, 2025 04:55
@@ -88,12 +89,10 @@ func (e *LRUStore[K, V]) Put(key K, value V) bool {
}

func (e *LRUStore[K, V]) Get(key K) (V, bool) {
Contributor: I don't quite understand why the relevant entry isn't updated here. Shouldn't the corresponding entry be moved to the front of the lruList when there is a matching request?

varungup90 (Collaborator, Author):

Two reasons to remove the update here:

  1. Right now the LRU store is only used by the hash table implementation. In MatchPrefix, it calls the Get method to retrieve the blocks, but all of these blocks are re-iterated in AddPrefix, which calls the Put method on the LRUStore, where they are updated with the current time and moved to the head.
  2. After removing the update from the Get method, its lock can be changed to an RLock, which improves performance a little. Once the LRUStore implements a sync map and does not require locking, we can add the update + move-to-head back into the Get method.
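For illustration, a read-only Get under a sync.RWMutex might look like this (a sketch only; the field names mu and entries and the entry type are assumptions, not the actual LRUStore internals):

```go
package main

import "sync"

// Assumed internal shape of LRUStore, for this sketch only.
type entry[K comparable, V any] struct {
	key   K
	value V
}

type LRUStore[K comparable, V any] struct {
	mu      sync.RWMutex
	entries map[K]*entry[K, V]
}

// Get takes only a read lock and deliberately skips any move-to-front
// recency update, per reasons (1) and (2) above; recency is refreshed when
// AddPrefix later calls Put on the same blocks.
func (e *LRUStore[K, V]) Get(key K) (V, bool) {
	e.mu.RLock()
	defer e.mu.RUnlock()

	if elem, ok := e.entries[key]; ok {
		return elem.value, true
	}
	var zero V
	return zero, false
}
```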

gangmuk pushed a commit to gangmuk/aibrix-gangmuk that referenced this pull request Jun 21, 2025
* Performance improvements for prefix cache routing

Signed-off-by: Varun Gupta <varungup90@gmail.com>
@justadogistaken (Contributor): I would like to inquire about the two algorithms, prefix-cache and prefix-cache-preble. Which one performs better, and how significant is the difference?

Yaegaki1Erika pushed a commit to Yaegaki1Erika/aibrix that referenced this pull request Jul 23, 2025
* Performance improvements for prefix cache routing

Signed-off-by: Varun Gupta <varungup90@gmail.com>