
Conversation

@varungup90 (Collaborator) commented Apr 2, 2025

Pull Request Description

  • Use the running-request count tracked within the gateway plugin. The previous approach of fetching active requests from the inference engine was prone to request bursts.
  • Add load awareness to both the pre-prefix-match and post-prefix-match stages: pre-match uses the absolute difference between the max and min running-request counts, post-match uses the standard deviation to restrict overloading of the prefix-matched pods, and least-request routing is used as the final fallback (see the sketch below).
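A minimal Go sketch of these two checks, for illustration only (function names, the map-based inputs, and the threshold handling are invented here; this is not the merged aibrix code):

```go
package main

import "math"

// imbalancedAbs sketches the pre-prefix-match check: the gap between the
// most and least loaded pods exceeds an absolute running-request count.
func imbalancedAbs(running map[string]int, absThreshold int) bool {
	if len(running) == 0 {
		return false
	}
	min, max := math.MaxInt, math.MinInt
	for _, n := range running {
		if n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max-min > absThreshold
}

// overloaded sketches the post-prefix-match check: a prefix-matched pod is
// skipped when its running-request count exceeds mean + k*stddev across all
// pods (k=1 initially; k=2 in the "2*std_dev load factor" run below).
func overloaded(running map[string]int, pod string, k float64) bool {
	if len(running) == 0 {
		return false
	}
	var sum, sumSq float64
	for _, n := range running {
		sum += float64(n)
		sumSq += float64(n) * float64(n)
	}
	count := float64(len(running))
	mean := sum / count
	variance := sumSq/count - mean*mean
	if variance < 0 { // guard against float rounding
		variance = 0
	}
	return float64(running[pod]) > mean+k*math.Sqrt(variance)
}
```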

Env

  • Model: llama3-8B
  • Cluster: L20*8
  • Input dataset: python3 benchmark/generator/generate_realistic_prefix_share_workload.py

Output: prefix_cache is ~45% faster compared to random routing.

Output summaries and detailed PDF reports

random routing
random.pdf

least-request
least-request.pdf

prefix_cache
prefix_cache.pdf

Blips occur when the prefix-matched pods are overloaded and fallback to a non-prefix-cache pod happens.

prefix_cache with 2*std_dev load factor
prefix_cache_load_factor_2.pdf

If we use the second standard deviation, i.e., pack more requests onto pods that already have the prefix cached, then the number of blips reduces as expected and overall P50 and P99 TTFT improve by ~10% compared to the first standard deviation.

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@varungup90 varungup90 requested review from zhangjyr and Jeffwan April 2, 2025 05:33
@varungup90 varungup90 force-pushed the prefix-cache-improvements branch from 9cf6406 to 0f4e773 on April 2, 2025 19:45
@Jeffwan Jeffwan requested a review from Copilot April 2, 2025 21:53
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR improves the performance of prefix cache routing by tracking active requests through the gateway plugin and adding load-aware routing based on pod request metrics. Key changes include:

  • Refactoring the prefix cache indexer with a new hash-based tokenizer and improved prefix matching logic.
  • Implementing load imbalance detection using mean and standard deviation of pod request counts.
  • Adjusting test cases, configuration parameters, and related utilities to support these new routing features.

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 1 comment.

Summary per file:

  • pkg/plugins/gateway/algorithms/tokenizer/characters.go: Removed in-file tests and refactored tokenizer initialization.
  • pkg/plugins/gateway/algorithms/router_test.go: Updated test contexts and model parameters for routing tests.
  • pkg/plugins/gateway/algorithms/prefixcacheindexer/{hash.go, tree_test.go, hash_test.go}: Modified indexing logic, prefix-matching functions, and test stubs for E2E testing.
  • pkg/plugins/gateway/algorithms/prefix_cache_test.go: Enhanced E2E tests with additional routing scenarios and logging for diagnostics.
  • pkg/plugins/gateway/algorithms/prefix_cache_and_load.go: Fixed import paths to align with the new module structure.
  • pkg/plugins/gateway/algorithms/prefix_cache.go: Introduced load-aware routing using mean and standard deviation; refactored fallback logic.
  • pkg/plugins/gateway/algorithms/least_request.go: Updated to use realtime metrics in the least-request routing strategy.
  • pkg/cache/{cache_metrics.go, cache_init.go}: Adjusted pod metric updates and test cache initialization to include model context.
  • config/{overlays/release/default_patch.yaml, gateway-plugin.yaml}: Revised environment variable settings for prefix cache block size.
  • benchmarks/client/client.py: Captured and logged request IDs from response headers for better traceability.
Comments suppressed due to low confidence (1)

pkg/plugins/gateway/algorithms/prefix_cache.go:58

  • The use of fmt.Println for logging in production code could clutter standard output; consider using the logging library (e.g., klog) consistently for debug and info logging instead.
fmt.Println(input)
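As a concrete illustration of that suggestion, the debug print could become a leveled structured log call (the verbosity level and message here are assumptions, not the merged code):

```go
// Sketch: route debug output through klog instead of stdout.
klog.V(4).InfoS("prefix cache routing input", "input", input)
```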

"k8s.io/klog/v2"
)

const (
	defaultPrefixCacheBlockNumber = 200000
-	defaultPrefixCacheBlockSize   = 16
+	defaultPrefixCacheBlockSize   = 4
Copilot AI commented Apr 2, 2025

The constant defaultPrefixCacheBlockSize is hardcoded to 4, but configuration overlays set AIBRIX_PREFIX_CACHE_BLOCK_SIZE to other values (e.g., 128 and 32). Consider reading this value from an environment variable so that the runtime configuration is consistent with the expectations in deployment.

Suggested change:

-defaultPrefixCacheBlockSize = 4
+defaultPrefixCacheBlockSize = getDefaultPrefixCacheBlockSize()

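A hedged sketch of what such an env-driven default could look like (os.Getenv is used for self-containment, whereas this PR's code uses a utils.LoadEnv helper; the fallback constant name is invented):

```go
package main

import (
	"os"
	"strconv"
)

// fallbackPrefixCacheBlockSize mirrors the hardcoded default (4) in this diff.
const fallbackPrefixCacheBlockSize = 4

// getDefaultPrefixCacheBlockSize resolves the block size from the
// AIBRIX_PREFIX_CACHE_BLOCK_SIZE environment variable so deployments that
// override it (e.g., 128 or 32 in the overlays) stay consistent at runtime.
func getDefaultPrefixCacheBlockSize() int {
	if value := os.Getenv("AIBRIX_PREFIX_CACHE_BLOCK_SIZE"); value != "" {
		if size, err := strconv.Atoi(value); err == nil && size > 0 {
			return size
		}
	}
	return fallbackPrefixCacheBlockSize
}
```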

@Jeffwan (Collaborator) commented Apr 2, 2025

Seems most of the requests hit the cache through the prefix-cache; there are some requests in the middle that result in higher TTFT. Not sure if there's room to keep improving it.

@varungup90 (Collaborator, Author) replied:

Yeah, I am debugging that part. One scenario I came across is as follows.

T1: Request R1 arrives. Prefix_Match_Pods: [P1: 97%, P2: 97%], Active_Request_Count: [P1: 2, P2: 2, P3: 0].
For R1, post prefix-match, the load is considered imbalanced because the P1 and P2 request counts exceed mean + std_dev (across all pods), so fallback to least-request routing happens. Selected target_pod = P3.

Before R1 completes, R2 and R3 arrive with Prefix_Match_Pods: [P3: 90%] and Active_Request_Count: [P1: 2, P2: 2, P3: 1 (or 2 for R3)].
For R2 and R3, P3 is selected. Behavior is as expected for this scenario. I am checking all scenarios to see potential improvements.

func getPrefixCacheMatchThresholdPercent() int {
value := utils.LoadEnv("AIBRIX_PREFIX_CACHE_MATCH_THRESHOLD_PERCENT", "")
func getPodRequestImbalanceAbsCount() int {
value := utils.LoadEnv("AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT", "")
Collaborator: Let's update the docs for the newly added envs.
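To seed that docs update, here are plausible one-line descriptions, inferred from the variable names and the PR description (the wording is an assumption, not the shipped docs):

  • AIBRIX_PREFIX_CACHE_MATCH_THRESHOLD_PERCENT: minimum prefix-match percentage a pod must reach before it is treated as a prefix-cache hit.
  • AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT: absolute max-min gap in per-pod running-request counts above which the pre-match check considers the load imbalanced.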

Collaborator: Different GPUs have different compute capacities, so using absolute numbers doesn't look that scalable. How can we generate this number?

varungup90 (Collaborator, Author): We can leverage the GPU optimizer to generate recommendations for the absolute numbers that define imbalance.

@Jeffwan (Collaborator) Apr 7, 2025: We do not need to block this PR, but this is something we have to consider in the future. Going that way would introduce a dependency on the gpu-optimizer, which makes the solution a little heavy. /cc @zhangjyr

@varungup90 (Collaborator, Author) commented Apr 3, 2025


@Jeffwan I tried the second standard deviation to load-balance across the prefix-matched pods: TTFT improves by another ~10% and the number of blips reduces. Initially my concern was that packing more requests onto the same pod could have an adverse effect, but since these extra packed requests already have their KV cache present, overall performance is not negatively affected. Updated the PR description with datapoints and results.

@varungup90 varungup90 force-pushed the prefix-cache-improvements branch from 32d931a to 76d0551 on April 7, 2025 21:26
@Jeffwan (Collaborator) left a comment:

/lgtm excellent!

@varungup90 varungup90 merged commit d546795 into vllm-project:main Apr 8, 2025
11 checks passed
@Jeffwan Jeffwan deleted the prefix-cache-improvements branch April 8, 2025 04:55
@@ -88,12 +89,10 @@ func (e *LRUStore[K, V]) Put(key K, value V) bool {
}

func (e *LRUStore[K, V]) Get(key K) (V, bool) {
Contributor: I don't quite understand why the relevant entry isn't updated here. Shouldn't the corresponding entry be moved to the front of the lruList when there is a matching request?

varungup90 (Collaborator, Author):

Two reasons to remove the update here:

  1. Right now the LRU store is only used by the hash table implementation. In MatchPrefix, it calls the Get method to retrieve the blocks, but all of these blocks are re-iterated in AddPrefix, which calls the Put method on the LRUStore, where they are updated with the current time and moved to the head.
  2. After removing the update from the Get method, its lock can be changed to an RLock, which improves performance a little. Once the LRUStore implements a sync map and does not require locking, we can add the update + move-to-head back into the Get method.
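For illustration, a read-only Get under a sync.RWMutex might look like this (a sketch only; the field names mu and entries and the entry type are assumptions, not the actual LRUStore internals):

```go
package main

import "sync"

// Assumed internal shape of LRUStore, for this sketch only.
type entry[K comparable, V any] struct {
	key   K
	value V
}

type LRUStore[K comparable, V any] struct {
	mu      sync.RWMutex
	entries map[K]*entry[K, V]
}

// Get takes only a read lock and deliberately skips any move-to-front
// recency update, per reasons (1) and (2) above; recency is refreshed when
// AddPrefix later calls Put on the same blocks.
func (e *LRUStore[K, V]) Get(key K) (V, bool) {
	e.mu.RLock()
	defer e.mu.RUnlock()

	if elem, ok := e.entries[key]; ok {
		return elem.value, true
	}
	var zero V
	return zero, false
}
```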

gangmuk pushed a commit to gangmuk/aibrix-gangmuk that referenced this pull request Jun 21, 2025
* Performance improvements for prefix cache routing

Signed-off-by: Varun Gupta <varungup90@gmail.com>
@justadogistaken (Contributor): I would like to inquire about the two algorithms, prefix-cache and prefix-cache-preble. Which one performs better, and how significant is the difference?

Yaegaki1Erika pushed a commit to Yaegaki1Erika/aibrix that referenced this pull request Jul 23, 2025
* Performance improvements for prefix cache routing

Signed-off-by: Varun Gupta <varungup90@gmail.com>