Performance improvements for prefix cache routing #933
Conversation
Pull Request Overview
This PR improves the performance of prefix cache routing by tracking active requests through the gateway plugin and adding load-aware routing based on pod request metrics. Key changes include:
- Refactoring the prefix cache indexer with a new hash-based tokenizer and improved prefix matching logic (a rough sketch of the block-hashing idea follows this list).
- Implementing load imbalance detection using mean and standard deviation of pod request counts.
- Adjusting test cases, configuration parameters, and related utilities to support these new routing features.
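The indexer code itself is not quoted in this thread, so the sketch below only illustrates the general block-hashing technique, not the PR's actual implementation. The names (`prefixBlockHashes`, `blockSize`), the FNV hash, and the chaining scheme are assumptions: the input is split into fixed-size blocks, each block's hash is chained to the previous one, and matching a request prefix against the cache then reduces to looking up consecutive block hashes until the first miss.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// blockSize is illustrative; the PR makes the real block size configurable.
const blockSize = 16

// prefixBlockHashes splits a byte sequence into fixed-size blocks and returns
// one hash per full block. Each hash is chained to the previous block's hash,
// so a block hash identifies the entire prefix up to and including that block.
func prefixBlockHashes(input []byte) []uint64 {
	var hashes []uint64
	var prev uint64
	for start := 0; start+blockSize <= len(input); start += blockSize {
		h := fnv.New64a()
		// Feed the previous chained hash first, then the block contents.
		var prevBytes [8]byte
		for i := 0; i < 8; i++ {
			prevBytes[i] = byte(prev >> (8 * i))
		}
		h.Write(prevBytes[:])
		h.Write(input[start : start+blockSize])
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

func main() {
	// The two inputs share their first 32 bytes, so their first two block
	// hashes match and the third differs.
	a := prefixBlockHashes([]byte("the quick brown fox jumps over the lazy dog, again and again"))
	b := prefixBlockHashes([]byte("the quick brown fox jumps over the lazy dog, but differently"))
	fmt.Println(a)
	fmt.Println(b)
}
```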
Reviewed Changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/plugins/gateway/algorithms/tokenizer/characters.go | Removed in-file tests and refactored tokenizer initialization. |
| pkg/plugins/gateway/algorithms/router_test.go | Updated test contexts and model parameters for routing tests. |
| pkg/plugins/gateway/algorithms/prefixcacheindexer/{hash.go, tree_test.go, hash_test.go} | Modified indexing logic, prefix matching functions, and test stubs for E2E testing. |
| pkg/plugins/gateway/algorithms/prefix_cache_test.go | Enhanced E2E tests with additional routing scenarios and logging for diagnostics. |
| pkg/plugins/gateway/algorithms/prefix_cache_and_load.go | Fixed import paths to align with new module structure. |
| pkg/plugins/gateway/algorithms/prefix_cache.go | Introduced load-aware routing using mean and std. deviation; refactored fallback logic. |
| pkg/plugins/gateway/algorithms/least_request.go | Updated to use realtime metrics in the least-request routing strategy. |
| pkg/cache/{cache_metrics.go, cache_init.go} | Adjusted pod metric updates and test cache initialization to include model context. |
| config/{overlays/release/default_patch.yaml, gateway-plugin.yaml} | Revised environment variable settings for prefix cache block size. |
| benchmarks/client/client.py | Captured and logged request IDs from response headers for better traceability. |
Comments suppressed due to low confidence (1)
pkg/plugins/gateway/algorithms/prefix_cache.go:58
- The use of fmt.Println for logging in production code could clutter standard output; consider using the logging library (e.g., klog) consistently for debug and info logging instead.
```go
fmt.Println(input)
```
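A minimal sketch of what this suggestion amounts to, assuming the enclosing file already imports k8s.io/klog/v2 (it appears in the import block quoted below). The helper name, verbosity level, and log keys are illustrative, not the PR's actual code:

```go
package main

import "k8s.io/klog/v2"

// debugLogInput is a hypothetical helper showing the suggested pattern:
// route debug output through a klog verbosity level instead of fmt.Println,
// so it can be enabled with -v=4 and stays off stdout by default.
func debugLogInput(input string) {
	klog.V(4).InfoS("prefix cache routing input", "input", input)
}

func main() {
	defer klog.Flush()
	debugLogInput("example request body")
}
```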
"k8s.io/klog/v2" | ||
) | ||
|
||
const ( | ||
defaultPrefixCacheBlockNumber = 200000 | ||
defaultPrefixCacheBlockSize = 16 | ||
defaultPrefixCacheBlockSize = 4 |
The constant defaultPrefixCacheBlockSize is hardcoded to 4, but configuration overlays set AIBRIX_PREFIX_CACHE_BLOCK_SIZE to other values (e.g., 128 and 32). Consider reading this value from an environment variable so that the runtime configuration is consistent with the expectations in deployment.
```diff
-	defaultPrefixCacheBlockSize = 4
+	defaultPrefixCacheBlockSize = getDefaultPrefixCacheBlockSize()
```
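One caveat with the suggested replacement: a Go const cannot be initialized from a function call, so the declaration would need to become a package-level var. A sketch of what such a helper might look like, reading the AIBRIX_PREFIX_CACHE_BLOCK_SIZE variable mentioned in the comment; the package name and the use of os.Getenv (rather than the repo's utils.LoadEnv wrapper) are assumptions:

```go
// Package routingalgorithms is a hypothetical package name for this sketch.
package routingalgorithms

import (
	"os"
	"strconv"
)

// getDefaultPrefixCacheBlockSize reads AIBRIX_PREFIX_CACHE_BLOCK_SIZE from the
// environment and falls back to 4 when the variable is unset or invalid. The
// real code would presumably go through the repo's utils.LoadEnv wrapper; plain
// os.Getenv is used here only to keep the sketch self-contained.
func getDefaultPrefixCacheBlockSize() int {
	if value := os.Getenv("AIBRIX_PREFIX_CACHE_BLOCK_SIZE"); value != "" {
		if size, err := strconv.Atoi(value); err == nil && size > 0 {
			return size
		}
	}
	return 4
}

// Since a const cannot call a function, the suggested declaration would become:
var defaultPrefixCacheBlockSize = getDefaultPrefixCacheBlockSize()
```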
```go
func getPrefixCacheMatchThresholdPercent() int {
	value := utils.LoadEnv("AIBRIX_PREFIX_CACHE_MATCH_THRESHOLD_PERCENT", "")

func getPodRequestImbalanceAbsCount() int {
	value := utils.LoadEnv("AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT", "")
```
Let's update the docs for the newly added envs.
Different GPUs have different compute capacities, so using an absolute number doesn't look that scalable. How can we generate this number?
We can leverage the GPU optimizer to generate recommendations for the absolute numbers that define imbalance.
We do not need to block this PR on it, but it is something we have to consider in the future. Going that route would introduce a dependency on the gpu-optimizer, which makes the solution a bit heavy. /cc @zhangjyr
@Jeffwan I tried using the second standard deviation to load-balance across prefix-matched pods: TTFT improves by a further ~10% and the number of blips is reduced. My initial concern was that packing more requests onto the same pod could have an adverse effect, but since these extra packed requests already have their KV cache on that pod, overall performance is not negatively affected. I've updated the PR description with the data points and results.
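The load-aware check discussed here is not quoted in this thread; the sketch below only illustrates the general idea, with illustrative names and data rather than the PR's actual code: compute the mean and standard deviation of per-pod running-request counts and keep routing to a prefix-matched pod while its count stays within mean + 2*stddev (the "second standard deviation"), otherwise fall back to a less loaded pod.

```go
package main

import (
	"fmt"
	"math"
)

// meanAndStdDev returns the mean and population standard deviation of counts.
func meanAndStdDev(counts []float64) (mean, stdDev float64) {
	if len(counts) == 0 {
		return 0, 0
	}
	for _, c := range counts {
		mean += c
	}
	mean /= float64(len(counts))
	var variance float64
	for _, c := range counts {
		variance += (c - mean) * (c - mean)
	}
	variance /= float64(len(counts))
	return mean, math.Sqrt(variance)
}

// withinLoadBound reports whether a prefix-matched pod's running-request count
// is still acceptable under a mean + stdDevFactor*stdDev threshold.
func withinLoadBound(podCount float64, counts []float64, stdDevFactor float64) bool {
	mean, stdDev := meanAndStdDev(counts)
	return podCount <= mean+stdDevFactor*stdDev
}

func main() {
	running := []float64{12, 9, 30, 11} // illustrative running-request counts per pod
	// With a factor of 2 (the "second standard deviation") the busy
	// prefix-matched pod (30) is still within bounds, so it keeps the request;
	// with a factor of 1 it exceeds mean+stdDev (~23.9) and triggers fallback.
	fmt.Println(withinLoadBound(30, running, 2)) // true
	fmt.Println(withinLoadBound(30, running, 1)) // false
}
```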
/lgtm excellent!
```diff
@@ -88,12 +89,10 @@ func (e *LRUStore[K, V]) Put(key K, value V) bool {
 }
 
 func (e *LRUStore[K, V]) Get(key K) (V, bool) {
```
I don't quite understand why the relevant entry isn't updated here. Shouldn't the corresponding entry be moved to the front of the lruList when there is a matching request?
Two reasons to remove the update here:
- Right now the LRU store is only used by the hash table implementation. In MatchPrefix, it calls the Get method to retrieve the blocks, but all of those blocks are re-iterated in AddPrefix, which calls Put on the LRUStore; there they are updated with the current time and moved to the head.
- After removing the update from the Get method, its lock can be changed to an RLock, which improves performance a little. Once the LRUStore is implemented over a sync map and no longer requires locking, we can add the update + move-to-head back into Get.
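A rough sketch of the trade-off described above (illustrative types and names, not the PR's actual LRUStore): Get only reads under an RLock and leaves recency untouched, while Put takes the write lock and moves the entry to the front, matching the MatchPrefix → AddPrefix flow where every matched block is re-Put anyway.

```go
// Package lru is a hypothetical package name; this is not the PR's LRUStore.
package lru

import (
	"container/list"
	"sync"
)

type entry[K comparable, V any] struct {
	key   K
	value V
}

// store is an illustrative LRU map in which only Put refreshes recency.
// Eviction is omitted for brevity.
type store[K comparable, V any] struct {
	mu    sync.RWMutex
	items map[K]*list.Element
	order *list.List // front = most recently Put
}

func newStore[K comparable, V any]() *store[K, V] {
	return &store[K, V]{
		items: make(map[K]*list.Element),
		order: list.New(),
	}
}

// Get reads under an RLock and deliberately does not move the entry to the
// front; callers re-Put matched entries (as AddPrefix does after MatchPrefix),
// and that is where recency gets refreshed.
func (s *store[K, V]) Get(key K) (V, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if elem, ok := s.items[key]; ok {
		return elem.Value.(*entry[K, V]).value, true
	}
	var zero V
	return zero, false
}

// Put inserts or updates an entry and moves it to the front under the write lock.
func (s *store[K, V]) Put(key K, value V) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if elem, ok := s.items[key]; ok {
		elem.Value.(*entry[K, V]).value = value
		s.order.MoveToFront(elem)
		return
	}
	s.items[key] = s.order.PushFront(&entry[K, V]{key: key, value: value})
}
```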
I would like to ask about the two algorithms, prefix-cache and prefix-cache-preble: which one performs better, and how significant is the difference?
Pull Request Description
Env

```bash
python3 benchmark/generator/generate_realistic_prefix_share_workload.py
```
Output: prefix_cache is ~45% faster compared to random routing.
Output summaries and detailed PDF reports:

- random routing: [summary screenshot] (random.pdf)
- least-request: [summary screenshot] (least-request.pdf)
- prefix_cache: [summary screenshot] (prefix_cache.pdf). Blips occur when the prefix-matched pods are overloaded and the fallback to a non-prefix-cache pod kicks in. [screenshot]
- prefix_cache with a 2*std_dev load factor: [summary screenshot] (prefix_cache_load_factor_2.pdf)

If we use the second standard deviation, i.e. pack more requests onto pods that already hold the prefix cache, the number of blips reduces as expected and overall P50/P99 TTFT improves by ~10% compared to the first standard deviation. [screenshot]
Related Issues
Resolves: #[Insert issue number(s)]
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
- [Bug]: Corrections to existing functionality
- [CI]: Changes to build process or CI pipeline
- [Docs]: Updates or additions to documentation
- [API]: Modifications to aibrix's API or interface
- [CLI]: Changes or additions to the Command Line Interface
- [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.