Skip to content

Conversation

zhangjyr
Copy link
Collaborator

Pull Request Description

Refactoring for cache:

  1. Merge multiple pod, model, and metric mapping by adding Pod metadata and Model metadata and using two main thread-safe registries for metadata.
  2. Eliminate the global cache mutex lock on reading and replace it with multiple layers of locks in the metadata.
  3. Cache serves in different contexts: control manager, metadata, and gateway. Added cache initialization helpers for each context for later customization.
  4. Added more unit test cases to cover the pod and model adapter changes.

Refactoring for router

  1. Eliminate thread-unsafe map access in the Router interface.
  2. Merge two contexts, context.Context and routing.RoutingContext, as RoutingContext
  3. Abstract away the Router interface and the RoutingContext for shared access from both the routing and cache package.
  4. RoutingContext now supports both sync and async routing resolution by following the Promise paradigm.

Related Issues

Resolves: #868

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@zhangjyr zhangjyr changed the title [Misc] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. [Misc][API] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. Mar 19, 2025
@Jeffwan
Copy link
Collaborator

Jeffwan commented Mar 19, 2025

Some code changes might be conflict with #878. that's not a problem now. We can review the core logic first and then rebase the changes later.

@Jeffwan
Copy link
Collaborator

Jeffwan commented Mar 21, 2025

I will have a check today

@zhangjyr zhangjyr force-pushed the feature/cache_router_refactor branch from dd7d52d to cea9419 Compare March 24, 2025 21:36
// Parameters:
// modelName: Name of the model
// Returns:
// bool: True if model exists, false otherwise
GetModel(modelName string) bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you delete ListPods and GetModel in the interface?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no use case for ListPods except in unit tests. For this reason, I only provide a non-efficient implementation to return all pods in the cache and delete the function from the API.

The name of GetModel implies it will return some model object. However, the original GetModel API only returns true or false. HasModel would be a more proper name for this API.

// ListPodsByModel gets pods associated with a model
// Parameters:
// modelName: Name of the model
// Returns:
// map[string]*v1.Pod: Pod objects matching the criteria
// error: Error information if operation fails
ListPodsByModel(modelName string) (map[string]*v1.Pod, error)
Copy link
Collaborator

@Jeffwan Jeffwan Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please help me recap the motivation here. the core part is to refactor the cache store registry with better and efficiency data structure? could you remind me the painpoint here? is it just for code refactor or there's some challenges extend it or performance issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original API exposes a thread-unsafe map object while the implementation returns what is stored in the cache without making a copy. This will cause panic on map read accessing while the map is written in the Lock() context.

Because there is no need for map keys, the new implementation returns a slice with read-through-cache acceleration.

@@ -73,7 +70,7 @@ type ModelCache interface {
// Returns:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we have other informations in future? in that case, we may need complex structure as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is possible. As *v1.Pod suggests, if our model(CRD) definition contains more options, such as labels, we may need to update returns here with complex structure. In this case, I would suggest adding the model definition to github.com/vllm-project/aibrix/api/model/v1alpha1

// requestID: Unique request identifier
// modelName: Name of the model
// Returns:
// int64: Trace term identifier
AddRequestCount(requestID string, modelName string) (traceTerm int64)
AddRequestCount(ctx *types.RoutingContext, requestID string, modelName string) (traceTerm int64)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's an interface change . AddRequestCount and DoneRequestCount is moved under TraceCache

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, AddRequestCount is paired with DoneRequestCount or DoneRequestTrace. For example
AddRequestCount + DoneRequestCount or AddRequestCount + DoneRequestTrace. While these functions were designed for trace only, they are also used as an entry point for real-time statistics. In contrast, the metric cache offers an interface for accessing cached metrics/real-time statistics.
In summary:

  1. MetricCache is used for reading metrics
  2. TraceCache is used to generate real-time statistics besides trace. I think MetricCollector may be a better name for this interface. I welcome other options.
    Related changes:
    enableGPUOptimizerTracing is moved from gateway to cache to offer:
  3. While enableGPUOptimizerTracing = false, the trace serving routine (for writing to redis) will not start altogether.
  4. The trace collection operations in AddRequestCount, DoneRequestCount, and DoneRequestTrace can be disabled individually.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed TraceCache to RequestTracker

defer c.mu.RUnlock()

pod, ok := c.Pods[podName]
metaPod, ok := c.metaPods.Load(podName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to get more context here. (Sorry I didn't clearly review the issue yet). original way is very common in golang, what's the problem?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R lock is ok. Since we use sync.map to replace thread-unsafe maps, Rlock is not needed anymore.
Using sync.map lower the lock granularity from cache-wide to map-wide.

)

const (
DeploymentIdentifier string = "app.kubernetes.io/name"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some scenarios like distributed inference use other orchestrators and we may not have this label.
better to have a group identifier could be configured

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking maybe client-go indexer is another option if we think map[]xxx is not efficient

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me define a new interface to replace PodArray. We can replace the PodArray with Indexer if that's more efficient later.
For DeploymentIdentifier, I can also add something like PodGrouper for alternative implementations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A types.PodList is added to decouple PodArray implementation. I hardly think the Indexer in the informer may be helpful in our context because we have retrieved pods' information and cached out of informer caches.

@Jeffwan
Copy link
Collaborator

Jeffwan commented Mar 24, 2025

/cc @Xunzhuo @varungup90 Please help review it.

@zhangjyr this is still a huge PR, if we have some agreement on the data structure and helper utilities. I suggest to add those add only code first to make this code more simpler

@Xunzhuo
Copy link
Member

Xunzhuo commented Mar 25, 2025

I think we need to let e2e test passed before reviews

@zhangjyr zhangjyr requested a review from Jeffwan March 26, 2025 01:22
Copy link
Collaborator

@varungup90 varungup90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, overall LGTM. Left some minor comments.


case *extProcPb.ProcessingRequest_RequestBody:
resp, model, targetPodIP, stream, traceTerm = s.HandleRequestBody(ctx, requestID, req, user, routingStrategy)
resp, model, routerCtx, stream, traceTerm = s.HandleRequestBody(ctx, requestID, req, user, routingAlgorithm)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically model and stream can be added to routingCtx as well. Does not need to be part of this PR, it can be done in follow up PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

routerCtx may not be available for requests that have not specified a routing strategy. For now, the "model" and "stream" are still needed here. However, since the routing context also included the RoutingAlgorithm field, it is possible to eliminate these fields later.

klog.ErrorS(nil, "incorrect routing strategy", "routing-strategy", routingStrategy)
return generateErrorResponse(
envoyTypePb.StatusCode_BadRequest,
[]*configPb.HeaderValueOption{{Header: &configPb.HeaderValue{
Key: HeaderErrorInvalidRouting, RawValue: []byte(routingStrategy),
}}}, "incorrect routing strategy"), utils.User{}, rpm, routingStrategy
}}}, "incorrect routing strategy"), utils.User{}, rpm, routingAlgorithm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In text "incorrect routing algorithm". To a global replace for strategy -> algorithm

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep strategy as user-faced string variable. I don't plan to change all header staff. And algorithms be valid strategies that verifed.

…urrent safety and stateful routing

Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
@zhangjyr zhangjyr force-pushed the feature/cache_router_refactor branch from f8a1af8 to 4d7347c Compare March 27, 2025 18:33
@Jeffwan
Copy link
Collaborator

Jeffwan commented Mar 27, 2025

@Xunzhuo jingyuan fixed the integration test issues, do you get a chance to help take a look?

@zhangjyr zhangjyr requested a review from varungup90 March 27, 2025 20:23
@Jeffwan
Copy link
Collaborator

Jeffwan commented Mar 28, 2025

@Xunzhuo Just wanted to loop you in—this PR is currently blocking some performance-related changes (prefix-cache aware routing) that community users are waiting for. it's quite large and deserves more attention. We plan to merge it to unblock some following changes this time as an exception.

  • If you have some bandwidth, please help review this change and @zhangjyr will address those comments in future PRs
  • We do want to avoid such large PRs in the future. Big PRs like this are really tough to review properly and can slow everyone down. Contributors should split their changes into smaller, reviewable chunks to make the process smoother for everyone.

@Jeffwan Jeffwan changed the title [Misc][API] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. [API] Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing. Mar 28, 2025
@Jeffwan Jeffwan merged commit 622a618 into vllm-project:main Mar 28, 2025
11 checks passed
gangmuk pushed a commit to gangmuk/aibrix-gangmuk that referenced this pull request Jun 21, 2025
…rent safety and stateful routing. (vllm-project#884)

Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing

Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Co-authored-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Yaegaki1Erika pushed a commit to Yaegaki1Erika/aibrix that referenced this pull request Jul 23, 2025
…rent safety and stateful routing. (vllm-project#884)

Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing

Signed-off-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Co-authored-by: Jingyuan Zhang <jingyuan.zhang0929@bytedance.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC]: Cache and Router refactoring for concurrent performance, concurrent safety and stateful routing.
4 participants