Description
Problem
Hi dstack team and community,
First off, thanks for creating dstack! It's a fantastic tool that has really simplified our inference infrastructure.
We use dstack for our LLM inference workloads.
One significant optimization for LLMs is KV cache reuse (prefix caching), where computation for initial tokens (prefill) can be skipped if the prefix matches a previous request, significantly improving performance.
See:
- SGLang: https://lmsys.org/blog/2024-01-17-sglang/
- vLLM: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
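To make the prefix-reuse idea concrete, here's a toy sketch of block-level prefix caching (the scheme vLLM's automatic prefix caching is built on). This is purely illustrative: the block size, names, and string "tokens" are made up, and real engines hash token blocks and manage actual KV tensors.

```python
# Toy illustration of block-level prefix caching: prompts are split into
# fixed-size token blocks, and a block's KV state can be reused if the same
# block (with the same prefix before it) was computed for an earlier request.
BLOCK = 4  # tokens per block (real engines use larger blocks, e.g. 16)
cached_blocks: set[tuple[str, ...]] = set()

def prefill_blocks(tokens: list[str]) -> int:
    """Return how many tokens still need prefill after cache hits."""
    to_prefill = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        # A block is identified by the full prefix up to and including it,
        # so a hit implies everything before it was also cached.
        key = tuple(tokens[: i + BLOCK])
        if key in cached_blocks:
            continue
        cached_blocks.add(key)
        to_prefill += BLOCK
    return to_prefill + len(tokens) % BLOCK  # partial tail block is recomputed

system = "you are a helpful assistant answer concisely please".split()
print(prefill_blocks(system + "what is dstack".split()))        # cold: all 11 tokens prefilled
print(prefill_blocks(system + "how do i deploy vllm".split()))  # warm: shared system-prompt blocks skipped
```

The second request only prefills the tokens after the shared system prompt, which is exactly the TTFT saving described above.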
Currently, dstack's service load balancing likely uses standard strategies (e.g., round-robin).
While great for general load distribution, these policies are cache-agnostic.
A request is often sent to an instance that doesn't have the relevant prefix cached, even if another instance does. This leads to redundant prefill computation, resulting in higher Time To First Token (TTFT) and lower overall throughput.
Proposal: Add Cache-Aware Routing to dstack Services
Inspired by recent work in systems like SGLang, Nvidia Dynamo, and KubeAI, I propose adding support for a cache-aware routing policy to help better meet SLOs for LLM inference.
The goal is to route incoming inference requests to the specific inference instance that is most likely to have the request's prefix already in its KV cache.
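For concreteness, here is a rough sketch of the kind of policy I have in mind. This is not dstack's API and all names (`CacheAwareRouter`, `min_match_chars`, etc.) are hypothetical; production routers such as SGLang's match on radix trees over token IDs and account for cache capacity and eviction, but the routing decision has the same shape: prefer the replica with the best expected prefix hit, and fall back to load-based selection when no replica has a useful match.

```python
from collections import defaultdict

class CacheAwareRouter:
    """Illustrative only: route each request to the replica most likely to
    already hold the prompt's prefix in its KV cache, falling back to the
    least-loaded replica when no replica has a useful match."""

    def __init__(self, replicas: list[str], min_match_chars: int = 64):
        self.replicas = replicas
        self.min_match_chars = min_match_chars
        self.routed_prompts: dict[str, list[str]] = defaultdict(list)
        self.in_flight: dict[str, int] = {r: 0 for r in replicas}

    def _match_len(self, replica: str, prompt: str) -> int:
        """Longest shared prefix between `prompt` and anything sent to `replica`."""
        best = 0
        for prev in self.routed_prompts[replica]:
            n = 0
            for a, b in zip(prev, prompt):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def route(self, prompt: str) -> str:
        # Score each replica by expected cache hit; break ties by current load.
        match, _, replica = max(
            (self._match_len(r, prompt), -self.in_flight[r], r) for r in self.replicas
        )
        if match < self.min_match_chars:
            # No replica has a meaningful prefix cached: behave like least-loaded.
            replica = min(self.replicas, key=lambda r: self.in_flight[r])
        self.routed_prompts[replica].append(prompt)
        self.in_flight[replica] += 1
        return replica

    def finished(self, replica: str) -> None:
        self.in_flight[replica] -= 1
```

A real implementation would also need to bound how much cache affinity can skew the load distribution (KubeAI's CHWBL approach is one way to do this), so a few hot prefixes don't overload a single replica.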
Benefits
- Improved Inference Performance: Significantly lower latency (especially TTFT) and higher throughput, particularly for workloads with shared prefixes (e.g., conversational AI, RAG systems with common instructions/prompts).
- Better Resource Utilization: Reduces redundant computation across instances, potentially allowing for serving higher loads with the same hardware or reducing costs.
Call for Discussion:
I'm starting work on a private fork to add this capability to dstack and would love to contribute it upstream when ready!
Would this be a useful contribution to the community?
If so, it'd be great to discuss the high-level implementation approach here to align efforts and ensure a smooth contribution process later on.
Thanks for considering my proposal :)
References
- SGLang Router Design: https://docs.google.com/document/d/1cCqK3dh7ZR_rUPkcZT2cr0kLnAxv6_Sd-P1q37-3RNQ/edit?tab=t.0
- KubeAI Load Balancing: https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/
- Nvidia Dynamo KV Cache Routing: https://github.com/ai-dynamo/dynamo/blob/main/docs/kv_cache_routing.md
Solution
No response
Workaround
No response
Would you like to help us implement this feature by sending a PR?
Yes