Description
Problem
Hi dstack team and community,
First off, thanks for creating dstack! It's a fantastic tool that has really simplified our inference infrastructure.
We use dstack for our LLM inference workloads.
One significant optimization for LLMs is KV cache reuse (prefix caching), where computation for initial tokens (prefill) can be skipped if the prefix matches a previous request, significantly improving performance.
See:
- SGLang: https://lmsys.org/blog/2024-01-17-sglang/
- vLLM: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
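To make the prefix-reuse idea concrete, here's a toy sketch of block-level prefix caching (the scheme vLLM's automatic prefix caching is built on). This is purely illustrative: the block size, names, and string "tokens" are made up, and real engines hash token blocks and manage actual KV tensors.

```python
# Toy illustration of block-level prefix caching: prompts are split into
# fixed-size token blocks, and a block's KV state can be reused if the same
# block (with the same prefix before it) was computed for an earlier request.
BLOCK = 4  # tokens per block (real engines use larger blocks, e.g. 16)
cached_blocks: set[tuple[str, ...]] = set()

def prefill_blocks(tokens: list[str]) -> int:
    """Return how many tokens still need prefill after cache hits."""
    to_prefill = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        # A block is identified by the full prefix up to and including it,
        # so a hit implies everything before it was also cached.
        key = tuple(tokens[: i + BLOCK])
        if key in cached_blocks:
            continue
        cached_blocks.add(key)
        to_prefill += BLOCK
    return to_prefill + len(tokens) % BLOCK  # partial tail block is recomputed

system = "you are a helpful assistant answer concisely please".split()
print(prefill_blocks(system + "what is dstack".split()))        # cold: all 11 tokens prefilled
print(prefill_blocks(system + "how do i deploy vllm".split()))  # warm: shared system-prompt blocks skipped
```

The second request only prefills the tokens after the shared system prompt, which is exactly the TTFT saving described above.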
Currently, dstack's service load balancing likely uses standard strategies (e.g., round-robin).
While great for general load distribution, these policies are cache-agnostic.
A request is often sent to an instance that doesn't have the relevant prefix cached, even if another instance does. This leads to redundant prefill computation, resulting in higher Time To First Token (TTFT) and lower overall throughput.
Proposal: Add Cache-Aware Routing to dstack Services
Inspired by recent work in systems like SGLang, Nvidia Dynamo, and KubeAI, I propose adding support for a cache-aware routing policy to help better meet SLOs for LLM inference.
The goal is to route incoming inference requests to the specific inference instance that is most likely to have the request's prefix already in its KV cache.
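For concreteness, here is a rough sketch of the kind of policy I have in mind. This is not dstack's API and all names (`CacheAwareRouter`, `min_match_chars`, etc.) are hypothetical; production routers such as SGLang's match on radix trees over token IDs and account for cache capacity and eviction, but the routing decision has the same shape: prefer the replica with the best expected prefix hit, and fall back to load-based selection when no replica has a useful match.

```python
from collections import defaultdict

class CacheAwareRouter:
    """Illustrative only: route each request to the replica most likely to
    already hold the prompt's prefix in its KV cache, falling back to the
    least-loaded replica when no replica has a useful match."""

    def __init__(self, replicas: list[str], min_match_chars: int = 64):
        self.replicas = replicas
        self.min_match_chars = min_match_chars
        self.routed_prompts: dict[str, list[str]] = defaultdict(list)
        self.in_flight: dict[str, int] = {r: 0 for r in replicas}

    def _match_len(self, replica: str, prompt: str) -> int:
        """Longest shared prefix between `prompt` and anything sent to `replica`."""
        best = 0
        for prev in self.routed_prompts[replica]:
            n = 0
            for a, b in zip(prev, prompt):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def route(self, prompt: str) -> str:
        # Score each replica by expected cache hit; break ties by current load.
        match, _, replica = max(
            (self._match_len(r, prompt), -self.in_flight[r], r) for r in self.replicas
        )
        if match < self.min_match_chars:
            # No replica has a meaningful prefix cached: behave like least-loaded.
            replica = min(self.replicas, key=lambda r: self.in_flight[r])
        self.routed_prompts[replica].append(prompt)
        self.in_flight[replica] += 1
        return replica

    def finished(self, replica: str) -> None:
        self.in_flight[replica] -= 1
```

A real implementation would also need to bound how much cache affinity can skew the load distribution (KubeAI's CHWBL approach is one way to do this), so a few hot prefixes don't overload a single replica.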
Benefits
- Improved Inference Performance: Significantly lower latency (especially TTFT) and higher throughput, particularly for workloads with shared prefixes (e.g., conversational AI, RAG systems with common instructions/prompts).
- Better Resource Utilization: Reduces redundant computation across instances, potentially allowing for serving higher loads with the same hardware or reducing costs.
Call for Discussion:
I'm starting work on a private fork to add this capability to dstack and would love to contribute it upstream when ready!
Would this be a useful contribution to the community?
If so, it'd be great to discuss the high-level implementation approach here to align efforts and ensure a smooth contribution process later on.
Thanks for considering my proposal :)
References
- SGLang Router Design: https://docs.google.com/document/d/1cCqK3dh7ZR_rUPkcZT2cr0kLnAxv6_Sd-P1q37-3RNQ/edit?tab=t.0
- KubeAI Load Balancing: https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/
- Nvidia Dynamo KV Cache Routing: https://github.com/ai-dynamo/dynamo/blob/main/docs/kv_cache_routing.md
Solution
No response
Workaround
No response
Would you like to help us implement this feature by sending a PR?
Yes