
[Feature]: Implement Cache-Aware Routing for dstack Services #2453

Open
@silentlustre


Problem

Hi dstack team and community,

First off, thanks for creating dstack! It's a fantastic tool that has really simplified our inference infrastructure.

We use dstack for our LLM inference workloads.
One significant optimization for LLM inference is KV cache reuse (prefix caching): when a request's prefix matches that of a previous request, the prefill computation for the shared tokens can be skipped, which significantly improves performance.
See:

Currently, dstack's service load balancing likely uses standard strategies (e.g., round-robin).
While great for general load distribution, these policies are cache-agnostic.
As a result, a request is often sent to an instance that doesn't have the relevant prefix cached, even if another instance does. This leads to redundant prefill computation, which means higher Time To First Token (TTFT) and lower overall throughput.
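To make the redundant prefill concrete, here is a toy sketch (not dstack code) that counts how many tokens have to be prefilled when requests sharing a system prompt are spread round-robin over replicas. The `Replica` class, `round_robin_cost`, and the assumption that each replica keeps every sequence it has seen (no eviction) are all hypothetical simplifications for illustration.

```python
from itertools import cycle


def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Replica:
    """Toy replica that remembers every token sequence it has processed.

    Assumption: an unbounded cache with no eviction, which real KV caches do not have.
    """

    def __init__(self):
        self.seen = []

    def prefill_cost(self, tokens):
        """Tokens that still need prefill after reusing the longest cached prefix."""
        reused = max((shared_prefix_len(tokens, s) for s in self.seen), default=0)
        self.seen.append(list(tokens))
        return len(tokens) - reused


def round_robin_cost(requests, num_replicas):
    """Total prefilled tokens when requests are assigned to replicas in rotation."""
    replicas = [Replica() for _ in range(num_replicas)]
    rotation = cycle(replicas)
    return sum(next(rotation).prefill_cost(r) for r in requests)


# Four requests: an 8-token shared system prompt plus 2 unique user tokens each.
system_prompt = list(range(8))
requests = [system_prompt + [100 + i, 200 + i] for i in range(4)]

for n in (1, 2, 4):
    print(n, "replica(s):", round_robin_cost(requests, n), "tokens prefilled")
```

In this toy setup a single replica prefills 16 tokens in total, two replicas prefill 24, and four replicas prefill 40, because every replica has to compute the shared system prompt once for itself.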

Proposal: Add Cache-Aware Routing to dstack Services

Inspired by recent work in systems like SGLang, Nvidia Dynamo, and KubeAI, I propose adding support for a cache-aware routing policy to help better meet SLOs for LLM inference.
The goal is to route incoming inference requests to the specific inference instance that is most likely to have the request's prefix already in its KV cache.
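As a starting point for discussion, here is a minimal sketch of what such a policy could look like, assuming the router can see the tokenized request and keeps its own approximate record of what each replica has served. `CacheAwareRouter`, its `route` method, and the tie-break on request count are hypothetical names and choices, not existing dstack APIs, and a real implementation would more likely use a radix tree or prefix hashes than a linear scan.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class CacheAwareRouter:
    """Hypothetical router: prefer the replica with the longest matching prefix.

    Assumptions: the router tracks routed sequences itself (it does not query
    the replicas' real KV caches), nothing is ever evicted, and 'load' is just
    the number of requests routed so far rather than in-flight requests.
    """

    def __init__(self, replica_ids):
        self.history = {rid: [] for rid in replica_ids}
        self.load = {rid: 0 for rid in replica_ids}

    def route(self, tokens):
        def score(rid):
            match = max(
                (shared_prefix_len(tokens, s) for s in self.history[rid]),
                default=0,
            )
            # Longest prefix match wins; among equal matches, the least-loaded replica wins.
            return (match, -self.load[rid])

        best = max(self.history, key=score)
        self.history[best].append(list(tokens))
        self.load[best] += 1
        return best


router = CacheAwareRouter(["replica-a", "replica-b"])
first = router.route([1, 2, 3, 4])    # no cache anywhere, goes to the first replica
second = router.route([9, 9, 9])      # no match, goes to the less-loaded replica
third = router.route([1, 2, 3, 5])    # shares 3 tokens with the first request
fourth = router.route([9, 9, 1])      # shares 2 tokens with the second request
print(first, second, third, fourth)
```

Falling back to load when no prefix matches keeps the policy from piling everything onto one replica; how to weigh cache affinity against load imbalance is one of the design questions worth discussing here.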

Benefits

  • Improved Inference Performance: Significantly lower latency (especially TTFT) and higher throughput, particularly for workloads with shared prefixes (e.g., conversational AI, RAG systems with common instructions/prompts).
  • Better Resource Utilization: Reduces redundant computation across instances, potentially allowing for serving higher loads with the same hardware or reducing costs.

Call for Discussion:
I'm starting work on a private fork to add this capability to dstack and would love to contribute it upstream when ready!

Would this be a useful contribution to the community?
If so, it'd be great to discuss the high-level implementation approach here to align efforts and ensure a smooth contribution process later on.

Thanks for considering my proposal :)

Solution

No response

Workaround

No response

Would you like to help us implement this feature by sending a PR?

Yes
