Support prefill/decode disaggregation #34

### What problem are you facing?

Prefill is compute-bound and decode is memory-bandwidth-bound, so running them on the same GPU pool means optimizing for neither. Modelplane has no way to express disaggregated serving today.

### How could Modelplane help solve your problem?

LLM inference has two phases. Prefill processes all the input tokens in parallel; it's compute-bound and determines time-to-first-token. Decode generates output tokens one at a time; it's memory-bandwidth-bound and determines inter-token latency. When both phases share a GPU pool, a prefill burst interferes with in-flight decodes and causes unpredictable tail latency, and you can't tune parallelism independently for each phase.

Disaggregation runs them on separate pools. A prefill instance processes the prompt, then transfers its KV cache (the intermediate state it computed for every input token) to a decode instance over RDMA. This lets you use a different parallelism strategy per phase: for example, TP=1 prefill replicas for throughput and TP=4 decode replicas for latency. llm-d's benchmarks show 40-50% lower end-to-end latency for workloads with long input sequences (high input:output token ratio). It's not always the right choice: short prompts and small models don't benefit enough to justify the KV cache transfer overhead. But for large models with long contexts it's a significant win.
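To get a feel for the transfer overhead, here is a rough back-of-envelope sketch. The model shape (126 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption for a Llama-3.1-405B-like model, the 400 Gbit/s link is a hypothetical figure, and the estimate ignores protocol overhead, paged-block layout, and any overlap of transfer with compute.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for a prompt: one K and one V vector per token, layer, and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal transfer time at line rate (no protocol overhead, no overlap with compute)."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


# Assumed shape for a Llama-3.1-405B-like model in bf16.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
prompt_32k = kv_cache_bytes(32_768, LAYERS, KV_HEADS, HEAD_DIM)

print(f"per token:  {per_token / 2**10:.0f} KiB")
print(f"32k prompt: {prompt_32k / 2**30:.2f} GiB")
print(f"transfer at 400 Gbit/s: {transfer_seconds(prompt_32k, 400):.2f} s")
```

Under these assumptions a 32k-token prompt carries roughly 16 GiB of KV cache, which is why an RDMA-capable fabric matters and why short prompts may not repay the hop.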

KServe (which Modelplane uses as its inference stack) already supports this. An optional `prefill` section on `LLMInferenceService` tells the controller to create separate Deployments for prefill and decode. KServe v0.18 ships decode templates that include the llm-d routing sidecar (which coordinates KV cache transfer via NIXL), and supports per-phase autoscaling via WVA (the llm-d Workload Variant Autoscaler).

I think the natural place to express disaggregation in Modelplane is split across two resources: engine configuration on ClusterModel (or Model), and the topology choice plus replica counts on ModelDeployment. The platform team declares that disagg is *available* for a model on certain hardware. The ML team *activates* it for their deployment.

The serving profile gains a `topology` discriminator, a `prefill` block with its own engine spec and per-pod resources, and optional per-profile `resources` for the decode side. Each phase's engine spec is complete and independent, with no merging and no shared base args. This follows the existing serving profile philosophy where each profile is a complete, tested configuration.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ClusterModel
metadata:
  name: llama-405b
spec:
  model:
    name: meta-llama/Llama-3.1-405B-Instruct
  source: HuggingFace
  huggingFace:
    repo: meta-llama/Llama-3.1-405B-Instruct
  resources:
    vram: "810Gi"

  serving:
  - name: vllm-disagg
    topology: PrefillDecode
    environmentSelector:
      matchLabels:
        modelplane.ai/rdma: "true"
    engine:
      name: vLLM
      image: vllm/vllm-openai:v0.9.1
      args:
      - "--tensor-parallel-size=4"
      - "--block-size=128"
      - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'
    resources:
      gpu: 4
      cpu: "16"
      memory: "64Gi"
    prefill:
      engine:
        name: vLLM
        image: vllm/vllm-openai:v0.9.1
        args:
        - "--tensor-parallel-size=1"
        - "--block-size=128"
        - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'
        - "--gpu-memory-utilization=0.9"
      resources:
        gpu: 1
        cpu: "8"
        memory: "16Gi"

  - name: vllm-unified
    engine:
      name: vLLM
      image: vllm/vllm-openai:v0.9.1
      args:
      - "--max-model-len=32768"
      - "--quantization=fp8"
```

On the ModelDeployment side, `topology` is explicit. It appears on both the serving profile and the deployment. The ML team opts in to disagg by setting `topology: PrefillDecode` and providing `prefill.scaling`. Both are required. There's no default ratio, because nothing else in the API has cross-resource defaulting and this shouldn't either. An ML team that doesn't set `topology: PrefillDecode` skips disagg profiles during matching and gets unified serving transparently.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-405b-prod
  namespace: ml-team-a
spec:
  modelRef:
    kind: ClusterModel
    name: llama-405b
  environments: 2
  topology: PrefillDecode
  scaling:
    signal: Fixed
    fixed:
      replicas: 2
  prefill:
    scaling:
      signal: Fixed
      fixed:
        replicas: 8
```
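The opt-in rule could be enforced at admission time. Below is a minimal sketch of that check over a plain dict shaped like the `spec` above. The function name, the error strings, the `Unified` default, and the rejection of a stray `prefill` block on a unified deployment are my assumptions, not settled API.

```python
VALID_TOPOLOGIES = ("Unified", "PrefillDecode")


def validate_deployment(spec):
    """Return a list of validation errors for a ModelDeployment spec (empty if valid)."""
    errors = []
    # Assumption: an omitted topology means Unified.
    topology = spec.get("topology", "Unified")
    if topology not in VALID_TOPOLOGIES:
        errors.append(f"topology must be one of {VALID_TOPOLOGIES}, got {topology!r}")
        return errors

    prefill = spec.get("prefill")
    if topology == "PrefillDecode":
        # No default P:D ratio: the ML team must size the prefill pool explicitly.
        if not (prefill or {}).get("scaling"):
            errors.append("prefill.scaling is required when topology is PrefillDecode")
    elif prefill is not None:
        # Assumption: a prefill block without the opt-in is a mistake, not a no-op.
        errors.append("prefill is only allowed when topology is PrefillDecode")
    return errors


spec = {
    "topology": "PrefillDecode",
    "scaling": {"signal": "Fixed", "fixed": {"replicas": 2}},
    "prefill": {"scaling": {"signal": "Fixed", "fixed": {"replicas": 8}}},
}
print(validate_deployment(spec))
```

Rejecting a half-specified deployment up front keeps the failure next to the author's mistake instead of surfacing later during profile matching.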

Profile matching gains a topology filter. A `PrefillDecode` deployment only matches `PrefillDecode` profiles, and a `Unified` deployment only matches `Unified` profiles. If the ML team asks for disagg but no environment has an RDMA-capable disagg profile, the deployment fails visibly rather than silently falling back to unified. I think this is the right behavior, because silent degradation from disagg to unified would be a surprising performance cliff.
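A sketch of what the filter might look like, to make the no-fallback behavior concrete. The `ServingProfile` class, `match_profile`, the exception name, and first-match ordering are all hypothetical; the real matcher will have more inputs than topology and environment labels.

```python
from dataclasses import dataclass, field


@dataclass
class ServingProfile:
    name: str
    topology: str = "Unified"  # assumption: profiles without a topology are Unified
    match_labels: dict = field(default_factory=dict)


class NoMatchingProfile(Exception):
    """Raised instead of falling back to a profile with a different topology."""


def match_profile(profiles, deployment_topology, env_labels):
    """Return the first profile whose topology and environment selector both match."""
    for profile in profiles:
        if profile.topology != deployment_topology:
            continue  # topology filter: never cross Unified <-> PrefillDecode
        if all(env_labels.get(k) == v for k, v in profile.match_labels.items()):
            return profile
    raise NoMatchingProfile(
        f"no {deployment_topology} profile matches environment labels {env_labels}"
    )


profiles = [
    ServingProfile("vllm-disagg", "PrefillDecode", {"modelplane.ai/rdma": "true"}),
    ServingProfile("vllm-unified"),
]

rdma_env = {"modelplane.ai/rdma": "true"}
print(match_profile(profiles, "PrefillDecode", rdma_env).name)
print(match_profile(profiles, "Unified", rdma_env).name)
```

Calling `match_profile(profiles, "PrefillDecode", {})` raises `NoMatchingProfile` rather than returning `vllm-unified`, which is the visible failure described above.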

This work should include a bump to KServe v0.18. The `prefill` field exists in v0.16 (what we ship today) but v0.18 is substantially more mature. It ships decode templates with the llm-d routing sidecar built in, so Modelplane doesn't need to inject it. v0.18 also adds WVA-based autoscaling on each `WorkloadSpec`, which handles the per-phase scaling problem (our current Envoy-metric KEDA approach only sees traffic hitting decode pods and can't scale prefill). I'd make concurrency-based autoscaling mutually exclusive with disagg for now. The current KEDA approach doesn't generalize, and WVA is a different scaling model that deserves its own design work.
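The mutual exclusion could be a second admission check. This is only a sketch: `Fixed` is the one signal name that appears in this issue, so `Concurrency` below is a placeholder for whatever the concurrency-based signal is actually called, and the function name and error text are made up.

```python
# Placeholder name for the concurrency-based signal; the real enum value may differ.
CONCURRENCY_SIGNAL = "Concurrency"


def check_scaling_signals(spec):
    """Reject concurrency-based autoscaling on either phase of a disagg deployment."""
    if spec.get("topology", "Unified") != "PrefillDecode":
        return []  # unified deployments keep today's behavior

    blocks = {
        "scaling": spec.get("scaling"),
        "prefill.scaling": (spec.get("prefill") or {}).get("scaling"),
    }
    errors = []
    for path, block in blocks.items():
        if (block or {}).get("signal") == CONCURRENCY_SIGNAL:
            errors.append(
                f"{path}.signal: {CONCURRENCY_SIGNAL} is not supported with topology PrefillDecode"
            )
    return errors


bad = {
    "topology": "PrefillDecode",
    "scaling": {"signal": "Concurrency"},
    "prefill": {"scaling": {"signal": "Fixed", "fixed": {"replicas": 8}}},
}
print(check_scaling_signals(bad))
```

Keeping this as its own check makes it easy to delete once WVA-based scaling has a design.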

ML teams who want different P/D engine configuration (different parallelism, different P:D ratio baked into the profile itself) can create a namespaced `Model` with their own disagg serving profile. That's the existing break-glass path and it works here without any special handling.

