What problem are you facing?
Prefill is compute-bound, decode is memory-bandwidth-bound. Running them on the same GPU pool means optimizing for neither. Modelplane has no way to express disaggregated serving today.
How could Modelplane help solve your problem?
LLM inference has two phases. Prefill processes all the input tokens in parallel. It's compute-bound and determines time-to-first-token. Decode generates output tokens one at a time. It's memory-bandwidth-bound and determines inter-token latency. When both phases share a GPU pool, a prefill burst interferes with in-flight decodes, causing unpredictable tail latency. You can't independently tune parallelism for each phase.
Disaggregation runs them on separate pools. A prefill instance processes the prompt, then transfers its KV cache (the intermediate state it computed for every input token) to a decode instance over RDMA. This lets you use different parallelism strategies per phase: for example, tensor parallelism of 1 (TP=1) on prefill replicas for throughput and TP=4 on decode replicas for latency. llm-d's benchmarks show 40-50% lower end-to-end latency for workloads with long input sequences (high input:output token ratio). It's not always the right choice: short prompts and small models don't benefit enough to justify the KV cache transfer overhead. But for large models with long contexts it's a significant win.
KServe (which Modelplane uses as its inference stack) already supports this. An optional prefill section on LLMInferenceService tells the controller to create separate Deployments for prefill and decode. KServe v0.18 ships decode templates that include the llm-d routing sidecar (which coordinates KV cache transfer via NIXL), and supports per-phase autoscaling via WVA.
I think the natural place to express disaggregation in Modelplane is split across two resources: engine configuration on ClusterModel (or Model), and the topology choice plus replica counts on ModelDeployment. The platform team declares that disagg is available for a model on certain hardware. The ML team activates it for their deployment.
The serving profile gains a topology discriminator, a prefill block with its own engine spec and per-pod resources, and optional per-profile resources for the decode side. Each phase's engine spec is complete and independent, with no merging and no shared base args. This follows the existing serving profile philosophy where each profile is a complete, tested configuration.
apiVersion: modelplane.ai/v1alpha1
kind: ClusterModel
metadata:
  name: llama-405b
spec:
  model:
    name: meta-llama/Llama-3.1-405B-Instruct
    source: HuggingFace
    huggingFace:
      repo: meta-llama/Llama-3.1-405B-Instruct
  resources:
    vram: "810Gi"
  serving:
    - name: vllm-disagg
      topology: PrefillDecode
      environmentSelector:
        matchLabels:
          modelplane.ai/rdma: "true"
      engine:
        name: vLLM
        image: vllm/vllm-openai:v0.9.1
        args:
          - "--tensor-parallel-size=4"
          - "--block-size=128"
          - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'
      resources:
        gpu: 4
        cpu: "16"
        memory: "64Gi"
      prefill:
        engine:
          name: vLLM
          image: vllm/vllm-openai:v0.9.1
          args:
            - "--tensor-parallel-size=1"
            - "--block-size=128"
            - '--kv-transfer-config={"kv_connector":"NixlConnector","kv_role":"kv_both"}'
            - "--gpu-memory-utilization=0.9"
        resources:
          gpu: 1
          cpu: "8"
          memory: "16Gi"
    - name: vllm-unified
      engine:
        name: vLLM
        image: vllm/vllm-openai:v0.9.1
        args:
          - "--max-model-len=32768"
          - "--quantization=fp8"
On the ModelDeployment side, topology is explicit. It appears on both the serving profile and the deployment. The ML team opts in to disagg by setting topology: PrefillDecode and providing prefill.scaling. Both are required. There's no default prefill:decode ratio, because nothing else in the API has cross-resource defaulting and this shouldn't either. A deployment that doesn't set topology: PrefillDecode skips disagg profiles during matching and gets unified serving transparently.
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-405b-prod
  namespace: ml-team-a
spec:
  modelRef:
    kind: ClusterModel
    name: llama-405b
  environments: 2
  topology: PrefillDecode
  scaling:
    signal: Fixed
    fixed:
      replicas: 2
  prefill:
    scaling:
      signal: Fixed
      fixed:
        replicas: 8
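To make the "both are required" rule concrete, here is a minimal Python sketch of how admission-time validation might look. The class and function names are hypothetical and not part of Modelplane. The check that rejects prefill.scaling on a Unified deployment is an assumption about the inverse case, which the proposal does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scaling:
    signal: str                     # e.g. "Fixed", as in the example above
    replicas: Optional[int] = None


@dataclass
class DeploymentSpec:
    # Hypothetical in-memory view of ModelDeployment.spec.
    topology: str = "Unified"                   # assumed default when omitted
    scaling: Optional[Scaling] = None           # decode (or unified) scaling
    prefill_scaling: Optional[Scaling] = None   # spec.prefill.scaling


def validate(spec: DeploymentSpec) -> List[str]:
    """Return a list of validation errors; an empty list means the spec is valid."""
    errors = []
    if spec.topology not in ("Unified", "PrefillDecode"):
        errors.append(f"unknown topology {spec.topology!r}")
    # Opting in to disagg requires an explicit prefill replica count (no default ratio).
    if spec.topology == "PrefillDecode" and spec.prefill_scaling is None:
        errors.append("topology PrefillDecode requires prefill.scaling")
    # Assumption: prefill.scaling without the opt-in is rejected rather than ignored.
    if spec.topology == "Unified" and spec.prefill_scaling is not None:
        errors.append("prefill.scaling requires topology: PrefillDecode")
    return errors
```

Returning a list of errors rather than raising on the first one lets a webhook report every problem in a single rejection.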
Profile matching gains a topology filter. A PrefillDecode deployment only matches PrefillDecode profiles, and a Unified deployment only matches Unified profiles. If the ML team asks for disagg but no environment has an RDMA-capable disagg profile, the deployment fails visibly rather than silently falling back to unified. I think this is the right behavior. Silent degradation from disagg to unified would be a surprising performance cliff.
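A rough Python sketch of the topology filter, assuming matching works per environment by comparing the profile's environmentSelector labels against environment labels. The types, the "first matching profile wins" rule, and the exception name are illustrative assumptions, not Modelplane's actual matcher.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Profile:
    name: str
    topology: str = "Unified"   # assumed default when the profile omits topology
    match_labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Environment:
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


class NoMatchingProfile(Exception):
    """Raised so the deployment fails visibly instead of falling back."""


def match_profiles(topology: str, profiles: List[Profile],
                   environments: List[Environment]) -> List[Tuple[str, str]]:
    """Return (environment, profile) pairs whose topology and labels both match."""
    matches = []
    for env in environments:
        for profile in profiles:
            if profile.topology != topology:
                continue  # topology filter: never cross Unified and PrefillDecode
            if all(env.labels.get(k) == v for k, v in profile.match_labels.items()):
                matches.append((env.name, profile.name))
                break  # assumption: first matching profile wins per environment
    if not matches:
        raise NoMatchingProfile(f"no environment has a {topology} profile")
    return matches
```

With the two profiles from the ClusterModel above, a PrefillDecode deployment would match only environments labelled modelplane.ai/rdma: "true", and would raise if there are none.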
This work should include a bump to KServe v0.18. The prefill field exists in v0.16 (what we ship today) but v0.18 is substantially more mature. It ships decode templates with the llm-d routing sidecar built in, so Modelplane doesn't need to inject it. v0.18 also adds WVA-based autoscaling on each WorkloadSpec, which handles the per-phase scaling problem (our current Envoy-metric KEDA approach only sees traffic hitting decode pods and can't scale prefill). I'd make concurrency-based autoscaling mutually exclusive with disagg for now. The current KEDA approach doesn't generalize, and WVA is a different scaling model that deserves its own design work.
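The mutual-exclusion rule could be enforced with a check along these lines. This is only a sketch: "Fixed" is the signal used in the example above, while the "Concurrency" signal name and the function itself are assumptions made for illustration.

```python
from typing import Optional


def check_scaling_signals(topology: str, decode_signal: str,
                          prefill_signal: Optional[str] = None) -> None:
    """Reject concurrency-based autoscaling on either phase of a disagg deployment."""
    if topology != "PrefillDecode":
        return  # unified deployments keep the existing autoscaling options
    for phase, signal in (("decode", decode_signal), ("prefill", prefill_signal)):
        # "Concurrency" is a hypothetical name for the Envoy-metric KEDA signal.
        if signal == "Concurrency":
            raise ValueError(
                f"{phase}: concurrency-based autoscaling is not supported "
                "with topology PrefillDecode"
            )
```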
ML teams who want different P/D engine configuration (different parallelism, different P:D ratio baked into the profile itself) can create a namespaced Model with their own disagg serving profile. That's the existing break-glass path and it works here without any special handling.