Multiple primitives need the same thing: typed signals captured at the workload plane, aggregated at the control plane, consumed by either a scheduler primitive or a user-facing surface. Each will rebuild the same plumbing if we don't define it once.
This is the v0.1 substrate. #77 is the parallel v0.2 primitive for app-observability traces — both share OTLP as the transport (one workload-plane agent emits, one control-plane collector fans out to two different consumers).
Why fleet-level
Per-cluster engines (vLLM, SGLang, Dynamo) and per-cluster gateways already emit most of this data. Modelplane's job is the rollup and cross-cluster comparison, not the emission. A single cluster's P99 TTFT is a deployment concern; "P99 TTFT in eu-west-1 is 3× us-east-1, the cluster is degrading" is a fleet concern. Same for capacity, cost, prefix overlap, failure rates.
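As a rough illustration of the fleet-level comparison described above, the sketch below flags clusters whose P99 TTFT is far above the fleet median. The function name `flag_outliers`, the 3× threshold, and the sample numbers are all assumptions made for illustration; nothing here is an existing Modelplane API.

```python
from statistics import median


def flag_outliers(p99_ttft_ms: dict[str, float], factor: float = 3.0) -> list[str]:
    """Return clusters whose P99 TTFT is at least `factor` times the fleet median."""
    baseline = median(p99_ttft_ms.values())
    return sorted(
        cluster for cluster, value in p99_ttft_ms.items() if value >= factor * baseline
    )


# Hypothetical per-cluster P99 TTFT values in milliseconds.
fleet = {"us-east-1": 200.0, "us-west-2": 210.0, "eu-west-1": 640.0}
print(flag_outliers(fleet))  # ['eu-west-1']
```

Each cluster can compute its own P99 locally; the comparison only becomes possible once the values sit side by side at the control plane.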
Consumers and emitters on the bus
| Primitive | Emits | Consumes |
| --- | --- | --- |
| #70 capacity signal | (none, pure consumer) | Per-cluster GPU availability for the federation matcher |
| #71 routing affinity | (none, pure consumer) | Cross-cluster prefix-hash overlap so the gateway picks the warm cluster |
| #48 overflow | (none, pure consumer) | Aggregate queue depth + cost to decide when to spill |
| #66 ModelCache | Hydration latency, bytes staged, per-cluster ready state | (none in v0.1) |
| #72 KVOffloadTier | Per-tier hit rate, eviction rate, capacity util | (none in v0.1) |
| Future | Spec-dec acceptance, quality drift, cost samples, model-recall | Cost-aware placement, drift detector, intent-based serving SLAs (ttft.p99 on ModelService) |
Transport: OTLP
OTel + OTLP is the transport. The industry has converged on it: vLLM, Triton, LiteLLM, llm-d, and the major app-observability stacks all speak OTLP. One workload-plane agent emits via OTLP; the control-plane collector fans out to two different consumers: this issue (sketches for operator metrics) and #77 (raw spans to the user's tracing backend).
Sketch
No new CRDs. Extend the existing InferenceCluster and ModelService shapes; the composition function renders the OTel agent + scrape configs from declarative intent.
Per-cluster signal policy lives on the cluster (the thing that actually emits):
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
spec:
  # ... existing kubeconfig, pool→class mapping, etc ...
  signals:
    kinds:
      - { name: RequestLifecycle, sampleRate: 1.0 }  # TTFT, TPS, queue depth, response codes, prefix hit rate
      - { name: PrefixHash, sampleRate: 0.01 }       # cross-cluster overlap + top-K samples
      - { name: Capacity }                           # GPU availability, KV tier util, ModelCache bytes staged
      - { name: ColdStart }                          # GPU procurement, image load, model load, engine startup, cache hydration
      - { name: Cost }                               # $/GPU-hour × usage → derived $/token
      - { name: Reliability }                        # failure events per (cluster, SKU, root-cause)
    privacy:
      scope: PerTenant  # | Fleet (cross-tenant rollups require opt-in)
    retention:
      sketches: 30d
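The issue does not say how `sampleRate` is applied. One option, sketched here purely as an assumption, is deterministic hash-based sampling: every cluster keeps the same subset of prefix hashes, so cross-cluster overlap can still be computed from the samples. The function name `sampled` and the choice of SHA-256 are illustrative only.

```python
import hashlib


def sampled(prefix_hash: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all prefix hashes.

    The decision depends only on the key, so every cluster keeps the same keys.
    """
    digest = hashlib.sha256(prefix_hash.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# With the cluster default of 0.01, about 1% of distinct prefixes are kept.
kept = sum(sampled(f"prefix-{i}", 0.01) for i in range(100_000))
print(kept)
```

Independent random sampling at 1% per cluster would leave almost no common keys between two clusters, which is why a key-derived decision is used in this sketch.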
Per-service overrides live under observability (parallel to #77's observability.traces):
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
spec:
  # ... existing routes, endpoints, etc ...
  observability:
    metrics:
      scope: Fleet                 # opt-in to cross-tenant aggregation
      sampleRate: 0.1              # override cluster default
      extraSignals: [QualityDrift, SpecDecAcceptance]
    traces:                        # filed in #77, shown here for shape parallel
      enabled: true
      endpoint: https://cloud.langfuse.com/api/public/otel
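A minimal sketch of how the composition function might resolve per-service overrides against cluster defaults. The precedence rules are assumptions, not settled design: a kind with no `sampleRate` defaults to 1.0, a service-level `sampleRate` replaces the rate for every kind, and `extraSignals` inherit the service rate. `effective_policy` is a hypothetical helper name.

```python
from typing import Optional


def effective_policy(
    cluster_kinds: dict[str, Optional[float]],
    cluster_scope: str,
    service_metrics: dict,
) -> dict:
    """Merge InferenceCluster signal defaults with ModelService overrides."""
    override_rate = service_metrics.get("sampleRate")
    rates: dict[str, float] = {}
    for name, rate in cluster_kinds.items():
        default = 1.0 if rate is None else rate  # assumption: omitted rate means 1.0
        rates[name] = default if override_rate is None else override_rate
    for name in service_metrics.get("extraSignals", []):
        # Assumption: extra signals use the service rate, else sample everything.
        rates.setdefault(name, 1.0 if override_rate is None else override_rate)
    return {
        "scope": service_metrics.get("scope", cluster_scope),
        "sampleRates": rates,
    }


cluster_kinds = {"RequestLifecycle": 1.0, "PrefixHash": 0.01, "Capacity": None}
service = {
    "scope": "Fleet",
    "sampleRate": 0.1,
    "extraSignals": ["QualityDrift", "SpecDecAcceptance"],
}
policy = effective_policy(cluster_kinds, "PerTenant", service)
print(policy["scope"], policy["sampleRates"])
```

Whether a single service-level `sampleRate` should really override `PrefixHash` as well is an open question that the spec would need to settle.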
Status surfaces what's flowing — InferenceCluster.status.signals for operational health (last flush, dropped events, enabled kinds), ModelService.status.observability.metrics for aggregated metrics the user actually consumes (prefix coverage, fleet hit rate, per-cluster breakdowns).
Implementation architecture
Workload plane                               Control plane
─────────────────────                        ──────────────────────────
[engine pods]
[gateway pods]    ──OTLP──>  [OTel collector]
[ModelCache CRs]  ──OTLP──>  (single agent        ──>  [aggregator]
[KVOffloadTier]   ──OTLP──>   per cluster,        ──>  (sketches:
[HotPrefixPool]   ──OTLP──>   configured by             Count-Min,
                              InferenceCluster.         HLL++,
                              spec.signals)             t-digest;
                                                        rolling 30d)
                                   │
                                   └──>  [tracing fanout, #77]
                                         (raw spans →
                                          user's Langfuse/
                                          Langsmith/etc.)
One workload-plane OTel agent per InferenceCluster. Engines/gateway/cache controllers emit via OTLP to the agent; agent forwards to the control-plane collector; collector fans out to (a) the sketch-based aggregator that exposes signals to other primitives via API, and (b) the tracing fanout (#77) that forwards raw spans to the user's configured backend.
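To show why sketches suit this aggregator, here is a small Count-Min sketch whose per-cluster instances merge by element-wise addition. It is a teaching sketch rather than the proposed implementation: the width, depth, BLAKE2b row hashing, and key names are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Mergeable frequency sketch; estimates may overcount but never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions to merge")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Each cluster builds its own sketch; the control plane merges them.
us_east = CountMinSketch()
eu_west = CountMinSketch()
us_east.add("prefix:sys-prompt-a", 40)
eu_west.add("prefix:sys-prompt-a", 25)
eu_west.add("prefix:sys-prompt-b", 5)
fleet = us_east.merge(eu_west)
print(fleet.estimate("prefix:sys-prompt-a"))  # at least 65; Count-Min can overestimate
```

Because merging is plain addition, the aggregator only needs fixed-size arrays from each cluster, never the underlying events.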
Initial signal types in scope for v0.1
The minimal set that unblocks the existing consumers:
- RequestLifecycle: TTFT, TPS, end-to-end latency P50/P90/P99, response codes, queue depth, prefix-cache hit rate, prefix-hash samples. Kiely §7.4.3 + §5.3.3.
- Capacity: per-cluster GPU availability by SKU, KV cache utilization per tier (HBM / CPU / SSD / network), replica counts (active + starting), ModelCache bytes staged per cluster. Covers #70; cache-family capacity emission lands here.
- ColdStart: a 5-phase breakdown (GPU procurement, image loading, model loading, engine startup, ModelCache hydration). Kiely §7.2.2.
- Cost: $/GPU-hour per cluster × GPU-hours consumed → derived $/token per service. Kiely §7.4.2.
- Reliability: failure events per (cluster, SKU, root-cause). The Llama 3 paper (Kiely §7.3.3) reports 1 failure per 50K GPU-hrs; a single 8-GPU inference node accrues about 70K GPU-hrs/year (8 × 8,760 h), so it should expect more than one failure per year.
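The Cost and Reliability arithmetic above can be written out directly. The 8-GPU node size is an assumption chosen because it reproduces the roughly 70K GPU-hrs/year figure; the price and token counts in the usage lines are invented for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def dollars_per_token(gpu_hour_price: float, gpu_hours: float, tokens: int) -> float:
    """Derived $/token = ($/GPU-hour x GPU-hours consumed) / tokens served."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return gpu_hour_price * gpu_hours / tokens


def expected_failures_per_year(
    gpus_per_node: int, gpu_hours_per_failure: float = 50_000
) -> float:
    """Expected failures for one always-on node at the quoted failure rate."""
    return gpus_per_node * HOURS_PER_YEAR / gpu_hours_per_failure


print(8 * HOURS_PER_YEAR)                          # 70080 GPU-hrs/year for an 8-GPU node
print(expected_failures_per_year(8))               # about 1.4 failures per node per year
print(dollars_per_token(2.0, 100.0, 50_000_000))   # $2/GPU-hr, 100 GPU-hrs, 50M tokens
```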
Additive signal types (later, same bus)
File as separate issues when consumers materialize:
- Spec-dec acceptance rate (Kiely §5.2)
- Quality / drift samples (pairs with future ProvenanceLedger / ModelRecall)
- Disagg-specific (prefill queue, decode KV exhaustion, xPyD efficiency) — Kiely §5.5
- Per-tenant rollups (TPM/RPM, $/tenant, SLO compliance)
Privacy defaults
- Per-tenant aggregation by default; cross-tenant rollups require explicit opt-in via observability.metrics.scope: Fleet
- Sketches, not raw events, travel on this bus: no raw prompts or PII leave the cluster (raw prompt/response data is #77's territory, gated by the user's tracing backend choice)
- Optional sampled-trace export to the tenant's own object store for offline analysis
- Hardware-class namespacing where relevant (KV signals keyed by GPU family)
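A minimal sketch of how these defaults could shape rollup keys, assuming each event carries a tenant, a service name, and a GPU family. The `rollup` helper, the `"fleet"` pseudo-tenant, and the sample events are hypothetical; the point is only that cross-tenant totals accumulate solely for services that opted in.

```python
from collections import Counter
from typing import Iterable


def rollup(
    events: Iterable[tuple[str, str, str, int]], fleet_opt_in: set[str]
) -> Counter:
    """Aggregate hit counts per (tenant, gpu_family).

    Events are (tenant, service, gpu_family, hits). A cross-tenant "fleet" total
    is added only for services whose metrics scope is Fleet.
    """
    totals: Counter = Counter()
    for tenant, service, gpu_family, hits in events:
        totals[(tenant, gpu_family)] += hits
        if service in fleet_opt_in:
            totals[("fleet", gpu_family)] += hits
    return totals


events = [
    ("acme", "chat", "H100", 5),
    ("globex", "search", "H100", 7),
    ("acme", "batch", "A100", 2),
]
totals = rollup(events, fleet_opt_in={"chat"})
print(totals[("fleet", "H100")])  # 5: only the opted-in service contributes
```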
Companion: fleet metrics exposure (separate issue)
User-facing surface (CLI, status fields, Prometheus federation, Grafana dashboards) is its own issue when we're ready to commit to UX. Different shape from the optimization primitives that consume the same signals internally. Track separately so v0.1 ships the substrate without committing to the full product surface.
Out of scope for v0.1
- User-facing analytics CLI / dashboards (companion issue, v0.2)
- Raw per-request spans for app observability (that's #77, v0.2 on the same OTLP substrate)
- The optimizer primitives that consume the signals (HotPrefixPool auto-discovery, cost-aware placement, drift detector — all v0.2+ on this bus)
Related issues
- #77 — v0.2 parallel primitive; OTel traces from the gateway to user's tracing backend, sharing OTLP transport
- #70, #71, #48 — pure consumers of bus signals
- #66 ModelCache, #72 KVOffloadTier
- PR #64 — design doc; "aggregate gateway metrics" implicit in the KEDA autoscaling story, formalized here