Fleet signal bus: capture layer for cross-cluster signals (v0.1) #74

Multiple primitives need the same thing: typed signals captured at the workload plane, aggregated at the control plane, consumed by either a scheduler primitive or a user-facing surface. Each will rebuild the same plumbing if we don't define it once.

This is the v0.1 substrate. [#77](https://github.com/modelplaneai/modelplane/issues/77) is the parallel v0.2 primitive for app-observability traces — both share OTLP as the transport (one workload-plane agent emits, one control-plane collector fans out to two different consumers).

## Why fleet-level

Per-cluster engines (vLLM, SGLang, Dynamo) and per-cluster gateways already emit most of this data. **Modelplane's job is the rollup and cross-cluster comparison, not the emission.** A single cluster's P99 TTFT is a deployment concern; "P99 TTFT in eu-west-1 is 3× us-east-1, the cluster is degrading" is a fleet concern. Same for capacity, cost, prefix overlap, failure rates.
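To make the fleet-level comparison concrete, here is a minimal sketch of the kind of check a consumer could run over per-cluster rollups. The function name `degraded_clusters`, the use of the fleet median as the baseline, and the 3× threshold are illustrative assumptions, not part of the proposal.

```python
from statistics import median
from typing import Dict, List


def degraded_clusters(p99_ttft_ms: Dict[str, float], ratio: float = 3.0) -> List[str]:
    """Return clusters whose P99 TTFT is at least `ratio` times the fleet median.

    Hypothetical helper: the baseline (median) and the default ratio are assumptions.
    """
    if not p99_ttft_ms:
        return []
    baseline = median(p99_ttft_ms.values())
    if baseline <= 0:
        return []
    return sorted(name for name, value in p99_ttft_ms.items() if value >= ratio * baseline)


# Made-up values for illustration only.
fleet = {"us-east-1": 200.0, "us-west-2": 220.0, "eu-west-1": 700.0}
print(degraded_clusters(fleet))  # ['eu-west-1']
```

No single cluster can compute this on its own, which is why the comparison belongs at the control plane.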

## Consumers and emitters on the bus

| Primitive | Emits | Consumes |
|---|---|---|
| [#70 capacity signal](https://github.com/modelplaneai/modelplane/issues/70) | (none — pure consumer) | Per-cluster GPU availability for the federation matcher |
| [#71 routing affinity](https://github.com/modelplaneai/modelplane/issues/71) | (none — pure consumer) | Cross-cluster prefix-hash overlap so the gateway picks the warm cluster |
| [#48 overflow](https://github.com/modelplaneai/modelplane/issues/48) | (none — pure consumer) | Aggregate queue depth + cost to decide when to spill |
| [#66 ModelCache](https://github.com/modelplaneai/modelplane/issues/66) | Hydration latency, bytes staged, per-cluster ready state | (none in v0.1) |
| [#72 KVOffloadTier](https://github.com/modelplaneai/modelplane/issues/72) | Per-tier hit rate, eviction rate, capacity util | (none in v0.1) |
| **Future** | Spec-dec acceptance, quality drift, cost samples, model-recall | Cost-aware placement, drift detector, intent-based serving SLAs (`ttft.p99` on `ModelService`) |
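As an illustration of how a pure consumer such as routing affinity might read from the bus, the sketch below picks the cluster whose sampled prefix hashes overlap most with an incoming request. The function name, the set-of-strings representation, and the tie-break are assumptions; the actual signal would likely be a sketch rather than exact sets.

```python
from typing import Dict, Optional, Set


def warmest_cluster(request_hashes: Set[str], cluster_hashes: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the cluster whose sampled prefix hashes overlap most with the request."""
    best_name, best_overlap = None, 0
    for name in sorted(cluster_hashes):  # sorted for a stable tie-break
        overlap = len(request_hashes & cluster_hashes[name])
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name  # None means no cluster is warm for this request


# Made-up prefix hashes for illustration only.
sampled = {
    "us-east-1": {"h1", "h2", "h3"},
    "eu-west-1": {"h3", "h4"},
}
print(warmest_cluster({"h1", "h2", "h9"}, sampled))  # us-east-1
```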

## Transport: OTLP

OTel + OTLP is the transport. The industry has largely converged on it: vLLM, Triton, LiteLLM, llm-d, and the common app-observability stacks all speak OTLP. One workload-plane agent emits via OTLP; the control-plane collector fans out to two consumers: this issue (sketches for operator metrics) and [#77](https://github.com/modelplaneai/modelplane/issues/77) (raw spans to the user's tracing backend).

## Sketch

No new CRDs. Extend the existing `InferenceCluster` and `ModelService` shapes; the composition function renders the OTel agent + scrape configs from declarative intent.

Per-cluster signal policy lives on the cluster (the thing that actually emits):

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
spec:
  # ... existing kubeconfig, pool→class mapping, etc ...
  signals:
    kinds:
      - { name: RequestLifecycle, sampleRate: 1.0 }    # TTFT, TPS, queue depth, response codes, prefix hit rate
      - { name: PrefixHash, sampleRate: 0.01 }         # cross-cluster overlap + top-K samples
      - { name: Capacity }                             # GPU availability, KV tier util, ModelCache bytes staged
      - { name: ColdStart }                            # GPU procurement, image load, model load, engine startup, cache hydration
      - { name: Cost }                                 # $/GPU-hour × usage → derived $/token
      - { name: Reliability }                          # failure events per (cluster, SKU, root-cause)
    privacy:
      scope: PerTenant                                 # | Fleet (cross-tenant rollups require opt-in)
    retention:
      sketches: 30d
```
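The issue does not say how `sampleRate` is applied. One plausible reading is a deterministic, hash-based decision so that the same request is kept or dropped consistently across emitters. The sketch below assumes that approach; `should_sample` and the request-ID keying are hypothetical.

```python
import hashlib


def should_sample(kind: str, request_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of requests for one signal kind."""
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    digest = hashlib.sha256(f"{kind}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


# RequestLifecycle at 1.0 is always kept; PrefixHash at 0.01 keeps about 1 in 100.
print(should_sample("RequestLifecycle", "req-123", 1.0))
kept = sum(should_sample("PrefixHash", f"req-{i}", 0.01) for i in range(10_000))
print(kept)
```

Hashing on the kind plus the request ID keeps the decision stable across retries and replicas without any shared state.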

Per-service overrides live under `observability` (parallel to [#77](https://github.com/modelplaneai/modelplane/issues/77)'s `observability.traces`):

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
spec:
  # ... existing routes, endpoints, etc ...
  observability:
    metrics:
      scope: Fleet                                     # opt-in to cross-tenant aggregation
      sampleRate: 0.1                                  # override cluster default
      extraSignals: [QualityDrift, SpecDecAcceptance]
    traces:                                            # filed in #77, shown here for shape parallel
      enabled: true
      endpoint: https://cloud.langfuse.com/api/public/otel
```
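How the service-level block combines with the cluster policy is not spelled out above. The following sketch shows one possible resolution order, assuming that a service-level `sampleRate` replaces the rate of every cluster kind, that `extraSignals` default to a rate of 1.0 when no override is set, and that an unset `scope` falls back to the cluster's. All names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceMetrics:
    """Mirror of ModelService.spec.observability.metrics (field names assumed)."""
    scope: Optional[str] = None
    sample_rate: Optional[float] = None
    extra_signals: List[str] = field(default_factory=list)


def effective_policy(cluster_rates: Dict[str, float], cluster_scope: str, svc: ServiceMetrics) -> dict:
    """Merge cluster defaults with per-service overrides (assumed semantics)."""
    if svc.sample_rate is None:
        rates = dict(cluster_rates)
    else:
        rates = {name: svc.sample_rate for name in cluster_rates}
    extra_rate = svc.sample_rate if svc.sample_rate is not None else 1.0
    for name in svc.extra_signals:
        rates.setdefault(name, extra_rate)
    return {"scope": svc.scope or cluster_scope, "sampleRates": rates}


cluster_rates = {"RequestLifecycle": 1.0, "PrefixHash": 0.01, "Capacity": 1.0}
svc = ServiceMetrics(scope="Fleet", sample_rate=0.1, extra_signals=["QualityDrift", "SpecDecAcceptance"])
print(effective_policy(cluster_rates, "PerTenant", svc))
```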

Status surfaces what's flowing — `InferenceCluster.status.signals` for operational health (last flush, dropped events, enabled kinds), `ModelService.status.observability.metrics` for aggregated metrics the user actually consumes (prefix coverage, fleet hit rate, per-cluster breakdowns).

## Implementation architecture

```
Workload plane                                          Control plane
──────────────────────────────────────────────────      ──────────────────────────────────────────────
[engine pods]     ──OTLP──┐
[gateway pods]    ──OTLP──┤                                          ┌──> [aggregator]
[ModelCache CRs]  ──OTLP──┼──> [OTel agent]  ──OTLP──>  [collector] ─┤      (sketches: Count-Min,
[KVOffloadTier]   ──OTLP──┤     (one per cluster,                    │       HLL++, t-digest;
[HotPrefixPool]   ──OTLP──┘      configured by                       │       rolling 30d)
                                 InferenceCluster.                   │
                                 spec.signals)                       └──> [tracing fanout, #77]
                                                                            (raw spans → user's
                                                                             Langfuse/Langsmith/etc.)
```

One workload-plane OTel agent per `InferenceCluster`. Engines/gateway/cache controllers emit via OTLP to the agent; agent forwards to the control-plane collector; collector fans out to (a) the sketch-based aggregator that exposes signals to other primitives via API, and (b) the tracing fanout (#77) that forwards raw spans to the user's configured backend.
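The reason sketches suit a cross-cluster aggregator is that they merge: each cluster can build its own summary and the control plane can combine them without seeing raw events. Below is a minimal Count-Min sketch illustrating that property. It is a teaching sketch under assumed parameters (width, depth, BLAKE2b row hashing), not the aggregator's actual implementation.

```python
import hashlib
from typing import Iterator, Tuple


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str) -> Iterator[Tuple[int, int]]:
        # One independent hash per row, derived by salting BLAKE2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=row.to_bytes(8, "big")
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._indexes(key):
            self.rows[row][col] += count

    def estimate(self, key: str) -> int:
        return min(self.rows[row][col] for row, col in self._indexes(key))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        """Combine two sketches of identical shape by summing cells."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match to merge")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two clusters count prefix hashes locally; the control plane merges them.
us_east, eu_west = CountMinSketch(), CountMinSketch()
us_east.add("prefix-a", 5)
eu_west.add("prefix-a", 3)
eu_west.add("prefix-b", 2)
fleet = us_east.merge(eu_west)
print(fleet.estimate("prefix-a"))  # at least 8; exact unless hash cells collide
```

HLL++ and t-digest share this mergeability for cardinality and quantile estimates respectively, which is what lets the rollup stay cheap as clusters are added.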

## Initial signal types in scope for v0.1

The minimal set that unblocks the existing consumers:

1. **`RequestLifecycle`**: TTFT, TPS, end-to-end latency (P50/P90/P99), response codes, queue depth, prefix-cache hit rate, prefix-hash samples. Kiely §7.4.3 + §5.3.3.
2. **`Capacity`**: per-cluster GPU availability by SKU, KV cache utilization per tier (HBM / CPU / SSD / network), replica counts (active + starting), ModelCache bytes staged per cluster. Covers [#70](https://github.com/modelplaneai/modelplane/issues/70); cache-family capacity emission lands here.
3. **`ColdStart`**: five-phase breakdown (GPU procurement, image loading, model loading, engine startup, ModelCache hydration). Kiely §7.2.2.
4. **`Cost`**: $/GPU-hour per cluster × GPU-hours consumed → derived $/token per service. Kiely §7.4.2.
5. **`Reliability`**: failure events per `(cluster, SKU, root-cause)`. The Llama 3 paper (Kiely §7.3.3) reports roughly 1 failure per 50K GPU-hrs; a single inference node accrues about 70K GPU-hrs/year (8 GPUs × 8,760 h), so even one node should expect more than one failure per year.
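The arithmetic behind the `Cost` and `Reliability` items can be sketched as follows. The function names and the example price and token volume are made up for illustration; the 50K GPU-hrs-per-failure figure is the one cited above, and the 8-GPU node is an assumption consistent with the 70K GPU-hrs/year figure.

```python
def cost_per_token(gpu_hour_rate_usd: float, gpu_hours: float, tokens_served: int) -> float:
    """Derived $/token: ($/GPU-hour x GPU-hours consumed) / tokens served."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    return gpu_hour_rate_usd * gpu_hours / tokens_served


def expected_failures_per_year(
    gpus_per_node: int,
    gpu_hours_per_failure: float = 50_000,  # figure cited in the Reliability item
    hours_per_year: int = 8_760,
) -> float:
    """Expected failures per node-year at a constant failure rate."""
    return gpus_per_node * hours_per_year / gpu_hours_per_failure


# Hypothetical service: $2.50/GPU-hour, 100 GPU-hours, 500M tokens.
per_token = cost_per_token(2.50, 100, 500_000_000)
print(f"${per_token * 1_000_000:.2f} per million tokens")  # $0.50 per million tokens

# An 8-GPU node accrues 70,080 GPU-hours per year.
print(round(expected_failures_per_year(8), 2))  # 1.4
```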

## Additive signal types (later, same bus)

File as separate issues when consumers materialize:

- Spec-dec acceptance rate (Kiely §5.2)
- Quality / drift samples (pairs with future ProvenanceLedger / ModelRecall)
- Disagg-specific (prefill queue, decode KV exhaustion, xPyD efficiency) — Kiely §5.5
- Per-tenant rollups (TPM/RPM, $/tenant, SLO compliance)

## Privacy defaults

- Per-tenant aggregation by default; cross-tenant rollups require explicit opt-in via `observability.metrics.scope: Fleet`
- Sketches, not raw events, on this bus: no raw prompts or PII leave the cluster (raw prompt/response data is [#77](https://github.com/modelplaneai/modelplane/issues/77)'s territory, gated by the user's tracing backend choice)
- Optional sampled-trace export to the tenant's own object store for offline analysis
- Hardware-class namespacing where relevant (KV signals keyed by GPU family)
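The scoping default can be illustrated with a small rollup: every sample counts toward its own tenant, and only services that opted in with `scope: Fleet` contribute to the cross-tenant total. This is a sketch under assumed data shapes; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def rollup_request_counts(
    samples: Iterable[Tuple[str, str, int]],  # (tenant, service, request count)
    fleet_services: Set[str],                 # services with metrics.scope == "Fleet"
) -> Tuple[Dict[str, int], int]:
    """Aggregate per tenant by default; add to the fleet total only on opt-in."""
    per_tenant: Dict[str, int] = defaultdict(int)
    fleet_total = 0
    for tenant, service, count in samples:
        per_tenant[tenant] += count
        if service in fleet_services:
            fleet_total += count
    return dict(per_tenant), fleet_total


# Made-up tenants and services for illustration only.
samples = [("acme", "chat", 10), ("acme", "search", 5), ("globex", "support-bot", 7)]
per_tenant, fleet_total = rollup_request_counts(samples, fleet_services={"chat"})
print(per_tenant)   # {'acme': 15, 'globex': 7}
print(fleet_total)  # 10
```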

## Companion: fleet metrics exposure (separate issue)

The user-facing surface (CLI, status fields, Prometheus federation, Grafana dashboards) gets its own issue once we're ready to commit to UX. It has a different shape from the optimization primitives that consume the same signals internally, so track it separately and let v0.1 ship the substrate without committing to the full product surface.

## Out of scope for v0.1

- User-facing analytics CLI / dashboards (companion issue, v0.2)
- Raw per-request spans for app observability (that's [#77](https://github.com/modelplaneai/modelplane/issues/77), v0.2 on the same OTLP substrate)
- The optimizer primitives that consume the signals (HotPrefixPool auto-discovery, cost-aware placement, drift detector — all v0.2+ on this bus)

## Related issues

- [#77](https://github.com/modelplaneai/modelplane/issues/77) — v0.2 parallel primitive; OTel traces from the gateway to user's tracing backend, sharing OTLP transport
- [#70](https://github.com/modelplaneai/modelplane/issues/70), [#71](https://github.com/modelplaneai/modelplane/issues/71), [#48](https://github.com/modelplaneai/modelplane/issues/48) — pure consumers of bus signals
- [#66 ModelCache](https://github.com/modelplaneai/modelplane/issues/66), [#72 KVOffloadTier](https://github.com/modelplaneai/modelplane/issues/72) (emitters on the bus; no consumption in v0.1)
- [PR #64](https://github.com/modelplaneai/modelplane/pull/64) — design doc; "aggregate gateway metrics" implicit in the KEDA autoscaling story, formalized here

