Fleet signal bus: capture layer for cross-cluster signals (v0.1) #74

Description

@dennis-upbound

Multiple primitives need the same thing: typed signals captured at the workload plane, aggregated at the control plane, consumed by either a scheduler primitive or a user-facing surface. Each will rebuild the same plumbing if we don't define it once.

This is the v0.1 substrate. #77 is the parallel v0.2 primitive for app-observability traces — both share OTLP as the transport (one workload-plane agent emits, one control-plane collector fans out to two different consumers).

Why fleet-level

Per-cluster engines (vLLM, SGLang, Dynamo) and per-cluster gateways already emit most of this data. Modelplane's job is the rollup and cross-cluster comparison, not the emission. A single cluster's P99 TTFT is a deployment concern; "P99 TTFT in eu-west-1 is 3× us-east-1, the cluster is degrading" is a fleet concern. Same for capacity, cost, prefix overlap, failure rates.

Consumers and emitters on the bus

| Primitive | Emits | Consumes |
|---|---|---|
| #70 capacity signal | (none; pure consumer) | Per-cluster GPU availability for the federation matcher |
| #71 routing affinity | (none; pure consumer) | Cross-cluster prefix-hash overlap so the gateway picks the warm cluster |
| #48 overflow | (none; pure consumer) | Aggregate queue depth + cost to decide when to spill |
| #66 ModelCache | Hydration latency, bytes staged, per-cluster ready state | (none in v0.1) |
| #72 KVOffloadTier | Per-tier hit rate, eviction rate, capacity util | (none in v0.1) |
| Future | Spec-dec acceptance, quality drift, cost samples, model-recall | Cost-aware placement, drift detector, intent-based serving SLAs (ttft.p99 on ModelService) |

Transport: OTLP

OTel + OTLP is the transport. The industry has largely converged on it (vLLM, Triton, LiteLLM, llm-d, and most app-observability stacks speak OTLP). One workload-plane agent emits via OTLP; the control-plane collector fans out to two different consumers: this issue (sketches for operator metrics) and #77 (raw spans to the user's tracing backend).

Sketch

No new CRDs. Extend the existing InferenceCluster and ModelService shapes; the composition function renders the OTel agent + scrape configs from declarative intent.

Per-cluster signal policy lives on the cluster (the thing that actually emits):

apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
spec:
  # ... existing kubeconfig, pool→class mapping, etc ...
  signals:
    kinds:
      - { name: RequestLifecycle, sampleRate: 1.0 }    # TTFT, TPS, queue depth, response codes, prefix hit rate
      - { name: PrefixHash, sampleRate: 0.01 }         # cross-cluster overlap + top-K samples
      - { name: Capacity }                             # GPU availability, KV tier util, ModelCache bytes staged
      - { name: ColdStart }                            # GPU procurement, image load, model load, engine startup, cache hydration
      - { name: Cost }                                 # $/GPU-hour × usage → derived $/token
      - { name: Reliability }                          # failure events per (cluster, SKU, root-cause)
    privacy:
      scope: PerTenant                                 # | Fleet (cross-tenant rollups require opt-in)
    retention:
      sketches: 30d

Per-service overrides live under observability (parallel to #77's observability.traces):

apiVersion: modelplane.ai/v1alpha1
kind: ModelService
spec:
  # ... existing routes, endpoints, etc ...
  observability:
    metrics:
      scope: Fleet                                     # opt-in to cross-tenant aggregation
      sampleRate: 0.1                                  # override cluster default
      extraSignals: [QualityDrift, SpecDecAcceptance]
    traces:                                            # filed in #77, shown here for shape parallel
      enabled: true
      endpoint: https://cloud.langfuse.com/api/public/otel

Status surfaces what's flowing — InferenceCluster.status.signals for operational health (last flush, dropped events, enabled kinds), ModelService.status.observability.metrics for aggregated metrics the user actually consumes (prefix coverage, fleet hit rate, per-cluster breakdowns).
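To make the two status surfaces concrete, here is a rough sketch of what they might look like. Every field name under status (lastFlush, droppedEvents, enabledKinds, prefixCoverage, fleetHitRate, perCluster) is an assumption made for illustration, not a committed schema, and all values are placeholders rather than measurements.

```yaml
# Illustrative shape only: field names under status are assumptions,
# and every value is a placeholder, not a measurement.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceCluster
status:
  signals:
    lastFlush: "2025-01-01T00:00:00Z"   # assumed: time of the last successful flush
    droppedEvents: 0                     # assumed: events dropped by the agent
    enabledKinds: [RequestLifecycle, PrefixHash, Capacity, ColdStart, Cost, Reliability]
---
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
status:
  observability:
    metrics:
      prefixCoverage: 0.62               # assumed meaning: share of requests matching a tracked prefix
      fleetHitRate: 0.41                 # assumed meaning: prefix-cache hit rate across clusters
      perCluster:                        # per-cluster breakdown of the same metric
        - cluster: us-east-1
          hitRate: 0.48
        - cluster: eu-west-1
          hitRate: 0.33
```

Keeping operational health on the cluster and user-facing aggregates on the service mirrors the split between spec.signals and spec.observability.metrics.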

Implementation architecture

Workload plane                                Control plane
─────────────────────                         ──────────────────────────
[engine pods]                                                            
[gateway pods]   ──OTLP──>  [OTel collector]                             
[ModelCache CRs] ──OTLP──>      (single agent     ──>  [aggregator]
[KVOffloadTier]  ──OTLP──>       per cluster,     ──>     (sketches:
[HotPrefixPool]  ──OTLP──>       configured by             Count-Min,
                                 InferenceCluster.         HLL++,
                                 spec.signals)             t-digest;
                                                           rolling 30d)
                                                  │
                                                  └──>  [tracing fanout, #77]
                                                          (raw spans →
                                                           user's Langfuse/
                                                           Langsmith/etc.)

One workload-plane OTel agent per InferenceCluster. Engines/gateway/cache controllers emit via OTLP to the agent; agent forwards to the control-plane collector; collector fans out to (a) the sketch-based aggregator that exposes signals to other primitives via API, and (b) the tracing fanout (#77) that forwards raw spans to the user's configured backend.
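The fan-out described above could be expressed as an OpenTelemetry Collector config with one OTLP receiver and two pipelines. This is a minimal sketch, assuming the aggregator accepts OTLP over gRPC and that the composition function renders something of this shape; the aggregator address is hypothetical, and the auth headers that a hosted tracing backend would need are omitted.

```yaml
# Hypothetical control-plane collector config; service names and ports are illustrative.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317            # per-cluster agents forward here
processors:
  batch: {}
exporters:
  otlp/aggregator:
    endpoint: aggregator.modelplane-system.svc:4317   # assumed in-cluster address
    tls:
      insecure: true                      # sketch only; use TLS in a real deployment
  otlphttp/tracing:
    endpoint: https://cloud.langfuse.com/api/public/otel   # from ModelService observability.traces.endpoint
service:
  pipelines:
    metrics:                              # (a) signals for the sketch-based aggregator
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/aggregator]
    traces:                               # (b) raw spans for the tracing fanout (#77)
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/tracing]
```

Splitting by pipeline type keeps raw spans out of the aggregator path, which matches the privacy stance that only sketches live on this bus.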

Initial signal types in scope for v0.1

The minimal set that unblocks the existing consumers:

  1. RequestLifecycle: TTFT, TPS, end-to-end latency P50/P90/P99, response codes, queue depth, prefix-cache hit rate, prefix-hash samples. Kiely §7.4.3 + §5.3.3.
  2. Capacity: Per-cluster GPU availability by SKU, KV cache utilization per tier (HBM / CPU / SSD / network), replica counts (active + starting), ModelCache bytes staged per cluster. Covers #70; cache-family capacity emission lands here.
  3. ColdStart: 5-phase breakdown: GPU procurement, image loading, model loading, engine startup, ModelCache hydration. Kiely §7.2.2.
  4. Cost: $/GPU-hour per cluster × GPU-hours consumed → derived $/token per service. Kiely §7.4.2.
  5. Reliability: Failure events per (cluster, SKU, root-cause). The Llama 3 paper (Kiely §7.3.3) reports roughly 1 failure per 50K GPU-hrs; a single inference node accumulates about 70K GPU-hrs/year (assuming 8 GPUs × 8,760 h), so at that rate expect on the order of 1.4 failures per node per year.
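The Cost derivation in item 4 can be written out as a formula. This is one plausible reading, not a committed definition: the symbols are introduced here for illustration, with p_c the $/GPU-hour of cluster c, h_{c,s} the GPU-hours service s consumed on cluster c, and T_s the tokens served by s over the same window. The numbers in the worked line are made up.

```latex
\[
\text{cost per token}_{s} \;=\; \frac{\sum_{c} p_{c}\, h_{c,s}}{T_{s}}
\]

% Worked example with hypothetical inputs: one cluster at 2.00 USD/GPU-hour,
% 500 GPU-hours consumed, 4 x 10^8 tokens served.
\[
\frac{2.00 \times 500}{4 \times 10^{8}}
  \;=\; 2.5 \times 10^{-6}\ \text{USD/token}
  \quad (\text{2.50 USD per million tokens})
\]
```

How GPU-hours are attributed to a service when replicas share a node is left open here and would need its own definition.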

Additive signal types (later, same bus)

File as separate issues when consumers materialize:

  • Spec-dec acceptance rate (Kiely §5.2)
  • Quality / drift samples (pairs with future ProvenanceLedger / ModelRecall)
  • Disagg-specific (prefill queue, decode KV exhaustion, xPyD efficiency) — Kiely §5.5
  • Per-tenant rollups (TPM/RPM, $/tenant, SLO compliance)

Privacy defaults

  • Per-tenant aggregation by default; cross-tenant rollups require explicit opt-in via observability.metrics.scope: Fleet
  • Sketches, not raw events, on this bus: no raw prompts or PII leave the cluster (raw prompt/response data is #77's territory, gated by the user's tracing backend choice)
  • Optional sampled-trace export to the tenant's own object store for offline analysis
  • Hardware-class namespacing where relevant (KV signals keyed by GPU family)

Companion: fleet metrics exposure (separate issue)

User-facing surface (CLI, status fields, Prometheus federation, Grafana dashboards) is its own issue when we're ready to commit to UX. Different shape from the optimization primitives that consume the same signals internally. Track separately so v0.1 ships the substrate without committing to the full product surface.

Out of scope for v0.1

  • User-facing analytics CLI / dashboards (companion issue, v0.2)
  • Raw per-request spans for app observability (that's #77, v0.2 on the same OTLP substrate)
  • The optimizer primitives that consume the signals (HotPrefixPool auto-discovery, cost-aware placement, drift detector — all v0.2+ on this bus)

Related issues

  • #77 — v0.2 parallel primitive; OTel traces from the gateway to user's tracing backend, sharing OTLP transport
  • #70, #71, #48 — pure consumers of bus signals
  • #66 ModelCache, #72 KVOffloadTier (emitters on the bus; no consumption in v0.1)
  • PR #64 — design doc; "aggregate gateway metrics" implicit in the KEDA autoscaling story, formalized here

Metadata

Assignees: none
Labels: Scheduling, enhancement
