Customers running production AI workloads need per-request observability: prompts, completions, tokens, latency, eval scores, user sessions. That layer is filled today by Langfuse, LangSmith, Phoenix, Helicone, Datadog LLM Observability, and homegrown OTel pipelines. Modelplane sits at the gateway and sees every request, so it is the natural emission point. But building against a single app-observability backend would lock customers into it and shut out the others.
The right shape is: the gateway emits OpenTelemetry traces using the gen_ai.* semantic conventions to an operator-configured OTLP endpoint. Customers point it at Langfuse, LangSmith, Phoenix, a custom OTel collector, or anything else that speaks OTLP. Modelplane stays neutral.
This is the v0.2 companion to the #74 fleet signal bus. Both use OTLP as the transport, but they carry different signal types for different audiences.
Two observability planes, one transport
| | #74 signal bus | This issue (OTel gateway traces) |
|---|---|---|
| Signal | Aggregated metrics, sketches, time series | Per-request spans (gen_ai.*) |
| Volume | KB/s per cluster | MB/s per active deployment (sampled) |
| Retention | 30d sketches in control plane | Hours hot / months cold in user's backend |
| Audience | Operators, scheduler, fleet matcher | Application developers, eval pipelines |
| PII exposure | Hashes / aggregates only | Full prompts/responses |
| Backend | Modelplane control plane | User's choice (Langfuse / LangSmith / Phoenix / custom) |
Both planes run on the same OTLP collector substrate: one fanout aggregates spans into #74 sketches, the other forwards raw spans to the user's configured backend.
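The two-way fanout can be pictured with a small sketch. This is only an illustration under stated assumptions: `Span`, `Fanout`, and their fields are hypothetical names, plain counters stand in for the #74 sketches, and a list stands in for the OTLP exporter to the user's backend.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical in-memory span; real spans would be OTLP messages."""
    name: str
    model: str
    latency_ms: float
    prompt: str


@dataclass
class Fanout:
    """Hypothetical collector stage with two outputs."""
    # Aggregate path (control plane): counts, latency totals, prompt hashes.
    counts: dict = field(default_factory=lambda: defaultdict(int))
    latency_ms: dict = field(default_factory=lambda: defaultdict(float))
    prompt_hashes: set = field(default_factory=set)
    # Raw path (user's backend): spans queued for forwarding, unchanged.
    outbox: list = field(default_factory=list)

    def handle(self, span: Span) -> None:
        # Fleet-signal side: only aggregates and hashes, never raw text.
        self.counts[span.model] += 1
        self.latency_ms[span.model] += span.latency_ms
        digest = hashlib.sha256(span.prompt.encode("utf-8")).hexdigest()
        self.prompt_hashes.add(digest)
        # App-observability side: forward the full span, prompt included.
        self.outbox.append(span)
```

The point of the split is that the aggregate path never holds prompt text, which matches the "Hashes / aggregates only" row in the table, while the raw path keeps the full payload for the backend the user chose.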
Shape
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
spec:
  endpoints:
    - selector: { matchLabels: { modelplane.ai/deployment: kimi-k2 } }
  observability:
    traces:
      enabled: true
      endpoint: https://cloud.langfuse.com/api/public/otel  # or langsmith / phoenix / custom otel-collector
      headers:
        Authorization:
          valueFrom: { secretKeyRef: { name: langfuse-creds, key: auth } }
      sampling:
        ratio: 1.0
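One way the gateway might honor `sampling.ratio` is a deterministic decision derived from the trace ID, in the spirit of OpenTelemetry's trace-ID-ratio samplers. The sketch below is an assumption about the approach, not the gateway's actual implementation; `should_sample` is a hypothetical helper.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID alone, so every span of a trace agrees."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0.0, 1.0]")
    # Threshold over the 64-bit space, proportional to the configured ratio.
    threshold = int(ratio * (1 << 64))
    # Only the low 64 bits of the (128-bit) trace ID take part in the decision.
    low_bits = trace_id & ((1 << 64) - 1)
    return low_bits < threshold
```

With `ratio: 1.0` every trace is kept and with `0.0` none are. Because the decision depends only on the trace ID, no per-trace state has to be stored or shared between gateway replicas.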
Gateway emits gen_ai.completion, gen_ai.embedding, gen_ai.tool_call spans to the configured OTLP endpoint with standard attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.finish_reasons, latency, error).
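As a rough illustration of the attribute set, the sketch below builds a plain dictionary for one completion span. `Completion` and `span_attributes` are hypothetical names, and `modelplane.latency_ms` is an invented key used only here; in a real OTLP span, latency would normally be carried by the span's start and end timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    """Hypothetical summary of one gateway request/response pair."""
    system: str
    model: str
    input_tokens: int
    output_tokens: int
    finish_reasons: list
    latency_ms: float
    error_class: Optional[str] = None


def span_attributes(c: Completion) -> dict:
    """Map a completion onto gen_ai.* attribute keys named in this issue."""
    attrs = {
        "gen_ai.system": c.system,
        "gen_ai.request.model": c.model,
        "gen_ai.usage.input_tokens": c.input_tokens,
        "gen_ai.usage.output_tokens": c.output_tokens,
        "gen_ai.response.finish_reasons": list(c.finish_reasons),
        # Hypothetical key: span duration would usually convey latency.
        "modelplane.latency_ms": c.latency_ms,
    }
    if c.error_class is not None:
        # error.type is the general OpenTelemetry attribute for an error class.
        attrs["error.type"] = c.error_class
    return attrs
```

A successful request produces no `error.type` key at all, so backends can filter failed requests by the attribute's presence.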
Cache-family attribute enrichment
The gateway already reads cache-family signals via #74 for routing decisions. The same information is attached to outgoing spans as attributes, so operators can correlate per-request latency with cache state in their tracing backend:
gen_ai.request.model = "meta-llama/Llama-3.3-70B-Instruct"
gen_ai.usage.input_tokens = 4231
gen_ai.usage.output_tokens = 187
modelplane.cache.weights = "llama-3-3-70b" # ModelCache ref
modelplane.cache.prefix_warmth = "hot" # HotPrefixPool hit
modelplane.cache.kv_tier_hit = "L2" # KVOffloadTier tier
modelplane.routing.cluster = "us-east-1"
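The enrichment step amounts to merging the cache and routing signals onto the span's attributes before export. A minimal sketch, assuming the gateway holds those signals in a simple dictionary; `enrich_with_cache_state` and the `cache_state` field names are hypothetical.

```python
def enrich_with_cache_state(span_attrs: dict, cache_state: dict) -> dict:
    """Return a copy of span_attrs with modelplane.* attributes added."""
    # Hypothetical field names on the left, attribute keys from this issue on the right.
    mapping = {
        "weights": "modelplane.cache.weights",
        "prefix_warmth": "modelplane.cache.prefix_warmth",
        "kv_tier_hit": "modelplane.cache.kv_tier_hit",
        "cluster": "modelplane.routing.cluster",
    }
    enriched = dict(span_attrs)  # leave the caller's dictionary untouched
    for field_name, attr_key in mapping.items():
        value = cache_state.get(field_name)
        if value is not None:  # skip signals the gateway did not observe
            enriched[attr_key] = value
    return enriched
```

Omitting unobserved signals, rather than emitting empty values, keeps a missing attribute distinguishable from a cold cache in the tracing backend.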
v0.2 scope
- ModelService.spec.observability.traces block with the shape above
- Gateway-side OTel instrumentation with gen_ai.* semantic conventions
- Standard attributes: model name, tokens, latency, finish reason, error class
- Cache-family attribute enrichment (consuming #66 ModelCache / #72 KVOffloadTier / #73 HotPrefixPool status)
- Single OTLP endpoint per ModelService; sampling ratio configurable
- Same OTLP collector substrate as #74 — one workload-plane agent, two control-plane outputs
Related
- ModelService shape
References
- gen_ai.* semantic conventions (still stabilizing; aligning with the working group's current draft)