
v0.2 ModelService observability.traces: OTLP gateway emission #77

Description

@dennis-upbound

Customers running production AI workloads need per-request observability: prompts, completions, tokens, latency, eval scores, and user sessions. Today that layer is filled by Langfuse, LangSmith, Phoenix, Helicone, Datadog LLM Observability, and homegrown OTel pipelines. Modelplane sits at the gateway and sees every request, so it is the natural emission point. But integrating with a single app-observability backend would lock customers into it and shut out the others.

The right shape is: the gateway emits OpenTelemetry traces with gen_ai.* semantic conventions to an operator-configured OTLP endpoint. Customers point it at Langfuse, LangSmith, Phoenix, a custom OTel collector, or anything else that speaks OTLP. Modelplane stays neutral.

This is the v0.2 companion to the #74 fleet signal bus. Both use OTLP as the transport, but they carry different signal types for different audiences.

Two observability planes, one transport

| | #74 signal bus | This issue (OTel gateway traces) |
|---|---|---|
| Signal | Aggregated metrics, sketches, time series | Per-request spans (gen_ai.*) |
| Volume | KB/s per cluster | MB/s per active deployment (sampled) |
| Retention | 30d sketches in control plane | Hours hot / months cold in user's backend |
| Audience | Operators, scheduler, fleet matcher | Application developers, eval pipelines |
| PII exposure | Hashes / aggregates only | Full prompts/responses |
| Backend | Modelplane control plane | User's choice (Langfuse / LangSmith / Phoenix / custom) |

Both planes share the same OTLP collector substrate: one fanout aggregates spans into #74 sketches, the other forwards raw spans to the user's configured backend.
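The two-fanout idea can be sketched in a few lines. This is only an illustration under stated assumptions: `FleetSketch`, `fan_out`, and the in-memory `forwarded` list are hypothetical names standing in for the #74 aggregation path and the OTLP exporter, not part of any existing Modelplane API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FleetSketch:
    """Hypothetical stand-in for the #74 aggregate: per-model counters only."""
    requests: dict = field(default_factory=lambda: defaultdict(int))
    input_tokens: dict = field(default_factory=lambda: defaultdict(int))

    def observe(self, span: dict) -> None:
        # Only aggregate numbers are kept; the span itself is not retained.
        model = span["gen_ai.request.model"]
        self.requests[model] += 1
        self.input_tokens[model] += span.get("gen_ai.usage.input_tokens", 0)


def fan_out(span: dict, sketch: FleetSketch, user_backend: list) -> None:
    """Send one span down both paths."""
    sketch.observe(span)        # control-plane path: aggregates only
    user_backend.append(span)   # user path: full span, unchanged


sketch = FleetSketch()
forwarded = []  # assumption: stands in for the operator-configured OTLP exporter
fan_out({"gen_ai.request.model": "kimi-k2", "gen_ai.usage.input_tokens": 4231}, sketch, forwarded)
fan_out({"gen_ai.request.model": "kimi-k2", "gen_ai.usage.input_tokens": 100}, sketch, forwarded)
```

The point of the split is that the control-plane side never stores raw span content, which matches the "hashes / aggregates only" row in the table above.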

Shape

apiVersion: modelplane.ai/v1alpha1
kind: ModelService
spec:
  endpoints:
    - selector: { matchLabels: { modelplane.ai/deployment: kimi-k2 } }
  observability:
    traces:
      enabled: true
      endpoint: https://cloud.langfuse.com/api/public/otel    # or langsmith / phoenix / custom otel-collector
      headers:
        Authorization:
          valueFrom: { secretKeyRef: { name: langfuse-creds, key: auth } }
      sampling:
        ratio: 1.0
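One plausible reading of `sampling.ratio` is a deterministic, trace-ID-based decision, similar in spirit to OpenTelemetry's ratio-based samplers. The sketch below is an assumption about semantics, not the gateway's actual behavior; `should_sample` is a hypothetical helper.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept.

    Compares the low 64 bits of the trace ID against ratio * 2**64, so the
    same trace ID always gets the same decision at a given ratio.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


keep_all = should_sample(0xDEADBEEF, 1.0)      # ratio 1.0 keeps every trace
keep_none = should_sample(0xDEADBEEF, 0.0)     # ratio 0.0 keeps nothing
low_half = should_sample(1, 0.5)               # low IDs fall under the bound
high_half = should_sample(1 << 63, 0.5)        # IDs at or above the bound are dropped
```

Deciding from the trace ID rather than a random draw means every component that sees the same trace reaches the same decision, so sampled traces stay complete.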

The gateway emits gen_ai.completion, gen_ai.embedding, and gen_ai.tool_call spans to the configured OTLP endpoint, carrying standard attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.finish_reasons) plus latency and error status.
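To make the attribute set concrete, here is a minimal sketch of assembling one completion span as a plain dictionary. `build_completion_span` and the record layout are hypothetical; the `"example-provider"` value is a placeholder; and `error.type` is assumed as the error attribute key (it is the general OpenTelemetry error attribute, but the issue does not name one). Latency is modeled as span duration rather than as an attribute.

```python
def build_completion_span(*, system, model, input_tokens, output_tokens,
                          finish_reasons, start_ns, end_ns, error_type=None):
    """Assemble a span record with the standard attributes listed above."""
    attributes = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }
    if error_type is not None:
        attributes["error.type"] = error_type  # assumed key, see lead-in
    return {
        "name": "gen_ai.completion",           # span name as proposed in this issue
        "start_time_unix_nano": start_ns,
        "end_time_unix_nano": end_ns,
        "attributes": attributes,
    }


span = build_completion_span(
    system="example-provider",                 # hypothetical value
    model="meta-llama/Llama-3.3-70B-Instruct",
    input_tokens=4231,
    output_tokens=187,
    finish_reasons=["stop"],
    start_ns=1_000_000_000,
    end_ns=1_250_000_000,
)
latency_ms = (span["end_time_unix_nano"] - span["start_time_unix_nano"]) / 1e6
```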

Cache-family attribute enrichment

The gateway already reads cache-family signals via #74 for routing decisions. The same information is attached to outgoing spans as attributes, so operators can correlate per-request latency with cache state in their tracing backend:

gen_ai.request.model         = "meta-llama/Llama-3.3-70B-Instruct"
gen_ai.usage.input_tokens    = 4231
gen_ai.usage.output_tokens   = 187
modelplane.cache.weights     = "llama-3-3-70b"             # ModelCache ref
modelplane.cache.prefix_warmth = "hot"                     # HotPrefixPool hit
modelplane.cache.kv_tier_hit = "L2"                        # KVOffloadTier tier
modelplane.routing.cluster   = "us-east-1"
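The enrichment step amounts to merging routing-time cache state into the span's attributes before export. A minimal sketch, assuming the cache state arrives as a small dictionary; `enrich_with_cache_state` and the input key names (`weights`, `prefix_warmth`, `kv_tier_hit`, `cluster`) are hypothetical, while the output attribute names are the ones listed above.

```python
# Hypothetical input keys mapped to the span attribute names proposed above.
CACHE_ATTRIBUTE_KEYS = {
    "weights": "modelplane.cache.weights",
    "prefix_warmth": "modelplane.cache.prefix_warmth",
    "kv_tier_hit": "modelplane.cache.kv_tier_hit",
    "cluster": "modelplane.routing.cluster",
}


def enrich_with_cache_state(attributes: dict, cache_state: dict) -> dict:
    """Return a copy of the span attributes with known cache signals added.

    Signals that are missing or None are skipped rather than emitted empty.
    """
    enriched = dict(attributes)
    for key, attribute_name in CACHE_ATTRIBUTE_KEYS.items():
        value = cache_state.get(key)
        if value is not None:
            enriched[attribute_name] = value
    return enriched


base = {"gen_ai.request.model": "meta-llama/Llama-3.3-70B-Instruct"}
enriched = enrich_with_cache_state(
    base,
    {"weights": "llama-3-3-70b", "prefix_warmth": "hot", "kv_tier_hit": None, "cluster": "us-east-1"},
)
```

Returning a copy keeps the original attributes untouched, so a span can still be exported without enrichment if the cache signals are unavailable.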

v0.2 scope

  • ModelService.spec.observability.traces block with the shape above
  • Gateway-side OTel instrumentation with gen_ai.* semantic conventions
  • Standard attributes: model name, tokens, latency, finish reason, error class
  • Cache-family attribute enrichment (consuming #66 ModelCache / #72 KVOffloadTier / #73 HotPrefixPool status)
  • Single OTLP endpoint per ModelService; sampling ratio configurable
  • Same OTLP collector substrate as #74: one workload-plane agent, two outputs (control-plane sketches and the user's backend)
