Skip to content

Latest commit

 

History

History
134 lines (106 loc) · 4.37 KB

File metadata and controls

134 lines (106 loc) · 4.37 KB

Observability

Orion provides structured logging, Prometheus metrics, distributed tracing, and health monitoring out of the box. No sidecars, no agents. Everything runs inside the single binary.

Structured Logging

Orion emits structured logs in JSON or pretty-printed format, configurable at runtime:

[logging]
level = "info"        # trace, debug, info, warn, error
format = "pretty"     # pretty or json

JSON format is recommended for production. It integrates directly with log aggregators like Loki, Datadog, or CloudWatch:

ORION_LOGGING__FORMAT=json
ORION_LOGGING__LEVEL=info

Per-crate filtering with RUST_LOG gives fine-grained control:

RUST_LOG=orion=debug,tower_http=warn,sqlx=warn
Level Usage
error Failures that need attention
warn Degraded behavior (circuit breakers, retries)
info Request lifecycle, engine reloads, startup/shutdown
debug Detailed processing, SQL queries, connector calls
trace Fine-grained internal state

Every request carries a UUID x-request-id header. Pass your own or let Orion generate one. The ID propagates through logs and responses for end-to-end correlation.

Prometheus Metrics

Enable metrics and scrape at GET /metrics (Prometheus text format):

[metrics]
enabled = true
Metric Type Labels Description
messages_total Counter channel, status Total messages processed
message_duration_seconds Histogram channel Processing latency
active_workflows Gauge Workflows loaded in engine
errors_total Counter type Errors encountered
http_requests_total Counter method, path, status HTTP requests served
http_request_duration_seconds Histogram method, path, status HTTP request latency
db_query_duration_seconds Histogram operation Database query latency
engine_reloads_total Counter status Engine reload events
engine_reload_duration_seconds Histogram Engine reload latency
circuit_breaker_trips_total Counter connector, channel Circuit breaker trip events
circuit_breaker_rejections_total Counter connector, channel Requests rejected by open breakers
channel_executions_total Counter channel Channel invocations
rate_limit_rejections_total Counter client Rate-limited requests

Distributed Tracing

Enable OpenTelemetry trace export with OTLP gRPC:

[tracing]
enabled = true
otlp_endpoint = "http://localhost:4317"
service_name = "orion"
sample_rate = 1.0    # 0.0 (none) to 1.0 (all)
  • W3C Trace Context extraction and propagation: incoming traceparent headers are respected
  • Per-request spans with channel, workflow, and task attributes
  • OTLP gRPC export to Jaeger, Tempo, or any compatible collector
  • Configurable sampling rate for production use
  • Trace context injected into outbound http_call requests for full distributed traces

Health Monitoring

Orion exposes three health endpoints for different operational needs.

Component health: GET /health returns component-level status with automatic degradation detection:

{
  "status": "ok",
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "workflows_loaded": 42,
  "components": {
    "database": "ok",
    "engine": "ok"
  }
}

The health check tests the database with SELECT 1 and verifies engine availability with a configurable lock timeout. If either check fails, the endpoint returns 503 Service Unavailable with "status": "degraded".

Kubernetes probes:

Endpoint Purpose Behavior
GET /healthz Liveness probe Always returns 200. If the process is running, it's alive
GET /readyz Readiness probe Returns 200 only when DB is reachable, engine is loaded, and startup is complete; 503 otherwise
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Engine status: GET /api/v1/admin/engine/status returns a detailed breakdown:

{
  "version": "0.1.0",
  "uptime_seconds": 3600,
  "workflows_count": 42,
  "active_workflows": 38,
  "channels": ["orders", "events", "alerts"]
}