
Monitoring & Observability Setup #8

Priority

P1

Story Points

8

Dependencies

Depends on #3, #6

Description

Stand up end-to-end monitoring, logging, and alerting across the Scribemed platform using Datadog so SWE teams gain visibility into API services, async workers, AI inference paths, and shared infrastructure. This work must extend the Terraform footprint from #3 and instrumentation contracts from #6 to deliver a reusable observability foundation that is HIPAA-aligned and ready for production workloads.

Acceptance Criteria

  • Datadog org configured with dev, staging, prod environments, SSO/RBAC, and API keys stored via existing secrets manager
  • Datadog Cluster + Node Agents deployed on Kubernetes, autodiscovering core services, queues (Kafka/RabbitMQ), PostgreSQL, Redis, vector DB, and Elasticsearch
  • APM instrumentation added to all HTTP and worker services with trace propagation across HTTP and message queue boundaries, capturing key spans (ingestion, transcription, note generation, coding inference)
  • Structured JSON logs shipped to Datadog with PHI scrubbing enforced, enriched with service, env, component, encounter_id, and physician_id (when available)
  • Baseline dashboards live for Platform Overview, Ingestion & AI Pipeline, and Data Stores with documented SLO targets
  • At least six P1/P2 monitors active (API error rate, latency SLO breach, queue backlog, transcription failure %, vector DB outage, LLM timeout rate) routing to Slack/PagerDuty
  • Runbooks published documenting alert responses, dashboard links, and observability conventions
  • Staging validation evidence captured (alert screenshots/logs) before closing issue

Technical Specification

Datadog Account & Access

  • Enable HIPAA-compliant Datadog org; configure SSO + RBAC roles for SWE, SRE, security
  • Provision API/APP keys via Vault/SSM and reference them in Terraform modules (see the key-retrieval sketch below)
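
A minimal sketch of the key-retrieval path, assuming AWS SSM Parameter Store ends up as the secrets backend (the issue leaves Vault vs. SSM open); the parameter path is an illustrative placeholder:

```python
# Sketch: resolve the Datadog API key from SSM Parameter Store at deploy/startup time.
# Assumes SSM is the chosen secrets backend and that /scribemed/datadog/api_key exists;
# both are assumptions, not decisions recorded in this issue.
import boto3


def get_datadog_api_key(param_name: str = "/scribemed/datadog/api_key") -> str:
    """Fetch the Datadog API key stored as a SecureString parameter (KMS-decrypted)."""
    ssm = boto3.client("ssm")
    response = ssm.get_parameter(Name=param_name, WithDecryption=True)
    return response["Parameter"]["Value"]
```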

Agent Deployment

  • Extend the Terraform modules from HIPAA-Compliant AWS Infrastructure with Terraform (#3) to install the Datadog Cluster Agent and Node Agents, with autodiscovery templates for services/* pods, RabbitMQ/Kafka, PostgreSQL, Redis, pgvector, and Elasticsearch
  • Enable log collection, APM, Live Processes, and Kubernetes events; tag resources with env, service, team

Service Instrumentation

  • Integrate the Datadog APM libraries (Node.js and Python) with the OpenTelemetry bridge provided by the shared monitoring package from Shared Libraries Package Setup (#6)
  • Implement trace propagation (dd-trace headers, W3C Trace Context) for REST, gRPC, and message queues, including encounter_id baggage (see the sketch below)
  • Capture critical spans for the ingestion pipeline, transcription jobs, RAG retrieval, coding inference, and workflow automation tasks
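
A sketch of the Python half of this, using ddtrace for spans and context propagation across a Kafka-style queue boundary; the span names, service names, topic, and the app-level encounter_id header are placeholders for the contracts the shared monitoring package (#6) will define:

```python
# Sketch: Datadog APM spans plus trace propagation across a queue boundary.
# Span/service/topic names and the encounter_id header are illustrative assumptions.
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator


def publish_transcription_job(producer, encounter_id: str, payload: bytes) -> None:
    """Producer side: open a span and inject its trace context into message headers."""
    with tracer.trace("transcription.enqueue", service="ingestion-api") as span:
        span.set_tag("encounter_id", encounter_id)
        headers: dict = {}
        # Injects dd-trace / W3C trace headers (style set via DD_TRACE_PROPAGATION_STYLE).
        HTTPPropagator.inject(span.context, headers)
        headers["encounter_id"] = encounter_id  # app-level stand-in for trace baggage
        producer.send("transcription-jobs", value=payload,
                      headers=[(k, v.encode()) for k, v in headers.items()])


def handle_transcription_job(message) -> None:
    """Consumer side: extract the upstream context so the worker span joins the trace."""
    headers = {k: v.decode() for k, v in (message.headers or [])}
    ctx = HTTPPropagator.extract(headers)
    tracer.context_provider.activate(ctx)
    with tracer.trace("transcription.process", service="transcription-worker") as span:
        span.set_tag("encounter_id", headers.get("encounter_id", "unknown"))
        # ... run the transcription job ...
```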

Logging Pipeline

  • Use the structured logging middleware from the shared libraries, shipping logs via a log-shipper sidecar or the Datadog agent log intake
  • Define Datadog log pipelines for JSON parsing and sensitive-field redaction, routing output to the HIPAA-compliant index (see the scrubbing sketch below)
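
A sketch of the application-side half of the pipeline, assuming structlog as the structured-logging middleware; the PHI field list and the module path (shared_monitoring/logging_setup.py) are illustrative, and the Datadog log pipeline still applies its own sensitive-data scanning as a second layer:

```python
# Sketch: shared_monitoring/logging_setup.py (hypothetical module path).
# JSON logs with application-side PHI scrubbing before anything reaches Datadog intake.
import structlog

# Fields that must never leave the service in clear text (illustrative, not exhaustive).
PHI_FIELDS = {"patient_name", "dob", "ssn", "mrn", "address", "transcript_text"}


def scrub_phi(logger, method_name, event_dict):
    """structlog processor: replace PHI-bearing keys with a redaction marker."""
    for key in PHI_FIELDS & event_dict.keys():
        event_dict[key] = "[REDACTED]"
    return event_dict


def configure_logging() -> None:
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,  # carries encounter_id, physician_id
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            scrub_phi,
            structlog.processors.JSONRenderer(),      # JSON lines picked up by the agent
        ]
    )
```

Services would call configure_logging() at startup and bind encounter_id/physician_id with structlog.contextvars.bind_contextvars so merge_contextvars attaches them to every log line.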

Dashboards & Metrics

  • Create dashboard templates:
    • Platform Overview: uptime, error budgets, request latency, top failing services
    • Ingestion & AI Pipeline: queue depth, job throughput, transcription latency distribution, LLM success rate
    • Data Stores: PostgreSQL slow queries, Redis hit rate, vector DB latency, S3 upload errors
  • Leverage custom metrics for hallucination flags, inference success/failure, and queue lag (see the DogStatsD sketch below)
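
A sketch of how those custom metrics might be emitted through DogStatsD (datadogpy); metric names and tags are placeholders pending an agreed naming scheme:

```python
# Sketch: custom metrics via DogStatsD. Metric names and tag keys are illustrative only.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # local Datadog agent


def record_inference(success: bool, latency_ms: float, hallucination_flag: bool) -> None:
    tags = ["service:coding-inference", "env:staging"]
    statsd.increment("scribemed.inference.success" if success
                     else "scribemed.inference.failure", tags=tags)
    statsd.histogram("scribemed.inference.latency_ms", latency_ms, tags=tags)
    if hallucination_flag:
        statsd.increment("scribemed.inference.hallucination_flagged", tags=tags)


def record_queue_lag(queue: str, lag_seconds: float) -> None:
    statsd.gauge("scribemed.queue.lag_seconds", lag_seconds, tags=[f"queue:{queue}"])
```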

Alerting & Integrations

  • Configure Slack + PagerDuty integrations
  • Build monitors with runbook links and tags (team:sre, service:transcription, etc.); see the monitor-creation sketch below
  • Establish maintenance windows for deployments via Terraform variables
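
For illustration, one of the P1 monitors expressed through the Datadog API with datadogpy; in practice these will most likely live as Terraform datadog_monitor resources, and the query, thresholds, and notification handles below are assumptions:

```python
# Sketch: API error-rate monitor created via datadogpy. Query, thresholds, runbook link,
# and notification handles are illustrative; real monitors may be Terraform-managed.
from datadog import initialize, api

initialize(api_key="<from-secrets-manager>", app_key="<from-secrets-manager>")

api.Monitor.create(
    type="query alert",
    name="[P1] API error rate above 5% (staging)",
    query=("sum(last_5m):sum:trace.http.request.errors{env:staging}.as_count() / "
           "sum:trace.http.request.hits{env:staging}.as_count() > 0.05"),
    message=(
        "API 5xx error rate breached 5% over the last 5 minutes.\n"
        "Runbook: docs/observability/monitoring.md#api-error-rate\n"
        "@slack-scribemed-alerts @pagerduty-scribemed"
    ),
    tags=["team:sre", "service:api", "priority:p1"],
    options={"thresholds": {"critical": 0.05}, "notify_no_data": False},
)
```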

Documentation & Runbooks

  • Add docs/observability/monitoring.md covering tagging schema, dashboard URLs, alert catalog, validation steps, and onboarding checklist
  • Record troubleshooting guides for common failures (agent crash, API throttling, missing logs)

Implementation Steps

  1. Confirm Datadog org setup, environments, SSO, and secrets management strategy with infrastructure team
  2. Extend Terraform modules to deploy cluster/node agents, configure integrations, and manage API keys
  3. Update service templates (HTTP + worker) to initialize Datadog tracing and log enrichment via shared monitoring package
  4. Instrument async workers and queue consumers for trace linkage and custom metrics (queue lag, DLQ size)
  5. Configure log pipelines and PHI scrubbing rules; validate with sample payloads
  6. Build and share core dashboards; review metrics with product + clinical stakeholders for completeness
  7. Define and enable P1/P2 monitors, routing to Slack/PagerDuty, and attach runbooks
  8. Execute staging game-day tests (simulate queue backlog, force a 5xx spike, induce an LLM timeout) to verify alerts and dashboards; see the load sketch after this list
  9. Capture validation artifacts and finalize documentation before closing
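
As a concrete example of step 8, a throwaway script along these lines could drive the synthetic 5xx spike in staging; the endpoint and fault-injection header are hypothetical and depend on whatever fault hooks staging actually exposes:

```python
# Sketch: game-day 5xx spike generator for staging. URL and chaos header are placeholders.
import time

import requests

STAGING_URL = "https://staging.scribemed.example/api/v1/encounters"  # placeholder

for _ in range(500):
    # Hypothetical fault-injection header that makes the service return a 500.
    requests.post(STAGING_URL, json={"synthetic": True},
                  headers={"X-Chaos-Force-Error": "500"}, timeout=5)
    time.sleep(0.2)  # ~5 req/s keeps the spike visible on dashboards for a few minutes
```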

Testing Requirements

  • Unit/integration tests verifying instrumentation hooks emit spans/logs without impacting request latency budgets (see the scrubbing test sketch after this list)
  • Staging smoke tests confirming dashboards update in near real time for synthetic load
  • Game-day simulations produce expected alerts with correct routing and context
  • Terraform plan/apply passes CI checks with new Datadog resources
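
A sketch of the log-side unit tests, assuming the scrub_phi processor and hypothetical module path from the logging sketch above:

```python
# Sketch: pytest unit test for the PHI scrubbing processor (hypothetical import path).
from shared_monitoring.logging_setup import scrub_phi


def test_scrub_phi_redacts_sensitive_fields():
    event = {
        "event": "note.generated",
        "encounter_id": "enc-123",
        "patient_name": "Jane Doe",              # PHI, must be redacted
        "transcript_text": "sample transcript",  # PHI, must be redacted
    }
    scrubbed = scrub_phi(None, "info", dict(event))
    assert scrubbed["patient_name"] == "[REDACTED]"
    assert scrubbed["transcript_text"] == "[REDACTED]"
    assert scrubbed["encounter_id"] == "enc-123"  # operational IDs stay intact
```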

Documentation

  • docs/observability/monitoring.md with architecture diagram, onboarding checklist, and runbooks
  • Update root README.md or docs/architecture index to reference new observability docs
  • Add developer onboarding notes showing how to instrument new services with the shared monitoring package (see the sketch below)
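
The onboarding notes could center on a snippet like the following; the shared_monitoring package and its entry points are hypothetical stand-ins for whatever #6 actually exports:

```python
# Sketch: bootstrap observability in a new Python service via the shared package.
# Package name, module path, and configure_logging are hypothetical.
import structlog
from ddtrace import patch_all

from shared_monitoring.logging_setup import configure_logging


def bootstrap_observability(service_name: str, env: str) -> None:
    patch_all()          # auto-instrument supported libraries with ddtrace
    configure_logging()  # JSON logs + PHI scrubbing (see the logging sketch above)
    structlog.contextvars.bind_contextvars(service=service_name, env=env)


bootstrap_observability(service_name="note-generation", env="dev")
```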

Labels

  • devops: CI/CD and operations tooling
  • epic-foundation: Foundational platform work
  • infrastructure: Infrastructure-related work
  • p1: High priority (important for iteration)
