-
Notifications
You must be signed in to change notification settings - Fork 1
Open
Labels
devopsCI/CD and operations toolingCI/CD and operations toolingepic-foundationFoundational platform workFoundational platform workinfrastructureInfrastructure-related workInfrastructure-related workp1High priority (important for iteration)High priority (important for iteration)
Description
Priority
P1
Story Points
8
Dependencies
Description
Stand up end-to-end monitoring, logging, and alerting across the Scribemed platform using Datadog so SWE teams gain visibility into API services, async workers, AI inference paths, and shared infrastructure. This work must extend the Terraform footprint from #3 and instrumentation contracts from #6 to deliver a reusable observability foundation that is HIPAA-aligned and ready for production workloads.
Acceptance Criteria
- Datadog org configured with
dev,staging,prodenvironments, SSO/RBAC, and API keys stored via existing secrets manager - Datadog Cluster + Node Agents deployed on Kubernetes, autodiscovering core services, queues (Kafka/RabbitMQ), PostgreSQL, Redis, vector DB, and Elasticsearch
- APM instrumentation added to all HTTP and worker services with trace propagation across HTTP and message queue boundaries, capturing key spans (ingestion, transcription, note generation, coding inference)
- Structured JSON logs shipped to Datadog with PHI scrubbing enforced, enriched with
service,env,component,encounter_id, andphysician_id(when available) - Baseline dashboards live for Platform Overview, Ingestion & AI Pipeline, and Data Stores with documented SLO targets
- At least six P1/P2 monitors active (API error rate, latency SLO breach, queue backlog, transcription failure %, vector DB outage, LLM timeout rate) routing to Slack/PagerDuty
- Runbooks published documenting alert responses, dashboard links, and observability conventions
- Staging validation evidence captured (alert screenshots/logs) before closing issue
Technical Specification
Datadog Account & Access
- Enable HIPAA-compliant Datadog org; configure SSO + RBAC roles for SWE, SRE, security
- Provision API/APP keys via Vault/SSM and reference them in Terraform modules
Agent Deployment
- Extend Terraform from HIPAA-Compliant AWS Infrastructure with Terraform #3 to install Datadog Cluster Agent + Node Agents with autodiscovery templates for
services/*pods, RabbitMQ/Kafka, PostgreSQL, Redis, pgvector, Elasticsearch - Enable log collection, APM, Live Processes, and Kubernetes events; tag resources with
env,service,team
Service Instrumentation
- Integrate Datadog APM libraries (Node.js + Python) with OpenTelemetry bridge from Shared Libraries Package Setup #6 shared monitoring package
- Implement trace propagation (
dd-traceheaders, W3C Trace Context) for REST, gRPC, and message queues, includingencounter_idbaggage - Capture critical spans for ingestion pipeline, transcription jobs, RAG retrieval, coding inference, and workflow automation tasks
Logging Pipeline
- Use structured logging middleware (from shared libraries) with log shipper sidecar or Datadog agent intake
- Define Datadog log pipelines for JSON parsing, sensitive-field redaction, and route to HIPAA index
Dashboards & Metrics
- Create dashboard templates:
- Platform Overview: uptime, error budgets, request latency, top failing services
- Ingestion & AI Pipeline: queue depth, job throughput, transcription latency distribution, LLM success rate
- Data Stores: PostgreSQL slow queries, Redis hit rate, vector DB latency, S3 upload errors
- Leverage custom metrics for hallucination flags, inference success/failure, queue lag
Alerting & Integrations
- Configure Slack + PagerDuty integrations
- Build monitors with runbook links and tags (
team:sre,service:transcription, etc.) - Establish maintenance windows for deployments via Terraform variables
Documentation & Runbooks
- Add
docs/observability/monitoring.mdcovering tagging schema, dashboard URLs, alert catalog, validation steps, and onboarding checklist - Record troubleshooting guides for common failures (agent crash, API throttling, missing logs)
Implementation Steps
- Confirm Datadog org setup, environments, SSO, and secrets management strategy with infrastructure team
- Extend Terraform modules to deploy cluster/node agents, configure integrations, and manage API keys
- Update service templates (HTTP + worker) to initialize Datadog tracing and log enrichment via shared monitoring package
- Instrument async workers and queue consumers for trace linkage and custom metrics (queue lag, DLQ size)
- Configure log pipelines and PHI scrubbing rules; validate with sample payloads
- Build and share core dashboards; review metrics with product + clinical stakeholders for completeness
- Define and enable P1/P2 monitors, routing to Slack/PagerDuty, and attach runbooks
- Execute staging game-day tests (simulate queue backlog, force 5xx spike, induce LLM timeout) to verify alerts and dashboards
- Capture validation artifacts and finalize documentation before closing
Testing Requirements
- Unit/integration tests verifying instrumentation hooks emit spans/logs without impacting request latency budgets
- Staging smoke tests confirming dashboards update in near real time for synthetic load
- Game-day simulations produce expected alerts with correct routing and context
- Terraform plan/apply passes CI checks with new Datadog resources
Documentation
docs/observability/monitoring.mdwith architecture diagram, onboarding checklist, and runbooks- Update root
README.mdordocs/architectureindex to reference new observability docs - Add developer onboarding notes showing how to instrument new services using shared monitoring package
Metadata
Metadata
Assignees
Labels
devopsCI/CD and operations toolingCI/CD and operations toolingepic-foundationFoundational platform workFoundational platform workinfrastructureInfrastructure-related workInfrastructure-related workp1High priority (important for iteration)High priority (important for iteration)