
Monitoring & Observability Setup #8

Priority

P1

Story Points

8

Dependencies

Depends on #3, #6

Description

Stand up end-to-end monitoring, logging, and alerting across the Scribemed platform using Datadog so SWE teams gain visibility into API services, async workers, AI inference paths, and shared infrastructure. This work must extend the Terraform footprint from #3 and instrumentation contracts from #6 to deliver a reusable observability foundation that is HIPAA-aligned and ready for production workloads.

Acceptance Criteria

  • Datadog org configured with dev, staging, prod environments, SSO/RBAC, and API keys stored via existing secrets manager
  • Datadog Cluster + Node Agents deployed on Kubernetes, autodiscovering core services, queues (Kafka/RabbitMQ), PostgreSQL, Redis, vector DB, and Elasticsearch
  • APM instrumentation added to all HTTP and worker services with trace propagation across HTTP and message queue boundaries, capturing key spans (ingestion, transcription, note generation, coding inference)
  • Structured JSON logs shipped to Datadog with PHI scrubbing enforced, enriched with service, env, component, encounter_id, and physician_id (when available)
  • Baseline dashboards live for Platform Overview, Ingestion & AI Pipeline, and Data Stores with documented SLO targets
  • At least six P1/P2 monitors active (API error rate, latency SLO breach, queue backlog, transcription failure %, vector DB outage, LLM timeout rate) routing to Slack/PagerDuty
  • Runbooks published documenting alert responses, dashboard links, and observability conventions
  • Staging validation evidence captured (alert screenshots/logs) before closing issue

Technical Specification

Datadog Account & Access

  • Enable HIPAA-compliant Datadog org; configure SSO + RBAC roles for SWE, SRE, security
  • Provision API/APP keys via Vault/SSM and reference them in Terraform modules (see the key-retrieval sketch below)
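
A minimal sketch of the key-retrieval path, assuming AWS SSM Parameter Store ends up as the secrets backend (the issue leaves Vault vs. SSM open); the parameter path is an illustrative placeholder:

```python
# Sketch: resolve the Datadog API key from SSM Parameter Store at deploy/startup time.
# Assumes SSM is the chosen secrets backend and that /scribemed/datadog/api_key exists;
# both are assumptions, not decisions recorded in this issue.
import boto3


def get_datadog_api_key(param_name: str = "/scribemed/datadog/api_key") -> str:
    """Fetch the Datadog API key stored as a SecureString parameter (KMS-decrypted)."""
    ssm = boto3.client("ssm")
    response = ssm.get_parameter(Name=param_name, WithDecryption=True)
    return response["Parameter"]["Value"]
```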

Agent Deployment

  • Extend the Terraform modules from HIPAA-Compliant AWS Infrastructure with Terraform (#3) to install the Datadog Cluster Agent and Node Agents, with autodiscovery templates for services/* pods, RabbitMQ/Kafka, PostgreSQL, Redis, pgvector, and Elasticsearch
  • Enable log collection, APM, Live Processes, and Kubernetes events; tag resources with env, service, team

Service Instrumentation

  • Integrate the Datadog APM libraries (Node.js and Python) with the OpenTelemetry bridge provided by the shared monitoring package from Shared Libraries Package Setup (#6)
  • Implement trace propagation (dd-trace headers, W3C Trace Context) for REST, gRPC, and message queues, including encounter_id baggage (see the sketch below)
  • Capture critical spans for the ingestion pipeline, transcription jobs, RAG retrieval, coding inference, and workflow automation tasks
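
A sketch of the Python half of this, using ddtrace for spans and context propagation across a Kafka-style queue boundary; the span names, service names, topic, and the app-level encounter_id header are placeholders for the contracts the shared monitoring package (#6) will define:

```python
# Sketch: Datadog APM spans plus trace propagation across a queue boundary.
# Span/service/topic names and the encounter_id header are illustrative assumptions.
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator


def publish_transcription_job(producer, encounter_id: str, payload: bytes) -> None:
    """Producer side: open a span and inject its trace context into message headers."""
    with tracer.trace("transcription.enqueue", service="ingestion-api") as span:
        span.set_tag("encounter_id", encounter_id)
        headers: dict = {}
        # Injects dd-trace / W3C trace headers (style set via DD_TRACE_PROPAGATION_STYLE).
        HTTPPropagator.inject(span.context, headers)
        headers["encounter_id"] = encounter_id  # app-level stand-in for trace baggage
        producer.send("transcription-jobs", value=payload,
                      headers=[(k, v.encode()) for k, v in headers.items()])


def handle_transcription_job(message) -> None:
    """Consumer side: extract the upstream context so the worker span joins the trace."""
    headers = {k: v.decode() for k, v in (message.headers or [])}
    ctx = HTTPPropagator.extract(headers)
    tracer.context_provider.activate(ctx)
    with tracer.trace("transcription.process", service="transcription-worker") as span:
        span.set_tag("encounter_id", headers.get("encounter_id", "unknown"))
        # ... run the transcription job ...
```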

Logging Pipeline

  • Use the structured logging middleware from the shared libraries, shipping logs via a log-shipper sidecar or the Datadog agent log intake
  • Define Datadog log pipelines for JSON parsing and sensitive-field redaction, routing output to the HIPAA-compliant index (see the scrubbing sketch below)
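
A sketch of the application-side half of the pipeline, assuming structlog as the structured-logging middleware; the PHI field list and the module path (shared_monitoring/logging_setup.py) are illustrative, and the Datadog log pipeline still applies its own sensitive-data scanning as a second layer:

```python
# Sketch: shared_monitoring/logging_setup.py (hypothetical module path).
# JSON logs with application-side PHI scrubbing before anything reaches Datadog intake.
import structlog

# Fields that must never leave the service in clear text (illustrative, not exhaustive).
PHI_FIELDS = {"patient_name", "dob", "ssn", "mrn", "address", "transcript_text"}


def scrub_phi(logger, method_name, event_dict):
    """structlog processor: replace PHI-bearing keys with a redaction marker."""
    for key in PHI_FIELDS & event_dict.keys():
        event_dict[key] = "[REDACTED]"
    return event_dict


def configure_logging() -> None:
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,  # carries encounter_id, physician_id
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            scrub_phi,
            structlog.processors.JSONRenderer(),      # JSON lines picked up by the agent
        ]
    )
```

Services would call configure_logging() at startup and bind encounter_id/physician_id with structlog.contextvars.bind_contextvars so merge_contextvars attaches them to every log line.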

Dashboards & Metrics

  • Create dashboard templates:
    • Platform Overview: uptime, error budgets, request latency, top failing services
    • Ingestion & AI Pipeline: queue depth, job throughput, transcription latency distribution, LLM success rate
    • Data Stores: PostgreSQL slow queries, Redis hit rate, vector DB latency, S3 upload errors
  • Leverage custom metrics for hallucination flags, inference success/failure, and queue lag (see the DogStatsD sketch below)
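
A sketch of how those custom metrics might be emitted through DogStatsD (datadogpy); metric names and tags are placeholders pending an agreed naming scheme:

```python
# Sketch: custom metrics via DogStatsD. Metric names and tag keys are illustrative only.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # local Datadog agent


def record_inference(success: bool, latency_ms: float, hallucination_flag: bool) -> None:
    tags = ["service:coding-inference", "env:staging"]
    statsd.increment("scribemed.inference.success" if success
                     else "scribemed.inference.failure", tags=tags)
    statsd.histogram("scribemed.inference.latency_ms", latency_ms, tags=tags)
    if hallucination_flag:
        statsd.increment("scribemed.inference.hallucination_flagged", tags=tags)


def record_queue_lag(queue: str, lag_seconds: float) -> None:
    statsd.gauge("scribemed.queue.lag_seconds", lag_seconds, tags=[f"queue:{queue}"])
```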

Alerting & Integrations

  • Configure Slack + PagerDuty integrations
  • Build monitors with runbook links and tags (team:sre, service:transcription, etc.); see the monitor-creation sketch below
  • Establish maintenance windows for deployments via Terraform variables
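
For illustration, one of the P1 monitors expressed through the Datadog API with datadogpy; in practice these will most likely live as Terraform datadog_monitor resources, and the query, thresholds, and notification handles below are assumptions:

```python
# Sketch: API error-rate monitor created via datadogpy. Query, thresholds, runbook link,
# and notification handles are illustrative; real monitors may be Terraform-managed.
from datadog import initialize, api

initialize(api_key="<from-secrets-manager>", app_key="<from-secrets-manager>")

api.Monitor.create(
    type="query alert",
    name="[P1] API error rate above 5% (staging)",
    query=("sum(last_5m):sum:trace.http.request.errors{env:staging}.as_count() / "
           "sum:trace.http.request.hits{env:staging}.as_count() > 0.05"),
    message=(
        "API 5xx error rate breached 5% over the last 5 minutes.\n"
        "Runbook: docs/observability/monitoring.md#api-error-rate\n"
        "@slack-scribemed-alerts @pagerduty-scribemed"
    ),
    tags=["team:sre", "service:api", "priority:p1"],
    options={"thresholds": {"critical": 0.05}, "notify_no_data": False},
)
```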

Documentation & Runbooks

  • Add docs/observability/monitoring.md covering tagging schema, dashboard URLs, alert catalog, validation steps, and onboarding checklist
  • Record troubleshooting guides for common failures (agent crash, API throttling, missing logs)

Implementation Steps

  1. Confirm Datadog org setup, environments, SSO, and secrets management strategy with infrastructure team
  2. Extend Terraform modules to deploy cluster/node agents, configure integrations, and manage API keys
  3. Update service templates (HTTP + worker) to initialize Datadog tracing and log enrichment via shared monitoring package
  4. Instrument async workers and queue consumers for trace linkage and custom metrics (queue lag, DLQ size)
  5. Configure log pipelines and PHI scrubbing rules; validate with sample payloads
  6. Build and share core dashboards; review metrics with product + clinical stakeholders for completeness
  7. Define and enable P1/P2 monitors, routing to Slack/PagerDuty, and attach runbooks
  8. Execute staging game-day tests (simulate queue backlog, force a 5xx spike, induce an LLM timeout) to verify alerts and dashboards; see the load sketch after this list
  9. Capture validation artifacts and finalize documentation before closing
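
As a concrete example of step 8, a throwaway script along these lines could drive the synthetic 5xx spike in staging; the endpoint and fault-injection header are hypothetical and depend on whatever fault hooks staging actually exposes:

```python
# Sketch: game-day 5xx spike generator for staging. URL and chaos header are placeholders.
import time

import requests

STAGING_URL = "https://staging.scribemed.example/api/v1/encounters"  # placeholder

for _ in range(500):
    # Hypothetical fault-injection header that makes the service return a 500.
    requests.post(STAGING_URL, json={"synthetic": True},
                  headers={"X-Chaos-Force-Error": "500"}, timeout=5)
    time.sleep(0.2)  # ~5 req/s keeps the spike visible on dashboards for a few minutes
```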

Testing Requirements

  • Unit/integration tests verifying instrumentation hooks emit spans/logs without impacting request latency budgets (see the scrubbing test sketch after this list)
  • Staging smoke tests confirming dashboards update in near real time for synthetic load
  • Game-day simulations produce expected alerts with correct routing and context
  • Terraform plan/apply passes CI checks with new Datadog resources
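
A sketch of the log-side unit tests, assuming the scrub_phi processor and hypothetical module path from the logging sketch above:

```python
# Sketch: pytest unit test for the PHI scrubbing processor (hypothetical import path).
from shared_monitoring.logging_setup import scrub_phi


def test_scrub_phi_redacts_sensitive_fields():
    event = {
        "event": "note.generated",
        "encounter_id": "enc-123",
        "patient_name": "Jane Doe",              # PHI, must be redacted
        "transcript_text": "sample transcript",  # PHI, must be redacted
    }
    scrubbed = scrub_phi(None, "info", dict(event))
    assert scrubbed["patient_name"] == "[REDACTED]"
    assert scrubbed["transcript_text"] == "[REDACTED]"
    assert scrubbed["encounter_id"] == "enc-123"  # operational IDs stay intact
```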

Documentation

  • docs/observability/monitoring.md with architecture diagram, onboarding checklist, and runbooks
  • Update root README.md or docs/architecture index to reference new observability docs
  • Add developer onboarding notes showing how to instrument new services with the shared monitoring package (see the sketch below)
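
The onboarding notes could center on a snippet like the following; the shared_monitoring package and its entry points are hypothetical stand-ins for whatever #6 actually exports:

```python
# Sketch: bootstrap observability in a new Python service via the shared package.
# Package name, module path, and configure_logging are hypothetical.
import structlog
from ddtrace import patch_all

from shared_monitoring.logging_setup import configure_logging


def bootstrap_observability(service_name: str, env: str) -> None:
    patch_all()          # auto-instrument supported libraries with ddtrace
    configure_logging()  # JSON logs + PHI scrubbing (see the logging sketch above)
    structlog.contextvars.bind_contextvars(service=service_name, env=env)


bootstrap_observability(service_name="note-generation", env="dev")
```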

Labels

  • devops: CI/CD and operations tooling
  • epic-foundation: Foundational platform work
  • infrastructure: Infrastructure-related work
  • p1: High priority (important for iteration)
