nanohype · stxkxs · Jul 2, 2026 · Jul 2, 2026
diff --git a/README.md b/README.md
@@ -152,7 +152,12 @@ task clean                    # Remove rendered output
 - [Environment Configuration](docs/configuration/environments.md)
 - [Adding Addons](docs/configuration/adding-addons.md)
 - [Contributing](docs/development/contributing.md)
-- [Troubleshooting](docs/runbooks/troubleshooting.md)
+- Runbooks
+  - [Troubleshooting](docs/runbooks/troubleshooting.md)
+  - [Addon Sync Stuck or Degraded](docs/runbooks/addon-sync-degraded.md)
+  - [Rolling Back an Addon](docs/runbooks/rollback.md)
+  - [Druid Operations](docs/runbooks/druid-operations.md)
+  - [Render-Gate Failures on PRs](docs/runbooks/render-gate-failures.md)
 
 ## License
 

diff --git a/docs/runbooks/addon-sync-degraded.md b/docs/runbooks/addon-sync-degraded.md
@@ -0,0 +1,69 @@
+# Runbook — Addon Sync Stuck or Degraded
+
+**Severity**: high — a degraded addon can hold back everything that depends on it (CRDs, secrets, certificates). **Scope**: any Application generated by the ApplicationSets in `applicationsets/`.
+
+## Symptoms
+
+- `argocd app list` shows an addon `OutOfSync` for longer than one sync + retry cycle (retry limit is 5 with exponential backoff capped at 3m — roughly 10 minutes end to end)
+- Application health is `Degraded` or `Progressing` without converging
+- An expected Application does not exist at all
+- Resources the addon owns keep flapping between applied and reverted
+
+## Diagnosis
+
+Start with the Application's own status — conditions carry the actual error:
+
+```bash
+argocd app get <app-name>
+kubectl get application <app-name> -n argocd -o jsonpath='{.status.conditions}' | jq
+kubectl get application <app-name> -n argocd -o jsonpath='{.status.operationState.message}'
+```
+
+Then narrow by symptom:
+
+**Application missing entirely** — the generator never produced it. Every ApplicationSet uses a matrix of a `clusters` selector and a `list`/`git` generator, keyed on cluster secret labels:
+
+```bash
+kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster
+kubectl get secret <cluster-secret> -n argocd -o jsonpath='{.metadata.labels}' | jq
+kubectl logs -n argocd -l app.kubernetes.io/component=applicationset-controller --tail=100
+```
+
+The `environment` label is mandatory — it resolves `values-{env}.yaml` paths and overlay directories, and `goTemplateOptions: ["missingkey=error"]` means a missing label fails template rendering for that cluster instead of generating a broken app. Also check exclusions: the hub cluster is deliberately excluded from workload ApplicationSets (e.g. `druid-tenants` matches `environment NotIn [hub]`).
+
+**OutOfSync with sync errors** — read `operationState`. The recurring shapes:
+
+- *CRD kind not found* — a CRD-dependent resource synced before its CRD chart. Sync waves order this: `prometheus-operator-crds` (bootstrap, wave 0-2) before anything shipping ServiceMonitors, `kagent-crds`/`agentgateway-crds` before `kagent`/`agentgateway`, `cert-manager` before any Certificate. Waves only order creation during a coordinated sync — on a fresh cluster, confirm the CRD app itself is Synced/Healthy before chasing the consumer.
+- *ComparisonError / values ref failure* — the `$values` multi-source ref points at this repo on `main`; a missing `values-{env}.yaml` for the cluster's environment breaks comparison. Every Helm addon needs `values.yaml` plus the env delta file.
+- *ServerSideApply field conflict* — another controller owns a field. Check `operationState.message` for the conflicting manager.
+
+**Healthy in git, Degraded in cluster** — look at the addon's own workloads (`kubectl -n <ns> get pods,events`). ExternalSecret-backed addons degrade when the `ClusterSecretStore` can't reach AWS Secrets Manager — check `kubectl get clustersecretstore aws-secrets-manager -o yaml` conditions first.
+
+**Perpetual OutOfSync on a StatefulSet** — expected drift on `volumeClaimTemplates` / `persistentVolumeClaimRetentionPolicy` is already handled by `ignoreDifferences` + `RespectIgnoreDifferences=true` in the observability ApplicationSet. If a new StatefulSet-shipping addon flaps on those fields, the fix is extending `ignoreDifferences` in its ApplicationSet, not manual syncs.
+
+## Remediation
+
+1. **Fix the cause in git** — this repo is the source of truth and `selfHeal: true` reverts anything else. Values fixes, wave reordering, and `ignoreDifferences` additions all land as PRs through the render gates.
+2. **Targeted refresh** when ArgoCD's cache is stale (git is right, app disagrees):
+   ```bash
+   argocd app get <app-name> --hard-refresh
+   ```
+3. **Targeted sync** for one stuck resource instead of the whole app:
+   ```bash
+   argocd app sync <app-name> --resource <group>:<Kind>:<name>
+   ```
+4. **Terminate a wedged operation** before retrying:
+   ```bash
+   argocd app terminate-op <app-name> && argocd app sync <app-name>
+   ```
+5. Know what won't stick: the ApplicationSet controller owns the generated Application spec, so `argocd app set` overrides (sync policy, target revision) are reverted on its next reconcile. Manual levers buy minutes for diagnosis, not a durable state.
+
+## Verification
+
+```bash
+argocd app get <app-name>          # Synced / Healthy
+kubectl get application <app-name> -n argocd -o jsonpath='{.status.conditions}'   # empty or info-only
+kubectl -n <addon-namespace> get pods
+```
+
+Confirm the app stays Synced through one full self-heal interval (a few minutes) — a fix that only survives until the next reconcile means the ApplicationSet or git still disagrees with what you applied.
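
The "roughly 10 minutes" retry budget quoted under Symptoms can be sanity-checked with a little arithmetic. The sketch below sums the waits between retries for the runbook's limit of 5 and 3m cap. The 30s base delay and the factor of 2 are assumptions for illustration, not values read from the ApplicationSets, and `backoff_total` is a hypothetical helper; substitute whatever the `retry.backoff` block actually sets.

```bash
#!/usr/bin/env bash
# Sum the waits between sync retries under capped exponential backoff.
# limit (5) and cap (180s) come from the runbook; base and factor are ASSUMED.
backoff_total() {
  local base=$1 factor=$2 cap=$3 limit=$4
  local wait=$base total=0 i
  for ((i = 1; i <= limit; i++)); do
    # Clamp each wait to the cap before adding it to the running total.
    if (( wait > cap )); then
      wait=$cap
    fi
    total=$(( total + wait ))
    wait=$(( wait * factor ))
  done
  echo "$total"
}

# Assumed 30s base, factor 2; runbook values: 180s cap, 5 retries.
backoff_total 30 2 180 5
```

With these assumed inputs the waits are 30, 60, 120, 180, and 180 seconds, so 570s of pure backoff; the time each sync attempt takes comes on top of that.
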
diff --git a/docs/runbooks/druid-operations.md b/docs/runbooks/druid-operations.md
@@ -0,0 +1,84 @@
+# Runbook — Druid Operations
+
+**Severity**: high for tenant-facing outages (query path down), medium for single-component degradation. **Scope**: the chart this repo owns outright — `catalog/druid/chart/` — deployed per tenant by the `druid-tenants` ApplicationSet (wave 50, one Application per `catalog/druid/tenants/<tenant>/` directory, namespace `druid-<tenant>`, never on the hub).
+
+The cluster is ZooKeeper-less (`druid.discovery.type=k8s` via druid-kubernetes-extensions, leader election over ConfigMaps) and TLS-only (`druid.enablePlaintextPort=false`). Components: coordinator, overlord, historical (StatefulSets), broker, router (Deployments); ingestion tasks run as Jobs launched by the overlord (druid-kubernetes-overlord-extensions). Each component gets its own Karpenter NodePool from the chart, backed by a shared EC2NodeClass.
+
+## Keystore secret rotation
+
+The PKCS#12 keystore/truststore password is shared by cert-manager (which encrypts the keystores it writes) and the Druid pods (which read them back). The chain:
+
+```
+AWS Secrets Manager <tenant keystore secret>  (Values.secrets.keystore, property "password")
+  → ExternalSecret <name>-keystore-password    (refreshInterval: 1h, ClusterSecretStore aws-secrets-manager)
+  → k8s Secret <name>-keystore-password
+  → cert-manager Certificate <name>-druid-tls  (keystores.pkcs12.passwordSecretRef)
+  → keystore.p12 / truststore.p12 in Secret <name>-druid-tls, mounted at /opt/druid/conf/druid/cluster/tls
+  → pods: DRUID_TLS_KEYSTORE_PASSWORD env + druid.server.https.keyStorePassword=${env:...}
+```
+
+Rotation procedure:
+
+1. Update the `password` property of the tenant's keystore secret in AWS Secrets Manager.
+2. Sync the ExternalSecret — wait for the 1h `refreshInterval` or force it:
+   ```bash
+   kubectl -n druid-<tenant> annotate externalsecret <name>-keystore-password \
+     force-sync=$(date +%s) --overwrite
+   kubectl -n druid-<tenant> get externalsecret <name>-keystore-password   # SecretSynced/Ready
+   ```
+3. Have cert-manager re-encrypt the keystores with the new password:
+   ```bash
+   cmctl renew <name>-druid-tls -n druid-<tenant>
+   # or: kubectl -n druid-<tenant> delete secret <name>-druid-tls   (cert-manager reissues)
+   ```
+4. Roll the pods. Druid reloads keystore *files* from disk every 180s (`reloadSslContextSeconds`), but the *password* env var is read once at process start — a rotation is not complete until every pod restarts:
+   ```bash
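+   # One selector-based restart covers all three StatefulSets (coordinator,
+   # overlord, historical); the loop variable was never used, so no loop is needed.
+   kubectl -n druid-<tenant> rollout restart statefulset -l app.kubernetes.io/name=druid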
+   kubectl -n druid-<tenant> rollout restart deployment
+   ```
+5. Order matters only in one place: do not restart pods between steps 2 and 3 — a pod starting with the new password while the TLS secret still holds keystores encrypted with the old one fails at JVM keystore load.
+
+The metadata/admin/system credentials rotate the same way (steps 1–2, then restart) minus the cert-manager step — they are plain env vars from their ExternalSecret-backed Secrets.
+
+## Probe semantics
+
+All five components probe over HTTPS `httpGet` (the kubelet skips certificate verification for HTTPS probes, so the chart's self-signed internal CA is fine; the endpoints are on Druid's unsecured-path list, so basic-auth doesn't block them):
+
+| Component | Port | Liveness | Readiness |
+|---|---|---|---|
+| coordinator | 8281 | `/status/health` | `/status/health` |
+| broker | 8282 | `/status/health` | `/druid/broker/v1/readiness` |
+| historical | 8283 | `/status/health` | `/druid/historical/v1/readiness` |
+| overlord | 8290 | `/status/health` | `/status/health` |
+| router | 9088 | `/status/health` | `/status/health` |
+
+Timing: startupProbe allows 60s initial delay + 60 failures × 10s ≈ **11 minutes to come up** before the kubelet starts killing; liveness/readiness then run at 10s periods with `initialDelaySeconds: 180`. What the two distinct readiness endpoints mean:
+
+- **Broker** `/druid/broker/v1/readiness` returns 503 until the broker has synced the full segment view from historicals — a broker that is alive but not ready is *correct* behavior during historical restarts; don't chase it.
+- **Historical** `/druid/historical/v1/readiness` returns 503 until all assigned segments are loaded from deep storage. Large tenants can hold readiness for a while after a reschedule — segment cache is an `emptyDir`, so every pod replacement re-pulls its assignment from S3.
+
+A pod stuck in `Running` but never `Ready` past those windows: check the JVM actually bound the TLS port (`kubectl logs` for keystore errors), then metadata connectivity (`DRUID_METADATA_STORAGE_*` env from the metadata ExternalSecret), then segment-load progress in the coordinator console.
+
+## Scaling
+
+Per-component `replicas` and `resources` live in the values layering: `catalog/druid/values.yaml` (base) → `catalog/druid/tenants/<tenant>/values.yaml` → `values-{env}.yaml`. Change them there and let ArgoCD sync — `selfHeal: true` reverts manual `kubectl scale` within minutes.
+
+- **Historical** — the usual scale-out target for query capacity. New replicas go Ready only after loading their segment assignment (see probes above); scale one step at a time on large tenants so rebalancing doesn't thundering-herd deep storage.
+- **Broker / router** — stateless; scale freely for query concurrency.
+- **Coordinator / overlord** — leader-elected; additional replicas are warm standbys, not capacity.
+- **Ingestion (task) capacity** — not replica-driven: the overlord launches task pods as Jobs from the task template. Capacity is governed by task tuning in the runtime properties and the task NodePool limits, not by scaling a StatefulSet.
+
+Nodes follow automatically: each component pins to its own Karpenter NodePool via nodeSelector, so scaling replicas provisions/consolidates EC2 without manual node work. If pods stay `Pending`, check the component's NodePool limits and Karpenter events before anything else.
+
+## Verification
+
+```bash
+argocd app get druid-<tenant>                          # Synced / Healthy
+kubectl -n druid-<tenant> get pods -o wide             # all Ready, spread across component pools
+kubectl -n druid-<tenant> get externalsecrets          # all SecretSynced
+kubectl -n druid-<tenant> get certificate              # <name>-druid-tls Ready
+```
+
+End-to-end: port-forward the router (9088) and run a trivial query through it — the router exercises broker discovery, the broker exercises historical TLS, and a 200 proves the whole mTLS mesh agrees on the keystore password.
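
The ordering constraint in the keystore rotation (force the ExternalSecret, renew the certificate, restart pods last) can be captured in a small wrapper. This is a sketch, not a script from this repo: `rotate_keystore`, `run`, and the `DRY_RUN` switch are hypothetical names, and `acme` / `druid` stand in for a real tenant and release name. It only fixes the order of the commands already listed in the rotation procedure; a real run would still need to confirm `SecretSynced` and a Ready Certificate between steps.

```bash
#!/usr/bin/env bash
# Sketch: emit the rotation commands in the order the runbook requires.
# DRY_RUN=1 (the default here) prints each command instead of running it.
set -euo pipefail

run() {
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

rotate_keystore() {
  local ns="druid-$1" name=$2
  # Step 2: force the ExternalSecret to pick up the new password.
  run kubectl -n "$ns" annotate externalsecret "${name}-keystore-password" \
    "force-sync=$(date +%s)" --overwrite
  # Step 3: have cert-manager re-encrypt the keystores.
  # (Outside dry-run, verify the ExternalSecret is synced before this point.)
  run cmctl renew "${name}-druid-tls" -n "$ns"
  # Step 4: only now restart pods, so none starts with a mismatched password.
  run kubectl -n "$ns" rollout restart statefulset -l app.kubernetes.io/name=druid
  run kubectl -n "$ns" rollout restart deployment
}

# Hypothetical tenant "acme" and release name "druid"; prints four commands.
rotate_keystore acme druid
```

Keeping the pod restarts at the end of one function makes it harder to restart by accident between the ExternalSecret sync and the certificate renewal, which is the one ordering the procedure cares about.
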
diff --git a/docs/runbooks/render-gate-failures.md b/docs/runbooks/render-gate-failures.md
@@ -0,0 +1,62 @@
+# Runbook — Render-Gate Failures on PRs
+
+**Severity**: low (nothing is deployed — CI blocked the merge, which is the gate doing its job). **Scope**: the `validate` job in `.github/workflows/ci.yml`, which renders every kustomize root (per environment: dev, staging, production, hub) plus the druid catalog chart with synthetic tenant values, then runs three gates over the *rendered* output.
+
+## Symptoms
+
+- A PR check fails on one of: `Zero-placeholder gate`, `Lint YAML`, or a step inside `Validate (<env>)` — render, "Assert no unfilled sentinels", "Schema gate (kubeconform)", or "Misconfiguration gate (trivy config)"
+- The PR summary comment shows a red row
+
+## Diagnosis
+
+Reproduce locally before reading CI logs twice — the Taskfile mirrors the pipeline:
+
+```bash
+task lint:yaml     # yamllint over the repo
+task render        # kustomize + helm render into rendered/ (incl. druid chart)
+task scan          # kubeconform + trivy config over rendered/
+task validate      # lint + build combined
+```
+
+Then map the failing step to what it actually checks:
+
+**Zero-placeholder / render-assert** — an unfilled sentinel (placeholder token, zero account id, account-less ARN) either in source files or appearing only after templating. The fix is always filling the real value; these gates exist precisely so a placeholder never reaches a cluster.
+
+**Render failure** — `kustomize build --enable-helm` or `helm template` errored. Usual causes: overlay missing its `kustomization.yaml`, a `values-{env}.yaml` absent for one of the four environments, or (druid) a template change that breaks under the synthetic `--set` values in the workflow. Note the matrix renders *every* environment — a change that renders fine in dev can still fail the hub leg.
+
+**Schema gate (kubeconform)** — strict mode, native kinds from the default kubernetes-json-schema location, CRD kinds from the datreeio CRDs-catalog, deliberately **no** `-ignore-missing-schemas`. Two failure shapes:
+
+- *"could not find schema for <Kind>"* — the kind is new to the repo and neither source knows it. Options, in order of preference: the CRD exists in the datreeio catalog under a different group/version (fix the manifest's apiVersion); contribute the schema upstream to the CRDs-catalog; or add an explicit `-skip <Kind>` in the workflow with a comment justifying it (the existing `-skip Grafana` shows the shape — skips are per-kind, commented, and rare).
+- *field/type errors* — a genuinely invalid manifest, or the catalog schema is stricter than the CRD actually deployed (the Grafana skip exists for exactly that mismatch). Verify against the real CRD before assuming the manifest is wrong.
+
+**Misconfiguration gate (trivy config)** — runs over `rendered/`, so every finding reflects post-templating truth with values applied. MEDIUM and above fails the build. The finding ID (`KSV-*`/`AVD-*`) plus the rendered file name tell you which addon and which check.
+
+## Remediation
+
+**Default: fix the manifest.** Most trivy findings have a direct fix — set the securityContext, add resources/probes, drop the capability, pin the tag. The gate severity floor is MEDIUM, so it isn't flagging trivia.
+
+**Exception path: a reasoned `.trivyignore.yaml` entry.** Legitimate only when the flagged configuration *is the component's contract* — the existing entries are the calibration: a device plugin must run as root and hostPath-mount the kubelet socket (KSV-0012/0023); Druid's ConfigMap keys named `*Password` hold `${env:...}` indirection, not secret material (KSV-0109/01010); the overlord's namespace-scoped Role manages Jobs because that is what the k8s task runner does (KSV-0042 et al.). An entry must have all three:
+
+```yaml
+- id: KSV-XXXX
+  paths:
+    - "*<rendered-file-glob-for-exactly-this-addon>*"
+  statement: >-
+    Why this configuration is the component's contract, and where the
+    compensating control lives if there is one.
+```
+
+- `paths` scoped to one addon's rendered output — never a bare `id` that suppresses the check repo-wide
+- `statement` that survives a cold review: what the finding flags, why it's intentional, compensating control (e.g. registry findings point at the Kyverno verify-images policy)
+- If you can't write that statement convincingly, it's a fix, not an ignore
+
+**kubeconform skips** follow the same discipline in the workflow file: per-kind, commented with the concrete schema/CRD mismatch, nothing broader.
+
+## Verification
+
+```bash
+task render && task scan     # clean locally
+git add <files by name> && git commit ...
+```
+
+Push and confirm all four `Validate (<env>)` matrix legs pass — the PR summary comment goes green. For a new `.trivyignore.yaml` entry, also confirm the gate still fails on *other* findings (the entry's `paths` glob should match only the intended rendered file; `grep <id> rendered/ -r` against the trivy output is a quick sanity check that you scoped it tightly).
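
The "all three fields" rule for `.trivyignore.yaml` entries lends itself to a mechanical pre-review check. The sketch below is a line-based approximation under stated assumptions: it expects the flat entry shape shown in the Remediation example (one `- id:` line per entry, with `paths:` and `statement:` keys beneath it), it is not a YAML parser, and `check_trivyignore` is a hypothetical helper rather than something the workflow runs.

```bash
#!/usr/bin/env bash
# Flag ignore entries that lack `paths` or `statement`.
# Prints one line per incomplete entry and returns non-zero if any exist.
check_trivyignore() {
  awk '
    function report() {
      if (id != "" && (!has_paths || !has_statement)) {
        printf "%s: missing%s%s\n", id, (has_paths ? "" : " paths"), (has_statement ? "" : " statement")
        bad = 1
      }
    }
    BEGIN { bad = 0 }
    # A new "- id:" line closes the previous entry and opens the next one.
    /^[[:space:]]*-[[:space:]]+id:/ { report(); id = $NF; has_paths = 0; has_statement = 0; next }
    /^[[:space:]]+paths:/           { has_paths = 1 }
    /^[[:space:]]+statement:/       { has_statement = 1 }
    END { report(); exit bad }
  ' "$1"
}

# Only runs when the file is present in the current directory.
if [[ -f .trivyignore.yaml ]]; then
  check_trivyignore .trivyignore.yaml
fi
```

A bare `- id: KSV-XXXX` with nothing under it is reported as missing both fields, which is the repo-wide suppression the runbook rules out. The check says nothing about whether a `paths` glob is scoped tightly enough or whether the `statement` is convincing; that part stays with the reviewer.
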