Schema + misconfig gates on rendered output; druid secrets, probes, hardening by stxkxs · Pull Request #85 · nanohype/eks-gitops

stxkxs · 2026-07-01T22:25:34Z

See the five commit messages for full details.

Summary

- CI now kubeconform-validates (strict, CRD schemas, no blanket ignore) and trivy-config-scans every rendered root as hard gates. The first run found 100 invalid manifests (GrafanaDashboard `revision: latest` against an integer-typed CRD field; duplicate label keys) plus druid RBAC and container-hardening findings, all fixed in this PR.
- The druid keystore password is sourced from Secrets Manager via an ExternalSecret; the rendered output contains zero occurrences of `changeit`, and `changeit` is added to the placeholder sentinels.
- Fixed en route: .gitignore was keeping externalsecret.yaml untracked, so the deployed chart shipped no ExternalSecrets at all.
- druid probes are real httpGet health checks (15 rendered, zero tcpSocket).

The bug: the druid.probes helper took (port, healthPath, readinessPath) parameters and then ignored both path arguments, emitting three identical tcpSocket probes. Every component's carefully chosen health endpoints (/status/health, /druid/broker/v1/readiness, /druid/historical/v1/readiness) were dead template arguments; a wedged process holding its listen socket open (JVM up, Jetty stuck, broker unaware of segments) always probed healthy.

Root cause: the helper templated only {{ $port }} into tcpSocket stanzas; the $healthPath / $readinessPath bindings were declared and never referenced.

The fix: emit httpGet probes wired to the passed paths (liveness and startup hit the health path, readiness hits the readiness path) with scheme HTTPS, because the cluster runs TLS-only (druid.enablePlaintextPort=false). The kubelet does not verify serving certificates on HTTPS probes, so the cert-manager-issued internal CA is fine, and Druid serves these endpoints unauthenticated (its default unsecured-path list) with client certificates requested but not required. All five processes (coordinator, overlord, broker, historical, router) expose /status/health, so no component needs a tcpSocket fallback; timings (thresholds, delays, periods) are unchanged.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
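For readers who want to picture the result, the rendered probes for the broker container might look roughly like the fragment below. This is a sketch, not the chart's actual output: the port number is an assumption (the commit message does not state it), and the unchanged timing fields are omitted. Only the paths and the HTTPS scheme come from the commit message.

```yaml
# Illustrative container-level probe stanzas for the broker.
# Paths and scheme follow the commit message; the port is an assumption.
livenessProbe:
  httpGet:
    path: /status/health              # health path
    port: 8282                        # assumed TLS port, not stated in the PR
    scheme: HTTPS                     # cluster is TLS-only
readinessProbe:
  httpGet:
    path: /druid/broker/v1/readiness  # component-specific readiness path
    port: 8282
    scheme: HTTPS
startupProbe:
  httpGet:
    path: /status/health              # startup reuses the health path
    port: 8282
    scheme: HTTPS
```

With httpGet, a process whose socket is open but whose HTTP server cannot answer fails the probe, which the earlier tcpSocket form could not detect.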

The bug: the chart shipped a static credential. certificate.yaml rendered a Kubernetes Secret with the literal `password: changeit` (the JVM keystore default), and the base values repeated the same literal six times across druid.server.https.* / druid.client.https.* keystore and truststore properties in the common runtime.properties. The result was a committed password guarding every keystore in the cluster's internal mTLS setup.

Root cause: the keystore password sits at the junction of two consumers: cert-manager (which encrypts the PKCS#12 keystore/truststore it writes into the TLS secret, via the Certificate's passwordSecretRef) and Druid (which needs the same password to open those stores). The shortcut was to pin a known constant both sides could see.

The fix, following the chart's existing secret pattern (metadata/admin/system already flow through External Secrets):

- externalsecret.yaml grows a fourth ExternalSecret that materialises <name>-keystore-password from AWS Secrets Manager (property `password`), addressed by the new values-required `secrets.keystore` field, which is empty in base values like its siblings and set per tenant.
- certificate.yaml drops the committed Secret; its passwordSecretRef now points at the ExternalSecret-owned target of the same name.
- druid.env exports the value as DRUID_TLS_KEYSTORE_PASSWORD on every component and task pod.
- The six runtime.properties literals become ${env:DRUID_TLS_KEYSTORE_PASSWORD}, the same env indirection the file already uses for the admin, system, and metadata credentials.

Related bug surfaced while staging this change: the blanket `*secret*.yaml` gitignore rule had kept externalsecret.yaml out of the repository entirely, so the chart ArgoCD pulls from git carried NO ExternalSecrets, and none of the druid credential secrets (metadata, admin, system) would ever materialise in-cluster. ExternalSecret manifests are references into the secret store, not secret material; .gitignore now carves them out (mirroring the existing `!*secret*-store*` exception) and the file is tracked.

Rendered output verified: `helm template` over the chart contains no `changeit` anywhere.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
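The wiring described above can be sketched as three rendered fragments. Everything named here is hypothetical unless the commit message states it: the resource names, the secret store, the remote key, the issuer, and the External Secrets API version are all assumptions made for illustration. The `password` property, the `passwordSecretRef` hand-off, and the DRUID_TLS_KEYSTORE_PASSWORD variable come from the commit message.

```yaml
# 1. ExternalSecret that materialises the keystore password (names are hypothetical).
apiVersion: external-secrets.io/v1beta1   # assumed API version
kind: ExternalSecret
metadata:
  name: druid-keystore-password           # stands in for <name>-keystore-password
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager             # hypothetical store name
  target:
    name: druid-keystore-password         # Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: druid/example-tenant/keystore  # hypothetical value of secrets.keystore
        property: password
---
# 2. Certificate pointing its PKCS#12 password at the ExternalSecret-owned Secret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: druid-tls                         # hypothetical
spec:
  secretName: druid-tls                   # hypothetical
  dnsNames:
    - druid.example.internal              # hypothetical
  issuerRef:
    kind: ClusterIssuer
    name: internal-ca                     # hypothetical
  keystores:
    pkcs12:
      create: true
      passwordSecretRef:
        name: druid-keystore-password
        key: password
---
# 3. Container env fragment exporting the same value to Druid.
env:
  - name: DRUID_TLS_KEYSTORE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: druid-keystore-password
        key: password
```

Both consumers now read one Secret that only exists in-cluster, so no password literal needs to appear in git or in the rendered chart.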

Three chart defects surfaced by running the new rendered-output gates (kubeconform strict + trivy config) over `helm template` of this chart:

─── Duplicate label key (kubeconform: strict-mode unmarshal error) ───

druid.component.match.labels emitted {domain}/name, which common.labels (included by druid.component.labels) already emits. Every Service selector and workload pod-template that combines the two helpers rendered the same map key twice, which is invalid YAML under any strict decoder. match.labels now carries only the {domain}/component key: it still uniquely selects each component's pods (which keep carrying it), while {domain}/name continues to reach every resource through common.labels. Selectors are safe to reshape because no tenants exist yet (catalog/druid/tenants/ is unpopulated), so nothing deployed holds the old immutable selector.

─── Wildcard RBAC verbs (trivy: KSV-0045, CRITICAL) ───

The namespace Role granted verbs: ["*"] on jobs, pods, pods/log, configmaps and secrets, a grant that silently widens as the API grows. Verbs are now enumerated (create/delete/get/list/patch/update/watch) for the same resources, which druid-kubernetes-extensions (pod discovery, ConfigMap announcements/leader election) and druid-kubernetes-overlord-extensions (ingestion tasks as Jobs, their pods and logs) genuinely need.

─── Missing container securityContext (trivy: KSV-0001/KSV-0104, MEDIUM) ───

Pod-level securityContext already pinned uid/gid 1000 + runAsNonRoot, but containers could still escalate privileges and ran unconfined by seccomp. A shared druid.container.security helper (allowPrivilegeEscalation: false, drop ALL capabilities, seccompProfile RuntimeDefault) now applies to all six containers (five components plus the task pod template). Safe for the JVM: no setuid binaries needed, and every listen port is above 1024.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
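As a rough illustration of the two hardening fixes, the rendered Role and the container-level securityContext might look like the fragments below. The Role name and the split into two rules by API group are assumptions; the resources, the enumerated verbs, and the three securityContext settings are taken from the commit message.

```yaml
# Role with enumerated verbs instead of ["*"] (name and apiGroups split are assumptions).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: druid-k8s-extensions   # hypothetical
rules:
  - apiGroups: [""]            # core group: pods, logs, configmaps, secrets
    resources: ["pods", "pods/log", "configmaps", "secrets"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["batch"]       # ingestion tasks run as Jobs
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
---
# Container-level fragment of the kind a shared helper could emit.
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```

Enumerating verbs keeps the grant fixed even if new verbs are added to these resources later, which is the concern the wildcard finding raises.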

…teger

The bug: all 24 GrafanaDashboard CRs pinned their grafana.com dashboard with `spec.grafanaCom.revision: latest`. The grafana-operator v1beta1 CRD types that field as integer, so the API server rejects every one of these manifests at admission; the string "latest" never survives server-side validation.

Root cause: the operator's actual "track latest" contract is to OMIT revision (it then resolves the newest revision from grafana.com); the string literal read like a supported keyword but was never valid against the schema. Nothing in CI validated rendered output against schemas, so the mistake sat in all 24 dashboards. This is exactly the class of bug the new kubeconform gate exists to catch, and it flagged every instance on its first run.

The fix: remove the revision line everywhere, keeping the pinned `id`. The intent is identical (latest revision of that dashboard id) and the manifests are now schema-valid.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
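A minimal sketch of a schema-valid dashboard after this change, assuming a hypothetical dashboard name, instance selector, and grafana.com id (none of these appear in the PR):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: example-dashboard        # hypothetical
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana        # hypothetical selector label
  grafanaCom:
    id: 1860                     # hypothetical grafana.com dashboard id
    # revision is intentionally omitted: per the commit message, the operator
    # then resolves the newest revision. "revision: latest" is a string and
    # fails the integer-typed schema.
```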

The validate job already rendered every kustomize root per environment with --enable-helm and asserted sentinels; this catalog of 40 addons still shipped with no schema or security scanner anywhere in CI. Two hard-fail gates now run against the rendered manifests, never the repo tree, so every check sees the post-templating truth with helm values applied.

─── Druid chart joins the rendered surface ───

The druid catalog chart deploys through the druid-tenants ApplicationSet (base values + per-tenant values), so the overlay render loop never touched it. The validate job now `helm template`s it with synthetic tenant values into rendered/, putting the chart's templates through the same render-assert, schema, and misconfiguration gates as every kustomize root. `task render` mirrors this locally.

─── Schema gate: kubeconform (strict) ───

Native kinds resolve from the default kubernetes-json-schema location, CRD kinds from the datreeio CRDs-catalog. Deliberately NO -ignore-missing-schemas: a kind neither source knows fails the build until it gets a schema or an explicit, justified -skip. The single skip is the Grafana kind: the one external Grafana CR ships spec.external.url empty by design (the dashboards ApplicationSet injects the per-cluster Amazon Managed Grafana URL via its cluster-generator patch) and the catalog schema is stricter than the upstream CRD. The binary and downloaded schemas are cached across runs (restore-keys prefix, so fresh schemas persist).

─── Misconfiguration gate: trivy config ───

MEDIUM and above hard-fails. Scoped, justified exceptions live in .trivyignore.yaml, each pinned to one rendered file with a statement: the neuron device plugin's root user and kubelet-socket hostPath (the device-plugin contract), druid's k8s-extensions Role and env-indirected password keys, and the registry-allowlist check (image provenance is enforced cluster-side by the Kyverno verify-images policy). Anything new fails the gate until fixed or added with a reason. The trivy binary is pinned and cached via setup-trivy.

─── Local parity + source gate ───

- `task scan` runs both gates over rendered/ with identical flags
- scripts/no-placeholders.sh adds `changeit` (the JVM keystore default password) to the sentinel list, so the class of committed credential this work removed from the druid chart cannot come back
- CLAUDE.md documents the new commands and CI shape

Verified locally over all four environments (dev/staging/production/hub): render → sentinel assert → kubeconform (111/111, 98/98 valid, 1 skipped Grafana) → trivy (exit 0 with the ignorefile, exit 1 without it), plus negative tests proving both gates fail on an invalid Deployment, an unknown kind, and unignored findings.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
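The two gates described above could be expressed as workflow steps along these lines. This is a sketch under assumptions, not the repository's actual workflow: the step names, the exact flag set, and the CRDs-catalog URL template are illustrative, and the install and caching steps are left out. The flags themselves (`-strict`, `-schema-location`, `-skip`, `--severity`, `--exit-code`, `--ignorefile`) are standard kubeconform and trivy options.

```yaml
# Hypothetical GitHub Actions steps; assumes rendered/ already holds the manifests.
- name: Schema gate (kubeconform, strict)
  run: |
    # No -ignore-missing-schemas: an unknown kind fails the build.
    kubeconform -strict -summary \
      -schema-location default \
      -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
      -skip Grafana \
      rendered/

- name: Misconfiguration gate (trivy config)
  run: |
    # MEDIUM and above fail the job; scoped exceptions come from the ignore file.
    trivy config \
      --severity MEDIUM,HIGH,CRITICAL \
      --exit-code 1 \
      --ignorefile .trivyignore.yaml \
      rendered/
```

Running both commands against rendered/ rather than the source tree means Helm templating and kustomize patches are already applied when the checks run.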

github-actions · 2026-07-01T22:26:18Z

CI Results

Check	Status
YAML Lint	✅ success
Render + assert + schema + misconfig (all environments)	✅ success

All checks passed.

stxkxs and others added 5 commits July 1, 2026 15:21

stxkxs marked this pull request as ready for review July 1, 2026 22:58

stxkxs merged commit 4e6562f into main Jul 1, 2026
9 checks passed

stxkxs merged 5 commits into main from render-gates
