Schema + misconfig gates on rendered output; druid secrets, probes, hardening #85
Merged
The bug: the druid.probes helper took (port, healthPath, readinessPath)
parameters and then ignored both path arguments, emitting three identical
tcpSocket probes. Every component's carefully chosen health endpoints —
/status/health, /druid/broker/v1/readiness, /druid/historical/v1/readiness —
were dead template arguments; a wedged process holding its listen socket
open (JVM up, Jetty stuck, broker unaware of segments) always probed healthy.
Root cause: the helper templated only {{ $port }} into tcpSocket stanzas; the
$healthPath / $readinessPath bindings were declared and never referenced.
The fix: emit httpGet probes wired to the passed paths — liveness and startup
hit the health path, readiness hits the readiness path — with scheme HTTPS,
because the cluster runs TLS-only (druid.enablePlaintextPort=false). The
kubelet does not verify serving certificates on HTTPS probes, so the
cert-manager-issued internal CA is fine, and Druid serves these endpoints
unauthenticated (its default unsecured-path list) with client certificates
requested but not required. All five processes (coordinator, overlord,
broker, historical, router) expose /status/health, so no component needs a
tcpSocket fallback; timings (thresholds, delays, periods) are unchanged.
Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
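To make the probe fix above concrete, here is a minimal sketch of what the rendered probes for one component (the broker) might look like after the change. The paths and the HTTPS scheme come from the commit message; the port number (8282, Druid's default broker TLS port) and every timing value are assumptions for illustration, not values read from the chart.

```yaml
# Illustrative rendered output for the broker container.
# ASSUMPTIONS: port 8282 and all period/threshold values are placeholders;
# the commit only states that the existing timings are unchanged.
livenessProbe:
  httpGet:
    path: /status/health               # health path (liveness + startup)
    port: 8282
    scheme: HTTPS                      # cluster is TLS-only
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /druid/broker/v1/readiness   # readiness path for the broker
    port: 8282
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /status/health
    port: 8282
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 30
```

With httpGet probes, a process that holds its socket open but cannot answer the HTTP request fails the probe, which is the case the old tcpSocket probes could not detect.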
The bug: the chart shipped a static credential. certificate.yaml rendered a
Kubernetes Secret with the literal `password: changeit` (the JVM keystore
default), and the base values repeated the same literal six times across
druid.server.https.* / druid.client.https.* keystore and truststore
properties in the common runtime.properties — a committed password guarding
every keystore in the cluster's internal mTLS setup.
Root cause: the keystore password sits at the junction of two consumers —
cert-manager (which encrypts the PKCS#12 keystore/truststore it writes into
the TLS secret, via the Certificate's passwordSecretRef) and Druid (which
needs the same password to open those stores) — and the shortcut was to pin
a known constant both sides could see.
The fix, following the chart's existing secret pattern (metadata/admin/system
already flow through External Secrets):
- externalsecret.yaml grows a fourth ExternalSecret that materialises
<name>-keystore-password from AWS Secrets Manager (property `password`),
addressed by the new values-required `secrets.keystore` field — empty in
base values like its siblings, set per tenant.
- certificate.yaml drops the committed Secret; its passwordSecretRef now
points at the ExternalSecret-owned target of the same name.
- druid.env exports the value as DRUID_TLS_KEYSTORE_PASSWORD on every
component and task pod.
- The six runtime.properties literals become
${env:DRUID_TLS_KEYSTORE_PASSWORD}, the same env indirection the file
already uses for the admin, system, and metadata credentials.
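The wiring described in the list above can be sketched roughly as follows. This is a hedged illustration, not the chart's actual templates: every resource name, the store reference, the remote key, the issuer, and the ExternalSecret apiVersion are assumptions. Only the `password` property, the passwordSecretRef link, and the DRUID_TLS_KEYSTORE_PASSWORD variable name come from the commit message.

```yaml
# ASSUMPTIONS: names, store, remote key, issuer and apiVersions are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: druid-keystore-password          # stands in for <name>-keystore-password
spec:
  secretStoreRef:
    kind: ClusterSecretStore             # assumption
    name: aws-secrets-manager            # hypothetical store name
  target:
    name: druid-keystore-password        # Secret that cert-manager and Druid both read
  data:
    - secretKey: password
      remoteRef:
        key: example/druid/keystore      # stands in for the secrets.keystore value
        property: password
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: druid-tls                        # hypothetical
spec:
  secretName: druid-tls
  dnsNames:
    - druid.example.svc                  # hypothetical
  issuerRef:
    kind: ClusterIssuer                  # assumption
    name: internal-ca                    # hypothetical
  keystores:
    pkcs12:
      create: true
      passwordSecretRef:                 # no committed Secret; points at the target above
        name: druid-keystore-password
        key: password
---
# Fragment of a container spec: how the env export could look.
env:
  - name: DRUID_TLS_KEYSTORE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: druid-keystore-password
        key: password
```

Both consumers now resolve the same Secret at runtime, so the password never appears in the rendered chart.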
Related bug surfaced while staging this change: the blanket `*secret*.yaml`
gitignore rule had kept externalsecret.yaml out of the repository entirely,
so the chart ArgoCD pulls from git carried NO ExternalSecrets — none of the
druid credential secrets (metadata, admin, system) would ever materialise
in-cluster. ExternalSecret manifests are references into the secret store,
not secret material; .gitignore now carves them out (mirroring the existing
`!*secret*-store*` exception) and the file is tracked.
Rendered output verified: `helm template` over the chart contains no
`changeit` anywhere.
Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
Three chart defects surfaced by running the new rendered-output gates
(kubeconform strict + trivy config) over `helm template` of this chart:
─── Duplicate label key (kubeconform: strict-mode unmarshal error) ───
druid.component.match.labels emitted {domain}/name, which common.labels
(included by druid.component.labels) already emits — every Service selector
and workload pod-template that combines the two helpers rendered the same
map key twice, invalid YAML under any strict decoder. match.labels now
carries only the {domain}/component key: still uniquely selects each
component's pods (which keep carrying it), while {domain}/name continues to
reach every resource through common.labels. Selectors are safe to reshape —
no tenants exist yet (catalog/druid/tenants/ is unpopulated), so nothing
deployed holds the old immutable selector.
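As a rough picture of the label fix above, the rendered output might end up shaped like this. `example.com` is a stand-in for the chart's {domain} prefix, and the label values and Service name are hypothetical.

```yaml
# ASSUMPTIONS: example.com replaces {domain}; names and values are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: druid-broker
  labels:
    example.com/name: druid              # from common.labels
    example.com/component: broker        # from match.labels
spec:
  selector:
    example.com/component: broker        # match.labels alone: one key, no duplicate
  ports:
    - name: https
      port: 8282                         # assumption
```

Each key now appears once per map, so a strict YAML decoder accepts the document.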
─── Wildcard RBAC verbs (trivy: KSV-0045, CRITICAL) ───
The namespace Role granted verbs: ["*"] on jobs, pods, pods/log, configmaps
and secrets — a grant that silently widens as the API grows. Verbs are now
enumerated (create/delete/get/list/patch/update/watch) for the same
resources, which druid-kubernetes-extensions (pod discovery, ConfigMap
announcements/leader election) and druid-kubernetes-overlord-extensions
(ingestion tasks as Jobs, their pods and logs) genuinely need.
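A sketch of a Role with the enumerated verbs described above. The verb list and resources come from the commit message; the Role name, namespace, and the split into two rules (jobs live in the `batch` API group, the rest in the core group) are assumptions about how the chart might lay it out.

```yaml
# ASSUMPTIONS: metadata and the two-rule layout are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: druid                  # hypothetical
  namespace: druid-example     # hypothetical
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "secrets"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
```

Unlike `verbs: ["*"]`, this list does not pick up new verbs (for example `deletecollection`) unless someone adds them deliberately.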
─── Missing container securityContext (trivy: KSV-0001/KSV-0104, MEDIUM) ───
Pod-level securityContext already pinned uid/gid 1000 + runAsNonRoot, but
containers could still escalate privileges and ran unconfined by seccomp. A
shared druid.container.security helper — allowPrivilegeEscalation: false,
drop ALL capabilities, seccompProfile RuntimeDefault — now applies to all
six containers (five components plus the task pod template). Safe for the
JVM: no setuid binaries needed, and every listen port is above 1024.
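For reference, the settings named above would render roughly as the following pod spec fragment. The field values are those stated in the commit message; the container name is hypothetical.

```yaml
# Pod-level context (already present before this change).
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  runAsNonRoot: true
containers:
  - name: druid                          # hypothetical container name
    # Container-level context emitted by the shared helper.
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
```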
Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
…teger
The bug: all 24 GrafanaDashboard CRs pinned their grafana.com dashboard with
`spec.grafanaCom.revision: latest`. The grafana-operator v1beta1 CRD types
that field as integer, so the API server rejects every one of these manifests
at admission: the string "latest" never survives server-side validation.
Root cause: the operator's actual "track latest" contract is to OMIT revision
(it then resolves the newest revision from grafana.com); the string literal
read like a supported keyword but was never valid against the schema. Nothing
in CI validated rendered output against schemas, so the mistake sat in all 24
dashboards. This is exactly the class of bug the new kubeconform gate exists
to catch, and it flagged every instance on its first run.
The fix: remove the revision line everywhere, keeping the pinned `id`. The
intent is identical (latest revision of that dashboard id) and the manifest
is now schema-valid.
Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
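A minimal sketch of a schema-valid dashboard after the fix described above. The resource name, the instance selector label, and the dashboard id (1860) are illustrative assumptions; the point is only that `revision` is absent rather than set to a string.

```yaml
# ASSUMPTIONS: name, selector label and id are placeholders.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: example-dashboard
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana        # hypothetical label
  grafanaCom:
    id: 1860                     # example grafana.com dashboard id
    # revision intentionally omitted: per the commit message, the operator
    # then resolves the newest revision. If set, it must be an integer.
```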
The validate job already rendered every kustomize root per environment with
--enable-helm and asserted sentinels; this catalog of 40 addons still shipped
with no schema or security scanner anywhere in CI. Two hard-fail gates now run
against the rendered manifests (never the repo tree), so every check sees the
post-templating truth with helm values applied.
─── Druid chart joins the rendered surface ───
The druid catalog chart deploys through the druid-tenants ApplicationSet
(base values + per-tenant values), so the overlay render loop never touched
it. The validate job now `helm template`s it with synthetic tenant values
into rendered/, putting the chart's templates through the same render-assert,
schema, and misconfiguration gates as every kustomize root. `task render`
mirrors this locally.
─── Schema gate: kubeconform (strict) ───
Native kinds resolve from the default kubernetes-json-schema location, CRD
kinds from the datreeio CRDs-catalog. Deliberately NO -ignore-missing-schemas:
a kind neither source knows fails the build until it gets a schema or an
explicit, justified -skip. The single skip is the Grafana kind: the one
external Grafana CR ships spec.external.url empty by design (the dashboards
ApplicationSet injects the per-cluster Amazon Managed Grafana URL via its
cluster-generator patch) and the catalog schema is stricter than the upstream
CRD. The binary and downloaded schemas are cached across runs (restore-keys
prefix, so fresh schemas persist).
─── Misconfiguration gate: trivy config ───
MEDIUM and above hard-fails. Scoped, justified exceptions live in
.trivyignore.yaml, each pinned to one rendered file with a statement: the
neuron device plugin's root user and kubelet-socket hostPath (the
device-plugin contract), druid's k8s-extensions Role and env-indirected
password keys, and the registry-allowlist check (image provenance is enforced
cluster-side by the Kyverno verify-images policy). Anything new fails the
gate until fixed or added with a reason. The trivy binary is pinned and
cached via setup-trivy.
─── Local parity + source gate ───
- `task scan` runs both gates over rendered/ with identical flags
- scripts/no-placeholders.sh adds `changeit` (the JVM keystore default
password) to the sentinel list, so the class of committed credential this
work removed from the druid chart cannot come back
- CLAUDE.md documents the new commands and CI shape
Verified locally over all four environments (dev/staging/production/hub):
render → sentinel assert → kubeconform (111/111, 98/98 valid, 1 skipped
Grafana) → trivy (exit 0 with the ignorefile, exit 1 without it), plus
negative tests proving both gates fail on an invalid Deployment, an unknown
kind, and unignored findings.
Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
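The two gates described in this commit could be wired into a CI job along the following lines. This is a sketch under stated assumptions, not the repository's actual workflow: the job and step names are hypothetical, the render, caching, and tool-install steps are omitted, and kubeconform and trivy are assumed to already be on PATH. The flags shown are standard kubeconform and trivy options, chosen to match the behaviour the commit message describes.

```yaml
# ASSUMPTIONS: workflow/job/step names are hypothetical; rendering into
# rendered/ and installing kubeconform + trivy happen in omitted steps.
name: validate
on: pull_request
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Schema gate (kubeconform, strict)
        # No -ignore-missing-schemas: unknown kinds fail the build.
        run: |
          kubeconform -strict -summary \
            -schema-location default \
            -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
            -skip Grafana \
            rendered/
      - name: Misconfiguration gate (trivy config)
        # MEDIUM and above fail the job; exceptions live in the ignore file.
        run: |
          trivy config --severity MEDIUM,HIGH,CRITICAL --exit-code 1 \
            --ignorefile .trivyignore.yaml rendered/
```

Running both commands against rendered/ rather than the source tree is what lets the gates see values-dependent output such as the probe, label, and securityContext fixes in the earlier commits.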
CI Results
All checks passed.
See the five commit messages for full details.
Summary