Schema + misconfig gates on rendered output; druid secrets, probes, hardening #85

Merged: stxkxs merged 5 commits into main from render-gates on Jul 1, 2026

@stxkxs stxkxs commented Jul 1, 2026

See the five commit messages for full details.

Summary

  • CI now validates every rendered root with kubeconform (strict, CRD schemas, no blanket ignore) and scans it with trivy config, both as hard gates. The first run surfaced 100 invalid manifests (GrafanaDashboard revision: latest against an integer-typed CRD field; duplicate label keys) plus druid RBAC and container hardening gaps, all fixed in this PR
  • druid keystore password sourced from Secrets Manager via ExternalSecret; zero changeit in rendered output; changeit added to the placeholder sentinels
  • Fixed en route: .gitignore was keeping externalsecret.yaml untracked — the deployed chart shipped no ExternalSecrets at all
  • druid probes are real httpGet health checks (15 rendered, zero tcpSocket)

stxkxs and others added 5 commits July 1, 2026 15:21
The bug: the druid.probes helper took (port, healthPath, readinessPath)
parameters and then ignored both path arguments, emitting three identical
tcpSocket probes. Every component's carefully chosen health endpoints —
/status/health, /druid/broker/v1/readiness, /druid/historical/v1/readiness —
were dead template arguments; a wedged process holding its listen socket
open (JVM up, Jetty stuck, broker unaware of segments) always probed healthy.

Root cause: the helper templated only {{ $port }} into tcpSocket stanzas; the
$healthPath / $readinessPath bindings were declared and never referenced.

The fix: emit httpGet probes wired to the passed paths — liveness and startup
hit the health path, readiness hits the readiness path — with scheme HTTPS,
because the cluster runs TLS-only (druid.enablePlaintextPort=false). The
kubelet does not verify serving certificates on HTTPS probes, so the
cert-manager-issued internal CA is fine, and Druid serves these endpoints
unauthenticated (its default unsecured-path list) with client certificates
requested but not required. All five processes (coordinator, overlord,
broker, historical, router) expose /status/health, so no component needs a
tcpSocket fallback; timings (thresholds, delays, periods) are unchanged.
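
A minimal sketch of what the rendered probes could look like for the broker
after this change. The port (8282, Druid's default broker TLS port) is an
assumption, and the unchanged timing fields are omitted:

```yaml
# Illustrative only: port is assumed; thresholds, delays, and periods omitted.
livenessProbe:
  httpGet:
    path: /status/health              # health path
    port: 8282
    scheme: HTTPS                     # kubelet skips certificate verification
readinessProbe:
  httpGet:
    path: /druid/broker/v1/readiness  # readiness path
    port: 8282
    scheme: HTTPS
startupProbe:
  httpGet:
    path: /status/health              # startup reuses the health path
    port: 8282
    scheme: HTTPS
```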

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
The bug: the chart shipped a static credential. certificate.yaml rendered a
Kubernetes Secret with the literal `password: changeit` (the JVM keystore
default), and the base values repeated the same literal six times across
druid.server.https.* / druid.client.https.* keystore and truststore
properties in the common runtime.properties — a committed password guarding
every keystore in the cluster's internal mTLS setup.

Root cause: the keystore password sits at the junction of two consumers —
cert-manager (which encrypts the PKCS#12 keystore/truststore it writes into
the TLS secret, via the Certificate's passwordSecretRef) and Druid (which
needs the same password to open those stores) — and the shortcut was to pin
a known constant both sides could see.

The fix, following the chart's existing secret pattern (metadata/admin/system
already flow through External Secrets):

- externalsecret.yaml grows a fourth ExternalSecret that materialises
  <name>-keystore-password from AWS Secrets Manager (property `password`),
  addressed by the new values-required `secrets.keystore` field — empty in
  base values like its siblings, set per tenant.
- certificate.yaml drops the committed Secret; its passwordSecretRef now
  points at the ExternalSecret-owned target of the same name.
- druid.env exports the value as DRUID_TLS_KEYSTORE_PASSWORD on every
  component and task pod.
- The six runtime.properties literals become
  ${env:DRUID_TLS_KEYSTORE_PASSWORD}, the same env indirection the file
  already uses for the admin, system, and metadata credentials.
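
The wiring described in the list above might render roughly as follows. This
is a sketch under assumptions: the release name `druid`, the store name, the
Secrets Manager key, the issuer, and the DNS name are all hypothetical, and
the ExternalSecret apiVersion depends on the installed operator:

```yaml
# Sketch only; every name below is hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: druid-keystore-password
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager               # hypothetical store
  target:
    name: druid-keystore-password           # Secret read by cert-manager and Druid
  data:
    - secretKey: password
      remoteRef:
        key: tenants/example/druid/keystore # value of secrets.keystore (hypothetical)
        property: password
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: druid-tls
spec:
  secretName: druid-tls
  dnsNames:
    - druid.example.internal                # hypothetical
  issuerRef:
    kind: Issuer
    name: druid-internal-ca                 # hypothetical
  keystores:
    pkcs12:
      create: true
      passwordSecretRef:                    # now the ExternalSecret-owned target
        name: druid-keystore-password
        key: password
---
# Container env entry of the kind druid.env would export (shape assumed).
env:
  - name: DRUID_TLS_KEYSTORE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: druid-keystore-password
        key: password
```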

Related bug surfaced while staging this change: the blanket `*secret*.yaml`
gitignore rule had kept externalsecret.yaml out of the repository entirely,
so the chart ArgoCD pulls from git carried NO ExternalSecrets — none of the
druid credential secrets (metadata, admin, system) would ever materialise
in-cluster. ExternalSecret manifests are references into the secret store,
not secret material; .gitignore now carves them out (mirroring the existing
`!*secret*-store*` exception) and the file is tracked.

Rendered output verified: `helm template` over the chart contains no
`changeit` anywhere.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
Three chart defects surfaced by running the new rendered-output gates
(kubeconform strict + trivy config) over `helm template` of this chart:

─── Duplicate label key (kubeconform: strict-mode unmarshal error) ───

druid.component.match.labels emitted {domain}/name, which common.labels
(included by druid.component.labels) already emits — every Service selector
and workload pod-template that combines the two helpers rendered the same
map key twice, invalid YAML under any strict decoder. match.labels now
carries only the {domain}/component key: still uniquely selects each
component's pods (which keep carrying it), while {domain}/name continues to
reach every resource through common.labels. Selectors are safe to reshape —
no tenants exist yet (catalog/druid/tenants/ is unpopulated), so nothing
deployed holds the old immutable selector.
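
To illustrate the defect and the fix, here is a sketch of a rendered labels
map. The domain `example.io` stands in for {domain} and the label values are
hypothetical:

```yaml
# Before (rejected by strict decoders): both helpers emitted the name key.
#   labels:
#     example.io/name: druid
#     example.io/component: broker
#     example.io/name: druid          # duplicate map key
#
# After: match.labels contributes only the component key.
labels:
  example.io/name: druid              # from common.labels
  example.io/component: broker        # from druid.component.match.labels
```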

─── Wildcard RBAC verbs (trivy: KSV-0045, CRITICAL) ───

The namespace Role granted verbs: ["*"] on jobs, pods, pods/log, configmaps
and secrets — a grant that silently widens as the API grows. Verbs are now
enumerated (create/delete/get/list/patch/update/watch) for the same
resources, which druid-kubernetes-extensions (pod discovery, ConfigMap
announcements/leader election) and druid-kubernetes-overlord-extensions
(ingestion tasks as Jobs, their pods and logs) genuinely need.
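
A sketch of the enumerated Role, assuming the rules are split by API group
(core versus batch); the Role name and the rule grouping are assumptions, the
resources and verbs are the ones listed above:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: druid                         # hypothetical name
rules:
  - apiGroups: [""]                   # core API group
    resources: ["pods", "pods/log", "configmaps", "secrets"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
```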

─── Missing container securityContext (trivy: KSV-0001/KSV-0104, MEDIUM) ───

Pod-level securityContext already pinned uid/gid 1000 + runAsNonRoot, but
containers could still escalate privileges and ran unconfined by seccomp. A
shared druid.container.security helper — allowPrivilegeEscalation: false,
drop ALL capabilities, seccompProfile RuntimeDefault — now applies to all
six containers (five components plus the task pod template). Safe for the
JVM: no setuid binaries needed, and every listen port is above 1024.
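
The container-level stanza that such a helper emits would look roughly like
this (a sketch of the three settings named above, not the chart's exact
output):

```yaml
# Container-level securityContext; complements the pod-level uid/gid settings.
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```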

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
…teger

The bug: all 24 GrafanaDashboard CRs pinned their grafana.com dashboard with
`spec.grafanaCom.revision: latest`. The grafana-operator v1beta1 CRD types
that field as integer, so the API server rejects every one of these
manifests at admission — the string "latest" never survives server-side
validation.

Root cause: the operator's actual "track latest" contract is to OMIT
revision (it then resolves the newest revision from grafana.com); the string
literal read like a supported keyword but was never valid against the
schema. Nothing in CI validated rendered output against schemas, so the
mistake sat in all 24 dashboards — this is exactly the class of bug the new
kubeconform gate exists to catch, and it flagged every instance on its
first run.

The fix: remove the revision line everywhere, keeping the pinned `id` —
identical intent (latest revision of that dashboard id), now schema-valid.
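
A sketch of a schema-valid dashboard after the fix. The resource name,
selector labels, and dashboard id are hypothetical examples:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: node-exporter                 # hypothetical
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana             # hypothetical selector
  grafanaCom:
    id: 1860                          # hypothetical dashboard id
    # revision intentionally omitted: the operator resolves the newest one.
    # An explicit pin would have to be an integer, e.g. revision: 37
```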

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>
The validate job already rendered every kustomize root per environment with
--enable-helm and asserted sentinels; this catalog of 40 addons still shipped
with no schema or security scanner anywhere in CI. Two hard-fail gates now
run against the rendered manifests — never the repo tree — so every check
sees the post-templating truth with helm values applied.

─── Druid chart joins the rendered surface ───

The druid catalog chart deploys through the druid-tenants ApplicationSet
(base values + per-tenant values), so the overlay render loop never touched
it. The validate job now `helm template`s it with synthetic tenant values
into rendered/, putting the chart's templates through the same
render-assert, schema, and misconfiguration gates as every kustomize root.
`task render` mirrors this locally.

─── Schema gate: kubeconform (strict) ───

Native kinds resolve from the default kubernetes-json-schema location, CRD
kinds from the datreeio CRDs-catalog. Deliberately NO -ignore-missing-schemas:
a kind neither source knows fails the build until it gets a schema or an
explicit, justified -skip. The single skip is the Grafana kind — the one
external Grafana CR ships spec.external.url empty by design (the dashboards
ApplicationSet injects the per-cluster Amazon Managed Grafana URL via its
cluster-generator patch) and the catalog schema is stricter than the
upstream CRD. The binary and downloaded schemas are cached across runs
(restore-keys prefix, so fresh schemas persist).
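
A sketch of how the schema gate could be expressed as a workflow step. The
step shape and the rendered/ path layout are assumptions; the flags are
kubeconform's own, and the CRDs-catalog URL template is the one its
documentation suggests:

```yaml
# Hypothetical CI step; no -ignore-missing-schemas on purpose.
- name: Schema gate (kubeconform, strict)
  run: |
    kubeconform -strict -summary \
      -schema-location default \
      -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
      -skip Grafana \
      rendered/
```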

─── Misconfiguration gate: trivy config ───

Findings at MEDIUM severity and above hard-fail the build. Scoped, justified exceptions live in
.trivyignore.yaml, each pinned to one rendered file with a statement:
the neuron device plugin's root user and kubelet-socket hostPath (the
device-plugin contract), druid's k8s-extensions Role and env-indirected
password keys, and the registry-allowlist check (image provenance is
enforced cluster-side by the Kyverno verify-images policy). Anything new
fails the gate until fixed or added with a reason. The trivy binary is
pinned and cached via setup-trivy.
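
A sketch of the gate invocation and of one ignorefile entry. The step shape,
the check id (assumed here to be the runs-as-root check), and the file path
are hypothetical; `paths` and `statement` are fields of trivy's YAML
ignorefile format:

```yaml
# Hypothetical CI step.
- name: Misconfiguration gate (trivy config)
  run: |
    trivy config --severity MEDIUM,HIGH,CRITICAL --exit-code 1 \
      --ignorefile .trivyignore.yaml rendered/
---
# .trivyignore.yaml (one illustrative entry; id and path are assumptions)
misconfigurations:
  - id: KSV-0012
    paths:
      - "rendered/production/neuron-device-plugin.yaml"
    statement: Device-plugin contract requires root and the kubelet socket hostPath.
```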

─── Local parity + source gate ───

- `task scan` runs both gates over rendered/ with identical flags
- scripts/no-placeholders.sh adds `changeit` (the JVM keystore default
  password) to the sentinel list, so the class of committed credential this
  work removed from the druid chart cannot come back
- CLAUDE.md documents the new commands and CI shape

Verified locally over all four environments (dev/staging/production/hub):
render → sentinel assert → kubeconform (111/111, 98/98 valid, 1 skipped
Grafana) → trivy (exit 0 with the ignorefile, exit 1 without it), plus
negative tests proving both gates fail on an invalid Deployment, an unknown
kind, and unignored findings.

Co-authored-by: stxkxsbot <275011021+stxkxsbot@users.noreply.github.com>

github-actions Bot commented Jul 1, 2026

CI Results

| Check | Status |
| --- | --- |
| YAML Lint | ✅ success |
| Render + assert + schema + misconfig (all environments) | ✅ success |

All checks passed.

@stxkxs stxkxs marked this pull request as ready for review July 1, 2026 22:58
@stxkxs stxkxs merged commit 4e6562f into main Jul 1, 2026
9 checks passed