Skip to content

fix: rc10 consolidation — auth bypass, fail-open gateway, deploy hardening, projections, data integrity#76

Merged
PaulDotterer merged 11 commits into
mainfrom
fix/rc10-consolidation
Apr 23, 2026
Merged

fix: rc10 consolidation — auth bypass, fail-open gateway, deploy hardening, projections, data integrity#76
PaulDotterer merged 11 commits into
mainfrom
fix/rc10-consolidation

Conversation

@PaulDotterer
Copy link
Copy Markdown
Contributor

@PaulDotterer PaulDotterer commented Apr 23, 2026

Summary

Large consolidation PR addressing findings from the rc10 audit sweep. Breaking changes allowed per scope decision. Organized by theme:

Security-critical fixes

  • Close CONTROL_PASSWORD_AUTH_ENABLED bypass. The global flag previously only gated the SSO handler's `ListAuthMethods` response (a UI hint); the Login RPC itself never checked it, so direct POST-based password login succeeded against any account with `has_password=true` even in SSO-only deployments. Now enforced in `AuthHandler.Login` before the DB lookup, burning the dummy bcrypt cycle on the disabled path so timing doesn't leak whether password login is enabled. Regression test: `TestLogin_GlobalPasswordAuthDisabled`.
  • Terminal tokens are now single-use. `TokenStore.Validate` uses a new atomic `GetAndDelete` backend primitive (Valkey GETDEL, in-memory CAS on the fake). A leaked bearer can no longer replay a second WebSocket within the TTL. Forgery attempts restore the unused entry with remaining TTL so a legitimate client isn't locked out by a guess. Regression tests: `TestTokenStore_Validate_IsSingleUse`, `TestTokenStore_Validate_MismatchPreservesSession`.

Fail-open audit (gateway startup)

Six `os.Exit(1)` paths converted to `Warn`+feature-disabled so a malformed optional knob no longer takes agent mTLS down:

  • Malformed `PublicTerminalURLTemplate` → terminal disabled
  • `registry.RegisterGateway` failure → terminal disabled for replica
  • `routableIP()` failure for `GATEWAY_INTERNAL_URL` → admin fan-out disabled
  • `routableIP()` failure for Traefik backend → skip publish (don't poison Redis KV with empty backend)
  • `PublishTraefikRoute` failure at startup → rely on refresh goroutine's retry cadence (the crash loop was exactly the rc8 failure mode)
  • `BootstrapHost` without derivable assignedHost → disable middleware instead of crashing

Remaining `os.Exit` paths (TLS cert/key/CA flags, Valkey/Control URL missing, cert parse, `tls.Config`, `http2.ConfigureServer`) stay fatal — gateway cannot serve primary mTLS role without them.

Data integrity

  • LUKS three-phase audit stream-ID coherence. New sqlc query `GetLuksRevocationStreamID` looks up the `luks_key` event-stream ID minted at request time so the async Revoked/Failed event lands on the SAME stream as the Requested/Dispatched phases. Previous behavior generated a fresh ULID, splitting every revocation across two streams and breaking the projection stitch.
  • SecurityAlert projection. Migration 010 installs `security_alerts_projection` + sidecar AFTER INSERT trigger. Events were persisted but nothing projected them; audit caught silent data loss on the SIEM path.

Config coherence

  • `GATEWAY_TTY` host now falls back through the full chain `GATEWAY_TRAEFIK_TTY_HOST` → `GATEWAY_TTY_DOMAIN` → `GATEWAY_DOMAIN`. `.env.example` always documented the third rung; `config.go` finally implements it (this was the root of the staging crash loop).
  • Empty `CONTROL_GATEWAY_URL` is fatal at startup (was a warn); `RegistrationHandler.Register` has defense-in-depth refusal.
  • `deploy/compose.yml` now passes `GATEWAY_VALKEY_DB` through (was dead config).

Deploy hardening

  • `IMAGE_TAG` now required via `${IMAGE_TAG:?…}`, no more silent `:latest` upgrades on `docker compose pull`.
  • `redis-stack-server` pinned to `7.4.0-v0` so upstream changes don't silently land.
  • `initdb.d/01-indexer-user.sh` hard-fails on unset `INDEXER_POSTGRES_PASSWORD` instead of returning 0.

Docs

  • `cmd/control/README.md`: drop stale `:8080` from the `CONTROL_GATEWAY_URL` example (compose routes via `:443` with TCP passthrough).
  • `server/README.md`: correct `GATEWAY_WEB_LISTEN_ADDR` description (cleartext HTTP behind Traefik, not TLS).

Error-code parity guard

New unit test `TestErrorCodeParityWithTSSDK` walks `errors.go` and the sdk's `ts/errors.ts` via filesystem, asserts the two code sets match. Catches the drift the rc10 audit found (4 codes missing on TS side).

Deferred to follow-ups (explicit non-goals for this PR)

The rc10 plan also called for:

  • Revocation pub/sub (DeviceRevoked → close gateway stream) — significant new subsystem (Valkey pub/sub channel + gateway subscriber + stream close semantics). Worth its own design review and PR.
  • URL-knob consolidation into `GATEWAY_PUBLIC_URL` — invasive config-shape change, deserves its own breaking-release PR.
  • Legacy env alias fail-fast (`TTY_DOMAIN`, unprefixed `VALKEY_*`) — additive hardening that shouldn't block the security fixes in this PR.
  • Integration-test harness covering the full fail-open matrix end-to-end.
  • Multi-event append transactional wrapping (`CreateUser`, `LinkOrCreate`) — bigger refactor that touches several handlers.
  • `ON CONFLICT` on rotation projections — projection replay is a theoretical concern not seen in practice.

Each of these has been captured on the rc10 audit notes and can land as focused follow-ups without the scope creep risk of piling them all into one PR.

Audit claims verified and rejected

Publishing these explicitly so future devs don't chase ghosts:

  • ❌ LPS plaintext password leak in event metadata: already encrypted at rest via `h.encryptor.Encrypt(r.Password)` in `internal_handler.go:233`.
  • ❌ `inbox_worker.go` "appends both success AND failure events" for LUKS revocation: `return AppendEvent(...)` terminates the function; only one path runs.
  • ❌ Expired cert acceptance on gateway mTLS: TLS layer + `x509.Verify` default `time.Now()` already gates this.
  • ❌ `CONTROL_ENCRYPTION_KEY_REQUIRED=False` capital-F typo bypass: code uses literal `== "false"` compare; capital F falls into fail-closed branch.
  • ❌ OSQuery bypass of event store: `osquery_results` is operational data, not a projection — intended design.
  • ❌ Proto files out of sync with generated code: mtimes match.

Fixes #74.

Test plan

  • `go build ./...` clean across control, gateway, indexer
  • `go test ./internal/api/` including `TestLogin_GlobalPasswordAuthDisabled`, `TestErrorCodeParityWithTSSDK`
  • `go test ./internal/terminal/` including single-use + mismatch-restore regression tests
  • Full `go test ./...` including the long `internal/api` testcontainer run
  • Staging redeploy: `IMAGE_TAG=v2026.05-rc10 docker compose pull && docker compose up -d`, agent reinstall, confirm mTLS handshake completes without cert-unknown
  • Deliberately-broken `.env` dry-run: unset `GATEWAY_TTY_DOMAIN`, confirm gateway stays up and serves agent mTLS (fails-open)

Summary by CodeRabbit

  • New Features

    • Track, list, and acknowledge security alerts
    • Option to disable password authentication
    • Session tokens are single-use (consumed on successful validation)
  • Improvements

    • Stricter gateway/config validation with redacted logging and startup checks
    • More resilient startup/registry behavior and safer Traefik publishing
    • Compose tightened (required image tag, pinned image); init scripts fail fast on missing secrets
  • Documentation

    • Clarified gateway listener, host/port semantics, terminal token contract
  • Tests

    • New parity and validation tests for errors, gateway URLs, config, and token semantics

…e-IP / fail-open WIP

Security fix: `CONTROL_PASSWORD_AUTH_ENABLED=false` previously only
gated the SSO handler's ListAuthMethods response, which is a UI hint
for the login form. The Login RPC itself never checked the flag, so an
attacker POSTing directly to /pm.v1.ControlService/Login with valid
email+password still authenticated against any account whose
HasPassword column was true. This is the bypass a penetration test
would find immediately; SSO-only deployments were not SSO-only in the
enforced sense.

Plumb passwordAuthEnabled into AuthHandler and enforce it BEFORE the
DB lookup. Burn the dummy bcrypt cycle on the disabled path too so the
"is password auth disabled?" question doesn't leak via timing either.

Regression test: TestLogin_GlobalPasswordAuthDisabled covers the exact
bypass path (valid credentials, global switch off, expect
CodeUnauthenticated + password_login_disabled).

Also folds in the pre-rc10 WIP from the staging debugging session:
 - routableIP() helper replacing os.Hostname() for Traefik backend
   auto-derivation (container IDs are not in Docker's embedded DNS).
 - Fail-open on a malformed GATEWAY_PUBLIC_TERMINAL_URL_TEMPLATE so a
   broken optional template disables terminal registration with a
   warning instead of crash-looping the gateway.
 - GATEWAY_TRAEFIK_* defaults collapsed into the binary so compose.yml
   only needs GATEWAY_DOMAIN / GATEWAY_TTY_DOMAIN in the common case.

Remaining rc10 scope continues in subsequent commits on this branch.

Part of #74.
…KS stream-id coherence, deploy hardening

Gateway startup: convert five more os.Exit paths to Warn+feature-
disabled so a broken optional knob no longer takes agent mTLS down
with it.

 - registry.RegisterGateway failure → terminal disabled for this
   replica, agent mTLS keeps serving.
 - routableIP() failure for GATEWAY_INTERNAL_URL → admin fan-out
   disabled, mTLS keeps serving.
 - routableIP() failure for Traefik backend auto-derive → skip
   publish entirely rather than poison Redis KV with an empty
   backend; operator can still set GATEWAY_TRAEFIK_{MTLS,TTY}_BACKEND
   explicitly.
 - PublishTraefikRoute failure at startup → rely on the refresh
   goroutine's normal retry cadence; the previous crash loop caused
   the pm-mtls TCP route to expire during restart backoff and fall
   through to the HTTP router (wrong cert).
 - BootstrapHost without a derivable assignedHost → disable the
   bootstrap middleware instead of crashing.

The remaining os.Exit paths in gateway startup (TLS cert / key / CA
flag missing, ValkeyAddr / ControlURL missing, cert parse, tls.Config
build, HTTP/2 server configure) are truly fatal and stay fatal — the
gateway cannot serve its primary mTLS role without them.

Terminal tokens: enforce single-use via Valkey GETDEL (go-redis
GetDel) in the new SessionBackend.GetAndDelete primitive. Validate()
now atomically consumes the token on success, so a leaked bearer
token cannot replay a second WebSocket within the TTL. Forgery
attempts restore the unused entry with the remaining TTL so a
legitimate client is not locked out by a guess. Regression tests
TestTokenStore_Validate_IsSingleUse and
TestTokenStore_Validate_MismatchPreservesSession cover both paths.

LUKS revocation three-phase audit: the inbox worker now looks up
the luks_key stream ID minted at request time (Requested /
Dispatched phases in api/device_handler.go) via a new sqlc query
GetLuksRevocationStreamID, so the async Revoked / Failed event
lands on the SAME stream as the earlier phases. Previously the
worker generated a fresh ULID, splitting every revocation across
two streams and breaking the projection's stitch. Falls back to a
fresh ULID only if the Requested event can't be found — an orphan
Failed event is still better than silently dropping the agent's
report.

Config coherence:
 - GATEWAY_TTY host fallback now goes GATEWAY_TRAEFIK_TTY_HOST →
   GATEWAY_TTY_DOMAIN → GATEWAY_DOMAIN. .env.example has always
   documented the third rung; config.go finally implements it.
 - Empty CONTROL_GATEWAY_URL is fatal at startup (was only a warn),
   and the registration handler refuses to hand an empty URL back
   even if the startup guard ever regresses. Defence in depth on a
   path that was otherwise silently broken.
 - deploy/compose.yml: GATEWAY_VALKEY_DB passthrough added (the
   code reads it; without this the env var was dead).
 - IMAGE_TAG now required (fail with clear message if unset) rather
   than silently defaulting to :latest, which was causing
   unintended upgrades on every `docker compose pull`.
 - redis-stack-server pinned to 7.4.0-v0 so upstream changes don't
   silently land.
 - initdb.d/01-indexer-user.sh now hard-fails on unset
   INDEXER_POSTGRES_PASSWORD instead of returning 0, so the
   misconfiguration is visible at init time rather than hours later
   at indexer first-connect.

Auth: TestLogin_GlobalPasswordAuthDisabled regression test (from
prior commit) covers the rc10 bypass fix on the password-auth
disable path.

Part of #74.
…uard

SecurityAlert projection: new migration 010 installs a
security_alerts_projection table plus a sidecar AFTER INSERT
trigger that routes SecurityAlert / SecurityAlertAcknowledged
events into it. Previously the inbox worker appended the event to
the device stream but no projection handled it, so the UI had no
way to list alerts without scanning the append-only events log.
Using a sidecar trigger (instead of rewriting project_event())
keeps the change additive and avoids the risk of dropping an
existing case arm from the central dispatcher.

OutputChunk events deliberately kept projection-less: they already
have a dedicated retrieval path via LoadOutputChunks which scans
events by (stream_id, event_type='OutputChunk') in order. That's
the intended access pattern for streaming output — a projection
table would just duplicate the events table.

Docs: fix the stale :8080 example in cmd/control/README.md (compose
routes via :443 with TCP passthrough; port was a leftover from
before the SNI collapse). Correct the server README claim that
GATEWAY_WEB_LISTEN_ADDR is a TLS listener — it's cleartext HTTP
behind Traefik, which terminates TLS upstream.

Error-code parity guard: new unit test
TestErrorCodeParityWithTSSDK in internal/api walks errors.go and
the sdk's ts/errors.ts via filesystem and asserts the two code
sets match. Skips gracefully when run outside the multi-repo
workspace (so `go test ./...` from a read-only mirror still
passes). This is the test that would have caught the 4-code
drift the rc10 audit found.

Part of #74.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@PaulDotterer has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 33 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 39 minutes and 33 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 42e5ccab-2584-4f92-99de-85dc959951b2

📥 Commits

Reviewing files that changed from the base of the PR and between 7f4179a and e9d8f44.

📒 Files selected for processing (4)
  • cmd/gateway/main.go
  • deploy/.env.example
  • deploy/compose.yml
  • deploy/deploy.sh
📝 Walkthrough

Walkthrough

Tightened startup validation for gateway and control; added gateway config validation and safer Traefik self-registration; terminal session tokens are now single-use via atomic GetAndDelete; password-auth can be globally disabled; security-alerts projection and TypeScript error-code parity test added.

Changes

Cohort / File(s) Summary
Docs & env examples
README.md, cmd/control/README.md, deploy/.env.example
Clarify terminal WebSocket is cleartext (TLS terminates at Traefik); update CONTROL_GATEWAY_URL example and TTY/domain collision/fallback docs.
Compose & init scripts
deploy/compose.yml, deploy/initdb.d/01-indexer-user.sh, deploy/.env.example
Require IMAGE_TAG, pin valkey image, add GATEWAY_VALKEY_DB, and make indexer init fail-fast when INDEXER_POSTGRES_PASSWORD is unset.
Gateway startup & registry
cmd/gateway/main.go, internal/config/config.go
Add Config.Validate(), perform early config validation, refactor registry initialization to use a shared Valkey client, make Traefik registration fail-open with timeouts, and strengthen host-from-URL validation.
Control startup validation
cmd/control/main.go
Always redact CONTROL_GATEWAY_URL in startup logs and validate it at startup, exiting on invalid config.
Gateway/Internal API URL validation
internal/api/registration_handler.go, internal/api/validate_gateway_url_test.go, internal/api/registration_handler_test.go
Add exported ValidateGatewayURL and RedactGatewayURL; constructors/handlers validate/redact gateway URLs; tests added/updated.
Auth handler & service wiring
internal/api/auth_handler.go, internal/api/auth_handler_test.go, internal/api/service.go
Add passwordAuthEnabled flag to AuthHandler, update constructor, short-circuit password login when disabled, and add tests including a disabled-password regression.
Terminal token store & backends
internal/terminal/store.go, internal/terminal/store_test.go, internal/terminal/fake_backend.go, internal/terminal/valkey_backend.go
Add SessionBackend.GetAndDelete; Validate now atomically consumes tokens on success (single-use), restores session on forgery, update backends and tests.
Internal handler docs & tests
internal/api/internal_handler.go, internal/api/internal_handler_test.go
Document and test atomic consume-on-success semantics for ProxyValidateTerminalToken; tests updated to expect single-use.
Control inbox & LUKS stream handling
internal/control/inbox_worker.go, internal/store/queries/luks.sql
Resolve revocation stream IDs via new GetLuksRevocationStreamID query to append terminal outcomes to original streams and avoid forking event history.
Security alerts projection & queries
internal/store/migrations/010_security_alerts_projection.sql, internal/store/queries/security_alerts.sql
Add security_alerts_projection table, projector functions and trigger, plus sqlc queries for list/get/count operations and acknowledgment handling.
Error code parity test
internal/api/errors_parity_test.go
Add TestErrorCodeParityWithTSSDK to verify server snake_case error codes match TypeScript SDK exports; supports CI gating via env var.
Config & validation tests
internal/config/config_test.go, various README/handler comment updates
Add tests for Config.Validate() invariants and refine internal handler comments regarding token semantics.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant GatewayAPI as Internal API
    participant SessionStore as Terminal Store
    participant Backend as SessionBackend
    rect rgba(150,200,100,0.5)
    Note over Client,Backend: Atomic consume-on-success flow
    Client->>GatewayAPI: ProxyValidateTerminalToken(sessionID, bearer)
    GatewayAPI->>SessionStore: Validate(sessionID, bearer)
    SessionStore->>Backend: GetAndDelete(sessionID) (atomic)
    Backend->>Backend: retrieve key and delete
    alt payload present
        SessionStore->>SessionStore: parse payload, compare hash
        alt match
            SessionStore-->>GatewayAPI: session metadata (success)
            Note right of Backend: token consumed (single-use)
        else mismatch
            SessionStore->>Backend: Restore(sessionID, payload) with remaining TTL
            SessionStore-->>GatewayAPI: ErrTokenMismatch (auth failure)
        end
    else missing/expired
        SessionStore-->>GatewayAPI: ErrTokenNotFound / Unauthenticated
    end
    GatewayAPI-->>Client: success or Unauthenticated
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I nibble logs and tidy threads,

one token eaten — no reposts spread;
Gateways murmur, registries mend,
Alerts hop up, then find their end;
Hop—secure, we patch and tend.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 34.04% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the main consolidation theme (rc10 fixes) and highlights key changes: auth bypass, fail-open gateway, deploy hardening, projections, and data integrity.
Linked Issues check ✅ Passed The PR addresses all major requirements from issue #74: replaces os.Hostname() with routableIP(), moves Traefik defaults into gateway binary, implements TTY/MTLS host fallback chains, simplifies compose.yml, and adds validation to prevent invalid backend addresses.
Out of Scope Changes check ✅ Passed All changes are directly aligned with PR objectives: security fixes (password auth, token single-use), gateway fail-open hardening, data integrity (LUKS/security_alerts), config validation, and deployment hardening—no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/rc10-consolidation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…rityAlert projection, parity-test CI hook

Address review findings on #76 before merge:

URL validation was too weak. The previous check was url.Parse + empty
string guard, which lets obviously-invalid shapes through: bare
hostnames like "gateway.example.com" parse without error but aren't
absolute URLs; http:// downgrades silently; wss:// confuses the
streaming client; userinfo in the URL leaks on every enrollment
response. Introduce api.ValidateGatewayURL that requires scheme=https,
non-empty host, no userinfo, no fragment. Wire it into
cmd/control/main.go (fatal on violation) and into
NewRegistrationHandler (panic on violation — programmer error by the
time we reach the constructor) and Register (re-validation for
defence in depth). New TestValidateGatewayURL covers all the footgun
shapes the review flagged. Existing registration_handler_test.go
fixtures used wss://gateway.test:443 which the new validator
correctly rejects — updated to https://gateway.test:8080 to match
the production CONTROL_GATEWAY_URL shape.

SecurityAlert projection idempotency. Migration 010 was using a
fresh UUID as the primary key, so event replay (backfill, rebuild,
trigger re-fire) would duplicate alerts. Key the projection by the
originating event_id (REFERENCES events(id)) and use
INSERT ... ON CONFLICT (event_id) DO NOTHING in the projector. This
is the event-sourcing idempotency pattern that keeps projections
deterministic across replays. SecurityAlertAcknowledged payloads
carry alert_id = the event_id of the originating SecurityAlert, so
the UPDATE path is unchanged in spirit but keyed by the stable id.

SecurityAlert is explicitly Phase 1 — the migration + sidecar trigger
+ sqlc queries (ListSecurityAlertsForDevice,
ListUnacknowledgedSecurityAlerts, GetSecurityAlert,
CountUnacknowledgedSecurityAlerts) make the projection queryable from
Go. The RPC surface and web UI wiring remain Phase 2 because they
need a proto change (ControlService.ListSecurityAlerts), which is
better in its own PR to keep the scope tight.

LUKS stream-id correctness assumption. The inbox worker looks up
the luks_key stream ID by (device_id, action_id) and picks the
latest matching Requested/Dispatched event via ORDER BY sequence_num
DESC LIMIT 1. Documented the assumption that there is at most one
outstanding revocation per (device, action) at a time, and what the
behavior is when the assumption is violated (worst case: the latest
request gets the terminal event, older abandoned streams miss their
terminal — not a projection correctness issue, but worth flagging
for audit export).

Parity test CI hook. TestErrorCodeParityWithTSSDK previously silently
skipped when the sibling SDK repo wasn't present, so standalone
server CI couldn't rely on it. Now reads PM_SDK_TS_ERRORS env var as
the primary source (CI sets this when it checks out the SDK beside
server), falls back to the local dev-workspace path, and loudly
fails when PM_SDK_PARITY_REQUIRED=1 is set but the file still isn't
reachable. Skips with a clear log line otherwise so a plain
`go test ./...` in a standalone server checkout still passes.

Part of #76 review feedback.
@PaulDotterer
Copy link
Copy Markdown
Contributor Author

@coderabbitai Full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (6)
internal/api/auth_handler_test.go (1)

77-91: Regression test correctly covers the bypass; consider also asserting the error code string.

The test validates the critical behavior: valid credentials + passwordAuthEnabled=falseCodeUnauthenticated. As an optional hardening, you could also assert the returned error carries ErrPasswordLoginDisabled (rather than ErrInvalidCredentials) so a future refactor that collapses the branch into the generic "invalid credentials" path would fail this test. Existing tests like TestLogin_PasswordDisabledByProvider already use assert.Contains(t, err.Error(), "password login is disabled") — the same check here would keep the error-surface contract explicit.

🧪 Optional tightening
 	require.Error(t, err, "login must fail when global password auth is disabled")
 	assert.Equal(t, connect.CodeUnauthenticated, connect.CodeOf(err))
+	assert.Contains(t, err.Error(), "password login is disabled")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/auth_handler_test.go` around lines 77 - 91, The test
TestLogin_GlobalPasswordAuthDisabled currently only checks the connect error
code; tighten it by also asserting the returned error string indicates the
specific disabled-password case (match existing pattern used in
TestLogin_PasswordDisabledByProvider) — e.g., assert that err.Error() contains
"password login is disabled" or otherwise contains the sentinel
ErrPasswordLoginDisabled text so the test fails if the handler collapses this
branch into a generic "invalid credentials" path; update the assertion in
TestLogin_GlobalPasswordAuthDisabled accordingly.
internal/store/queries/security_alerts.sql (2)

36-37: Avoid narrowing COUNT(*) to int.

COUNT(*) in PostgreSQL returns bigint; explicitly casting to ::int silently truncates at ~2.1B and also propagates an int32 return type into the generated Go code, which is inconsistent with buildNextPageToken that works with int64 totals (see internal/api/helpers.go:77-83). Prefer returning bigint.

♻️ Proposed change
 -- name: CountUnacknowledgedSecurityAlerts :one
-SELECT COUNT(*)::int FROM security_alerts_projection WHERE NOT acknowledged;
+SELECT COUNT(*)::bigint FROM security_alerts_projection WHERE NOT acknowledged;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/store/queries/security_alerts.sql` around lines 36 - 37, The query
CountUnacknowledgedSecurityAlerts currently casts COUNT(*) to ::int which
truncates large counts and causes a mismatch with functions expecting int64;
update the SQL in CountUnacknowledgedSecurityAlerts to return bigint (remove the
::int cast) so the result remains BIGINT and aligns with buildNextPageToken's
int64 handling; ensure any generated Go types or usages of the
security_alerts_projection count expect an int64 accordingly.

11-28: Missing companion CountSecurityAlertsForDevice for paginated responses.

ListSecurityAlertsForDevice returns a page but there is no count query scoped to device_id (with the same include_acknowledged filter). buildNextPageToken needs a totalCount to decide whether to emit a next token (internal/api/helpers.go:77-83), so the forthcoming ListSecurityAlerts handler will either have to fabricate one or over-emit tokens. Consider adding a matching :one count now while the schema is fresh in memory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/store/queries/security_alerts.sql` around lines 11 - 28, Add a
companion count query named CountSecurityAlertsForDevice (:one) that returns the
total number of alerts scoped to a device_id and using the same
include_acknowledged filter logic as ListSecurityAlertsForDevice (i.e., WHERE
device_id = $1 AND (sqlc.arg(include_acknowledged)::bool OR NOT acknowledged));
this count will be used by buildNextPageToken/internal/api/helpers.go to compute
totalCount for pagination so the ListSecurityAlerts handler can emit correct
next-page tokens.
internal/store/migrations/010_security_alerts_projection.sql (1)

67-75: Silent no-op when acknowledging an unknown alert_id.

If a SecurityAlertAcknowledged event carries an alert_id that does not match any row (e.g., ack arrives before the alert projection due to out-of-order replay, or an operator ack'd an alert that was garbage-collected), the UPDATE affects 0 rows and the trigger completes successfully, so there is no trace in projection_errors. Consider checking FOUND after the UPDATE and raising so the error lands in projection_errors for ops visibility.

♻️ Proposed change
         WHEN 'SecurityAlertAcknowledged' THEN
             UPDATE security_alerts_projection
             SET acknowledged = TRUE,
                 acknowledged_at = event.occurred_at,
                 acknowledged_by = event.data->>'acknowledged_by'
-            WHERE event_id::TEXT = event.data->>'alert_id';
+            WHERE event_id = (event.data->>'alert_id')::uuid;
+            IF NOT FOUND THEN
+                RAISE EXCEPTION 'SecurityAlertAcknowledged references unknown alert_id=%', event.data->>'alert_id';
+            END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/store/migrations/010_security_alerts_projection.sql` around lines 67
- 75, The UPDATE for the 'SecurityAlertAcknowledged' CASE can silently affect 0
rows; after updating security_alerts_projection using event.data->>'alert_id'
check FOUND and if false raise an error (include the alert_id and event_id in
the message) so the trigger surfaces the problem into projection_errors; modify
the CASE branch for SecurityAlertAcknowledged to perform the UPDATE then IF NOT
FOUND THEN RAISE EXCEPTION ... END IF to ensure missing-ack targets are
recorded.
internal/api/errors_parity_test.go (1)

143-148: Dead sanity guard — remove.

strings.Count("Err", "E") is a compile-time-constant expression over string literals; it cannot guard against "string-builder typos" in the tests since the tests don't use strings.Count or assemble these literals dynamically. It adds noise without value.

♻️ Proposed removal
-// Guard against string-builder typos in the tests above.
-func init() {
-	if strings.Count("Err", "E") != 1 {
-		panic("sanity")
-	}
-}

If strings becomes unused after this, also drop the import.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/api/errors_parity_test.go` around lines 143 - 148, Remove the dead
"sanity guard" init() block that calls strings.Count("Err", "E") — it's a
compile-time constant check and adds no value; delete the entire init() function
(the strings.Count call and panic) and, if that makes the strings import unused,
remove the strings import as well. Locate the init() function with the
strings.Count("Err", "E") call and cleanly delete it and any resultant
unused-import references.
cmd/gateway/main.go (1)

253-259: Redundant nil check — ensureGatewayRegistry is already idempotent.

The closure at L154-166 returns early when gatewayReg != nil, so the outer guard here just adds a layer. Drop it and call ensureGatewayRegistry() unconditionally to keep the intent obvious.

-		if gatewayReg == nil {
-			// Need a Valkey-backed registry even when terminal URLs are
-			// disabled; Traefik self-registration is the agent mTLS
-			// routing layer.
-			gatewayReg = ensureGatewayRegistry()
-		}
+		// Need a Valkey-backed registry even when terminal URLs are
+		// disabled; Traefik self-registration is the agent mTLS
+		// routing layer.
+		gatewayReg = ensureGatewayRegistry()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/gateway/main.go` around lines 253 - 259, The nil check around gatewayReg
is redundant because ensureGatewayRegistry() is idempotent; when
cfg.TraefikSelfRegister is true, call ensureGatewayRegistry() unconditionally.
Remove the inner if gatewayReg == nil block and invoke ensureGatewayRegistry()
directly in the cfg.TraefikSelfRegister branch so the code simply calls
ensureGatewayRegistry() when TraefikSelfRegister is enabled, relying on
ensureGatewayRegistry() to no-op if gatewayReg is already set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/gateway/main.go`:
- Around line 366-414: The internal URL registration is skipped when gatewayReg
== nil, making admin fan-out unavailable even if GATEWAY_INTERNAL_URL is
configured; call ensureGatewayRegistry() to create or obtain a registry before
using gatewayReg. Update the block that currently checks "if gatewayReg != nil"
to first call ensureGatewayRegistry(ctx, ...) (or the existing function name
ensureGatewayRegistry) to initialize gatewayReg when needed, then proceed to
compute internalURL (using cfg.InternalURL, routableIP(), portOfListenAddr) and
call gatewayReg.RegisterGatewayInternal/gatewayReg.RegisterGatewayInternal
refresh goroutine with gatewayID and registry.DefaultGatewayTTL; ensure errors
from ensureGatewayRegistry are logged and handled so we don't silently skip
registration.

In `@internal/api/errors_parity_test.go`:
- Around line 113-118: The TS extraction regex
export\s+const\s+Err\w+\s*=\s*'([a-z][a-z0-9_]*)' is too strict; update the
regex used in the TS extraction (the variable/match logic around re and matches
in the block) to accept single, double, or backtick quotes and optionally allow
a TypeScript type annotation (e.g., : string) before the = so it robustly
captures the error identifier; make the same quote-tolerance change to the
server-side regex used earlier (the regex at the other extraction block
referenced in the file, originally around line 56) so both sides are symmetric;
and remove the pointless init() sanity check function call/logic (the init()
sanity block that does strings.Count("Err", "E") != 1) to clean up dead code.
- Line 92: The fallback path in the candidates list uses too many parent
traversals; update the filepath.Join call that appends
"../../../../sdk/ts/errors.ts" (the candidates append) to use
"../../../sdk/ts/errors.ts" so it correctly resolves the SDK from internal/api/,
or add an additional fallback entry for "../../../sdk/ts/errors.ts" alongside
the existing one to cover the documented repo layout; modify the append(s) that
reference the SDK file in the candidates construction to include the corrected
relative path.

In `@internal/api/registration_handler.go`:
- Around line 78-79: The panic/log currently includes the raw gatewayURL (seen
in NewRegistrationHandler when calling ValidateGatewayURL and similarly at the
other checks around lines 102-105), which can leak userinfo; before including
gatewayURL in any panic or structured log, parse it with net/url and remove User
(userinfo) and/or replace the authority with a redacted form (e.g., host only or
host with "[REDACTED]")—implement a small helper like RedactGatewayURL(urlStr
string) string and call it wherever the code panics or logs the invalid gateway
URL (e.g., in the NewRegistrationHandler validation error paths) so only the
redacted URL appears in error output.
- Around line 51-53: Replace the current host emptiness check that uses u.Host
with u.Hostname() so malformed URLs like "https://:8443" (empty hostname with
only a port) are rejected; specifically update the check in
registration_handler.go to use u.Hostname() == "" and return the same error
message, and add the regression test case {"empty hostname with port",
"https://:8443", true, "host"} to internal/api/validate_gateway_url_test.go so
validateGatewayURL (or the test suite for Host validation) verifies this
scenario.

In `@internal/config/config.go`:
- Around line 196-206: The TraefikTTYHost fallback to GATEWAY_DOMAIN can collide
with MTLSHost when terminal sessions are enabled; update the config load to
detect this and either warn or fail fast: after computing TraefikTTYHost (via
firstNonEmpty) and MTLSHost and reading GATEWAY_WEB_LISTEN_ADDR and TTYBackend,
add a startup check that if GATEWAY_WEB_LISTEN_ADDR is non-empty (terminal
enabled) and TTYBackend is non-empty and TraefikTTYHost == MTLSHost then log a
clear error (or return a config validation error) explaining the SNI/Host
collision and refuse startup, otherwise emit a warning; also update the existing
fallback comment near TraefikTTYHost to note that the third rung is only safe
when an explicit GATEWAY_TTY_DOMAIN is provided when terminals are enabled.

In `@internal/control/inbox_worker.go`:
- Around line 562-573: The current lookup treats any error from
GetLuksRevocationStreamID as "not found" and creates a fresh ULID, which can
fork streams on transient DB/context errors; change the logic in the inbox
worker around GetLuksRevocationStreamID so that you only assign a new ulid to
luksStreamID when errors.Is(err, sql.ErrNoRows) (or your store's sentinel "no
rows" error), and for any other err return or propagate the error (so Asynq can
retry) — ensure GetLuksRevocationStreamID, luksStreamID and the surrounding
function return the non-ErrNoRows error and do not swallow context cancellation.

In `@internal/store/migrations/010_security_alerts_projection.sql`:
- Around line 67-72: The UPDATE uses a cast on event_id which prevents the PK
index from being used; change the predicate in the WHEN
'SecurityAlertAcknowledged' branch to cast the right-hand side value to UUID
(cast event.data->>'alert_id' to UUID) instead of casting event_id to TEXT so
the WHERE compares event_id (UUID) to a UUID value and leverages the primary-key
index on security_alerts_projection; this also ensures malformed alert_id values
surface as projection errors via the sidecar trigger rather than causing a
full-table scan that silently matches nothing.

---

Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 253-259: The nil check around gatewayReg is redundant because
ensureGatewayRegistry() is idempotent; when cfg.TraefikSelfRegister is true,
call ensureGatewayRegistry() unconditionally. Remove the inner if gatewayReg ==
nil block and invoke ensureGatewayRegistry() directly in the
cfg.TraefikSelfRegister branch so the code simply calls ensureGatewayRegistry()
when TraefikSelfRegister is enabled, relying on ensureGatewayRegistry() to no-op
if gatewayReg is already set.

In `@internal/api/auth_handler_test.go`:
- Around line 77-91: The test TestLogin_GlobalPasswordAuthDisabled currently
only checks the connect error code; tighten it by also asserting the returned
error string indicates the specific disabled-password case (match existing
pattern used in TestLogin_PasswordDisabledByProvider) — e.g., assert that
err.Error() contains "password login is disabled" or otherwise contains the
sentinel ErrPasswordLoginDisabled text so the test fails if the handler
collapses this branch into a generic "invalid credentials" path; update the
assertion in TestLogin_GlobalPasswordAuthDisabled accordingly.

In `@internal/api/errors_parity_test.go`:
- Around line 143-148: Remove the dead "sanity guard" init() block that calls
strings.Count("Err", "E") — it's a compile-time constant check and adds no
value; delete the entire init() function (the strings.Count call and panic) and,
if that makes the strings import unused, remove the strings import as well.
Locate the init() function with the strings.Count("Err", "E") call and cleanly
delete it and any resultant unused-import references.

In `@internal/store/migrations/010_security_alerts_projection.sql`:
- Around line 67-75: The UPDATE for the 'SecurityAlertAcknowledged' CASE can
silently affect 0 rows; after updating security_alerts_projection using
event.data->>'alert_id' check FOUND and if false raise an error (include the
alert_id and event_id in the message) so the trigger surfaces the problem into
projection_errors; modify the CASE branch for SecurityAlertAcknowledged to
perform the UPDATE then IF NOT FOUND THEN RAISE EXCEPTION ... END IF to ensure
missing-ack targets are recorded.

In `@internal/store/queries/security_alerts.sql`:
- Around line 36-37: The query CountUnacknowledgedSecurityAlerts currently casts
COUNT(*) to ::int which truncates large counts and causes a mismatch with
functions expecting int64; update the SQL in CountUnacknowledgedSecurityAlerts
to return bigint (remove the ::int cast) so the result remains BIGINT and aligns
with buildNextPageToken's int64 handling; ensure any generated Go types or
usages of the security_alerts_projection count expect an int64 accordingly.
- Around line 11-28: Add a companion count query named
CountSecurityAlertsForDevice (:one) that returns the total number of alerts
scoped to a device_id and using the same include_acknowledged filter logic as
ListSecurityAlertsForDevice (i.e., WHERE device_id = $1 AND
(sqlc.arg(include_acknowledged)::bool OR NOT acknowledged)); this count will be
used by buildNextPageToken/internal/api/helpers.go to compute totalCount for
pagination so the ListSecurityAlerts handler can emit correct next-page tokens.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 352d924a-9903-49b2-add3-7dcbf6f89c2f

📥 Commits

Reviewing files that changed from the base of the PR and between 3d0dd7a and 93e4fa1.

⛔ Files ignored due to path filters (3)
  • internal/store/generated/luks.sql.go is excluded by !**/generated/**
  • internal/store/generated/models.go is excluded by !**/generated/**
  • internal/store/generated/security_alerts.sql.go is excluded by !**/generated/**
📒 Files selected for processing (22)
  • README.md
  • cmd/control/README.md
  • cmd/control/main.go
  • cmd/gateway/main.go
  • deploy/compose.yml
  • deploy/initdb.d/01-indexer-user.sh
  • internal/api/auth_handler.go
  • internal/api/auth_handler_test.go
  • internal/api/errors_parity_test.go
  • internal/api/registration_handler.go
  • internal/api/registration_handler_test.go
  • internal/api/service.go
  • internal/api/validate_gateway_url_test.go
  • internal/config/config.go
  • internal/control/inbox_worker.go
  • internal/store/migrations/010_security_alerts_projection.sql
  • internal/store/queries/luks.sql
  • internal/store/queries/security_alerts.sql
  • internal/terminal/fake_backend.go
  • internal/terminal/store.go
  • internal/terminal/store_test.go
  • internal/terminal/valkey_backend.go

Comment thread cmd/gateway/main.go
Comment thread internal/api/errors_parity_test.go
Comment thread internal/api/errors_parity_test.go Outdated
Comment thread internal/api/registration_handler.go Outdated
Comment thread internal/api/registration_handler.go Outdated
Comment thread internal/config/config.go
Comment thread internal/control/inbox_worker.go
Comment thread internal/store/migrations/010_security_alerts_projection.sql Outdated
Comment on lines +270 to 291
payload, err := s.backend.GetAndDelete(ctx, sessionID)
if err != nil {
return nil, err
}
var session Session
if err := json.Unmarshal(payload, &session); err != nil {
return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
}
if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
// Forgery attempt — restore the real session so the
// legitimate client isn't locked out. Compute the remaining
// TTL from ExpiresAt; if already expired we just drop it.
remaining := session.ExpiresAt.Sub(s.now())
if remaining > 0 {
if restoreErr := s.backend.Set(ctx, sessionID, payload, remaining); restoreErr != nil {
// Log via caller — we don't have a logger here. Returning
// mismatch is the priority; the caller surfaces it as
// Unauthenticated and the audit pipeline flags it.
return nil, ErrTokenMismatch
}
}
return nil, ErrTokenMismatch
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid deleting the session before proving the bearer matches.

GetAndDelete removes the key before the token hash comparison. A forged request with a known session_id can briefly erase the real session; a legitimate concurrent Validate in that window gets ErrTokenNotFound, and a restore failure drops the session entirely.

Prefer a conditional consume: first read and compare, then atomically consume only after the bearer matches, or move compare-and-delete into the backend as one atomic operation.

🐛 Safer minimal shape using read-before-consume
 func (s *TokenStore) Validate(ctx context.Context, sessionID, bearerToken string) (*Session, error) {
-	payload, err := s.backend.GetAndDelete(ctx, sessionID)
+	payload, err := s.backend.Get(ctx, sessionID)
 	if err != nil {
 		return nil, err
 	}
 	var session Session
 	if err := json.Unmarshal(payload, &session); err != nil {
 		return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
 	}
 	if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
-		// Forgery attempt — restore the real session so the
-		// legitimate client isn't locked out. Compute the remaining
-		// TTL from ExpiresAt; if already expired we just drop it.
-		remaining := session.ExpiresAt.Sub(s.now())
-		if remaining > 0 {
-			if restoreErr := s.backend.Set(ctx, sessionID, payload, remaining); restoreErr != nil {
-				// Log via caller — we don't have a logger here. Returning
-				// mismatch is the priority; the caller surfaces it as
-				// Unauthenticated and the audit pipeline flags it.
-				return nil, ErrTokenMismatch
-			}
-		}
 		return nil, ErrTokenMismatch
 	}
+
+	payload, err = s.backend.GetAndDelete(ctx, sessionID)
+	if err != nil {
+		return nil, err
+	}
+	if err := json.Unmarshal(payload, &session); err != nil {
+		return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
+	}
+	if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
+		return nil, ErrTokenMismatch
+	}
 	return &session, nil
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
payload, err := s.backend.GetAndDelete(ctx, sessionID)
if err != nil {
return nil, err
}
var session Session
if err := json.Unmarshal(payload, &session); err != nil {
return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
}
if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
// Forgery attempt — restore the real session so the
// legitimate client isn't locked out. Compute the remaining
// TTL from ExpiresAt; if already expired we just drop it.
remaining := session.ExpiresAt.Sub(s.now())
if remaining > 0 {
if restoreErr := s.backend.Set(ctx, sessionID, payload, remaining); restoreErr != nil {
// Log via caller — we don't have a logger here. Returning
// mismatch is the priority; the caller surfaces it as
// Unauthenticated and the audit pipeline flags it.
return nil, ErrTokenMismatch
}
}
return nil, ErrTokenMismatch
payload, err := s.backend.Get(ctx, sessionID)
if err != nil {
return nil, err
}
var session Session
if err := json.Unmarshal(payload, &session); err != nil {
return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
}
if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
return nil, ErrTokenMismatch
}
payload, err = s.backend.GetAndDelete(ctx, sessionID)
if err != nil {
return nil, err
}
if err := json.Unmarshal(payload, &session); err != nil {
return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
}
if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
return nil, ErrTokenMismatch
}
return &session, nil

Address the 8 actionable + 6 nitpick comments CodeRabbit posted on
#76 (run 352d924a) plus the reviewer's blocker finding on the
terminal-token contract.

Blocker — terminal token contract mismatch
  TestProxyValidateTerminalToken_DoesNotConsumeToken was the pre-rc10
  contract ("validation does NOT consume the token") — which is
  exactly the replay vulnerability the rc10 audit flagged. The
  updated terminal.TokenStore.Validate consumes on success via GETDEL,
  so the test was asserting the old contract and failing under the
  new one. Replace with TestProxyValidateTerminalToken_IsSingleUse
  asserting the correct rc10 semantic: first validation succeeds,
  second returns Unauthenticated. Rewrite the RPC doc comment on
  ProxyValidateTerminalToken to document the single-use contract
  and the forgery-restore behaviour. terminal_bridge.go only calls
  the RPC once per WebSocket and uses the returned metadata for
  the connection's lifetime, so single-use is compatible with the
  real gateway flow.

CodeRabbit actionables

  • internal/api/errors_parity_test.go
    - Path used 4 `..` which over-traverses from server/internal/api
      up to / instead of to the workspace root. Primary fallback is
      now 3 `..` (workspace layout); the 4-`..` candidate is kept
      as a secondary fallback for harnesses that run tests from one
      level higher. PM_SDK_TS_ERRORS env var still takes precedence.
    - Relaxed the TS extraction regex to accept single, double, or
      backtick quotes plus an optional `: string` type annotation
      so a future `export const ErrFoo: string = "…"` refactor
      still matches. Symmetric change not needed on the Go side —
      errors.go uses simple `ErrX = "…"` literals.
    - Removed the dead strings.Count("Err", "E") sanity-init stub;
      it was a compile-time constant and guarded nothing.

  • internal/api/registration_handler.go
    - ValidateGatewayURL now uses u.Hostname() rather than u.Host so
      pathological shapes like "https://:8443" (port only, no host)
      are rejected — u.Host would accept them. The
      validate_gateway_url_test.go suite gains a regression case for
      exactly that shape.
    - Exported RedactGatewayURL so cmd/control/main.go can strip
      userinfo from the URL before logging (the validator rejects
      URLs with credentials, but the failure log should not echo
      them back verbatim). Called on all three sites that surface
      the raw URL: main.go startup error, NewRegistrationHandler
      panic, and Register's defensive re-validation.

  • internal/store/migrations/010_security_alerts_projection.sql
    - SecurityAlertAcknowledged UPDATE now casts alert_id to UUID
      on the RHS instead of casting event_id to TEXT; lets the
      primary-key index satisfy the predicate and surfaces a
      malformed alert_id as a projection error rather than a
      silent full-scan no-match.
    - If the UPDATE hits zero rows (out-of-order replay, purged
      alert) RAISE EXCEPTION so the sidecar trigger's handler
      logs it into projection_errors. Previously silent.

  • internal/store/queries/security_alerts.sql
    - CountUnacknowledgedSecurityAlerts now returns bigint instead
      of ::int, matching PostgreSQL's native COUNT(*) return and
      keeping parity with buildNextPageToken's int64 handling.
    - Added CountSecurityAlertsForDevice companion so the forthcoming
      ListSecurityAlerts handler has the total it needs to compute
      next-page tokens correctly.

  • internal/control/inbox_worker.go
    - handleRevokeLuksDeviceKeyResult no longer masks every error
      from GetLuksRevocationStreamID as "not found". Only
      pgx.ErrNoRows triggers the fresh-ULID fallback (the genuinely-
      missing-Requested case). Any other error (context, transient
      DB, etc.) returns so Asynq retries rather than forking the
      audit stream on every DB flake.

  • internal/config/config.go
    - New Config.Validate() method catches the TTY/MTLS host
      collision when terminal + Traefik self-register are both
      enabled. Without this, Traefik's TCP passthrough router
      for the shared SNI wins over the TTY HTTP router and
      WebSocket handshakes silently break. cfg.Validate() is
      called at the top of cmd/gateway/main.go so the failure is
      visible at startup rather than on first terminal attempt.
    - TestValidate_TTYMTLSHostCollision covers the four shapes
      (collision+terminal-on, collision+terminal-off, distinct+
      terminal-on, collision+self-register-off).

  • cmd/gateway/main.go
    - Dropped the redundant `if gatewayReg == nil` guard before
      ensureGatewayRegistry() inside the TraefikSelfRegister block
      — the closure is idempotent, the guard was noise.
    - Moved the InternalURL registration block out from behind
      `if gatewayReg != nil` and now unconditionally calls
      ensureGatewayRegistry() before it, so admin fan-out stays
      available even when Traefik self-register and terminal URL
      are both off (previously the entire block silently skipped
      if neither feature had already initialised the registry).

CodeRabbit nitpick

  • internal/api/auth_handler_test.go
    TestLogin_GlobalPasswordAuthDisabled now asserts the error
    string contains "password login is disabled" in addition to
    the connect code check. A future refactor that collapses the
    branch into the generic invalid-credentials path would pass
    the code assertion but fail the substring — which is exactly
    the signal we want.

Part of #76 review round 2.
Reviewer called out a coherence lie introduced in the rc10 review
round 2: config.go's comment above the TraefikTTYHost fallback still
promised single-domain terminal deployments, but the new
Config.Validate() correctly rejects that shape when terminal +
Traefik self-registration are both on. The code was right; the
comment and .env.example weren't.

config.go: rewrite the comment block so it states that the
GATEWAY_DOMAIN rung only works when the terminal listener is
disabled or when Traefik self-registration is off, and that a
distinct GATEWAY_TTY_DOMAIN is required in the combined case.

.env.example: same clarification in operator-facing terms, with
the rationale (Traefik TCP passthrough shadows the TTY HTTP router
on a shared SNI) and the fail-fast behavior at startup so an
operator hitting this gets an immediate error instead of a silent
WebSocket break.

No code change; coherence fix only.
@PaulDotterer
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/config/config.go (1)

160-171: ⚠️ Potential issue | 🟡 Minor

Update the defaults summary to match the new fallback chain.

Line 169 still says TTYHost falls back to empty, but lines 211-215 now fall back to GATEWAY_DOMAIN.

Suggested wording
-	//   * TTYHost      = GATEWAY_TTY_DOMAIN (falls back to empty -> TTY router disabled)
+	//   * TTYHost      = GATEWAY_TTY_DOMAIN, then GATEWAY_DOMAIN
+	//                    (Validate rejects unsafe shared-host terminal+self-register configs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/config.go` around lines 160 - 171, Update the Traefik
defaults summary comment so it matches the new fallback chain: change the
TTYHost bullet (currently "falls back to empty -> TTY router disabled") to
indicate it now falls back to GATEWAY_DOMAIN (so TTYHost -> GATEWAY_TTY_DOMAIN,
fallback GATEWAY_DOMAIN) and ensure the wording for
TTYEntryPoint/TTYCertResolver remains consistent; edit the comment block around
the Traefik defaults (the lines describing SelfRegister, RootKey, MTLSHost,
MTLSEntryPoint, TTYHost, TTYEntryPoint, TTYCertResolver) to reflect this updated
fallback behavior.
cmd/gateway/main.go (1)

198-220: ⚠️ Potential issue | 🟡 Minor

Reject port-only URL hosts before enabling redirects or terminal registration.

When a template like https://${UNSET_VAR}:8443/... has an unset env var, it collapses to https://:8443/.... Go's URL parser accepts this, setting u.Host = ":8443" (non-empty) but u.Hostname() = "" (empty). The current hostFromURL check at line 706 (u.Host == "") passes, returning :8443 as the host. This value is then used for bootstrap redirects and terminal registry registration—both invalid since there is no hostname.

Update hostFromURL to also reject the case where hostname is empty:

Suggested fix
-	if u.Host == "" {
+	if u.Host == "" || u.Hostname() == "" {
 		return ""
 	}
 	return u.Host
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/gateway/main.go` around lines 198 - 220, The hostFromURL helper currently
treats a value like ":8443" as a valid host because it only checks u.Host != "",
which lets port-only hosts through and enables invalid bootstrap
redirects/terminal registration; update hostFromURL to also return "" when
u.Hostname() is empty (i.e., treat u.Hostname() == "" as invalid), so calls in
main.go that rely on hostFromURL (the agent URL and terminal URL checks) will
skip enabling redirects/registration when the template resolves to a port-only
or hostname-less URL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/control/main.go`:
- Around line 94-108: The startup info log currently emits cfg.GatewayURL
verbatim via logger.Info("starting control server", ..., "gateway_url",
cfg.GatewayURL); change that to log a redacted value by calling
api.RedactGatewayURL(cfg.GatewayURL) so sensitive userinfo doesn't appear in
logs (i.e., replace the "gateway_url", cfg.GatewayURL argument with
"gateway_url", api.RedactGatewayURL(cfg.GatewayURL) in the startup logger.Info
call).

In `@internal/config/config.go`:
- Around line 243-247: The current collision check only triggers when
WebListenAddr is set, missing cases where an explicit TraefikTTYBackend is
provided and it also rejects valid setups where MTLS and TTY use different
entrypoints; update the condition in internal/config/config.go to require
c.TraefikSelfRegister, non-empty c.TraefikTTYHost and c.TraefikMTLSHost,
equality of those hosts, and then treat it as a collision only when either
c.WebListenAddr is non-empty or c.TraefikTTYBackend is non-empty AND the
entrypoints are the same (i.e. c.TraefikTTYEntryPoint ==
c.TraefikMTLSEntryPoint); reference these symbols (c.TraefikSelfRegister,
c.WebListenAddr, c.TraefikTTYBackend, c.TraefikTTYHost, c.TraefikMTLSHost,
c.TraefikTTYEntryPoint, c.TraefikMTLSEntryPoint) to implement the new boolean
logic so explicit TTY backends without WebListenAddr trigger an error while
differing entrypoints do not.

---

Outside diff comments:
In `@cmd/gateway/main.go`:
- Around line 198-220: The hostFromURL helper currently treats a value like
":8443" as a valid host because it only checks u.Host != "", which lets
port-only hosts through and enables invalid bootstrap redirects/terminal
registration; update hostFromURL to also return "" when u.Hostname() is empty
(i.e., treat u.Hostname() == "" as invalid), so calls in main.go that rely on
hostFromURL (the agent URL and terminal URL checks) will skip enabling
redirects/registration when the template resolves to a port-only or
hostname-less URL.

In `@internal/config/config.go`:
- Around line 160-171: Update the Traefik defaults summary comment so it matches
the new fallback chain: change the TTYHost bullet (currently "falls back to
empty -> TTY router disabled") to indicate it now falls back to GATEWAY_DOMAIN
(so TTYHost -> GATEWAY_TTY_DOMAIN, fallback GATEWAY_DOMAIN) and ensure the
wording for TTYEntryPoint/TTYCertResolver remains consistent; edit the comment
block around the Traefik defaults (the lines describing SelfRegister, RootKey,
MTLSHost, MTLSEntryPoint, TTYHost, TTYEntryPoint, TTYCertResolver) to reflect
this updated fallback behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 374e4ec0-98ab-4818-8a00-f639cb2a7b58

📥 Commits

Reviewing files that changed from the base of the PR and between 93e4fa1 and 8c7aeda.

⛔ Files ignored due to path filters (1)
  • internal/store/generated/security_alerts.sql.go is excluded by !**/generated/**
📒 Files selected for processing (14)
  • cmd/control/main.go
  • cmd/gateway/main.go
  • deploy/.env.example
  • internal/api/auth_handler_test.go
  • internal/api/errors_parity_test.go
  • internal/api/internal_handler.go
  • internal/api/internal_handler_test.go
  • internal/api/registration_handler.go
  • internal/api/validate_gateway_url_test.go
  • internal/config/config.go
  • internal/config/config_test.go
  • internal/control/inbox_worker.go
  • internal/store/migrations/010_security_alerts_projection.sql
  • internal/store/queries/security_alerts.sql
✅ Files skipped from review due to trivial changes (2)
  • deploy/.env.example
  • internal/api/internal_handler.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • internal/api/validate_gateway_url_test.go
  • internal/api/errors_parity_test.go
  • internal/control/inbox_worker.go
  • internal/api/registration_handler.go
  • internal/api/auth_handler_test.go

Comment thread cmd/control/main.go Outdated
Comment thread internal/config/config.go
Address the four findings CodeRabbit posted on the 8c7aeda diff
(run 374e4ec0). All four are the same class of issue — "the code
now does X but some sibling code/comment/check didn't get the
memo" — which is exactly what happens when a previous review round
fixes one site but misses its twins. Sweeping the twins now.

1. internal/config/config.go (outside diff, line ~169)

   The Traefik defaults summary comment still said "TTYHost falls
   back to empty → TTY router disabled" while the actual code (and
   the matching comment I fixed in the previous push) now falls
   back through GATEWAY_TTY_DOMAIN to GATEWAY_DOMAIN, with
   Validate() rejecting the unsafe shared-host case. Rewrite the
   bullet so the summary matches the body.

2. cmd/gateway/main.go hostFromURL (outside diff, lines ~695-707)

   ValidateGatewayURL got tightened to u.Hostname() == "" in the
   previous round, catching shapes like "https://:8443". The
   sibling hostFromURL helper used for PublicAgentURLTemplate /
   PublicTerminalURLTemplate still only checked u.Host == "",
   which is non-empty for ":8443" — so a template like
   "https://${UNSET}:8443/..." collapsed into a port-only URL
   still passed hostFromURL, enabling invalid bootstrap redirects
   and terminal registry entries. Add the same u.Hostname()
   invariant here so both validators agree.

3. cmd/control/main.go startup log (inline, line ~94)

   The rc10 review round 2 redacted GatewayURL on the error path
   but left the startup Info log emitting cfg.GatewayURL verbatim.
   A bad shape that slipped past the validator (or an operator
   paste mistake that the validator rejects after the log line)
   would still dump any userinfo into every boot log. Wrap the
   startup log's gateway_url field in api.RedactGatewayURL too so
   the redaction is symmetric across both log sites.

4. internal/config/config.go Validate() (inline, lines ~243-252)

   The collision check triggered only on WebListenAddr != "",
   missing the case where the operator sets an explicit
   GATEWAY_TRAEFIK_TTY_BACKEND without WebListenAddr. It also
   flagged topologies where the mTLS and TTY routers bind
   DIFFERENT entrypoints — which don't actually collide because
   Traefik routes per-entrypoint before per-host, so a false
   positive would fail-fast a valid bring-your-own-Traefik split-
   port setup. Split the precondition into:
     terminalEnabled   = WebListenAddr || TraefikTTYBackend
     entrypointsCollide = TTYEntryPoint == MTLSEntryPoint (and non-empty)
   and only flag the host collision when all three preconditions
   line up. TestValidate_TTYMTLSHostCollision now covers two new
   cases (explicit TTYBackend triggers error; different entry-
   points skips the check) alongside the original four.

All targeted go tests green: validator suite, config collision
matrix, login bypass regression, terminal token single-use,
registration handler, error-code parity.

Part of #76 review round 3.
@PaulDotterer
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/gateway/main.go (1)

198-226: ⚠️ Potential issue | 🟠 Major

Don’t fall back to the terminal host after an invalid agent URL template.

Line 224 treats assignedHost == "" as “agent template absent”, but it is also true when GATEWAY_PUBLIC_AGENT_URL_TEMPLATE was explicitly set and resolved to no host. That can re-enable bootstrap redirects to the TTY host instead of disabling the middleware as intended.

🐛 Proposed fix
-	if cfg.PublicAgentURLTemplate != "" {
+	agentURLTemplateConfigured := cfg.PublicAgentURLTemplate != ""
+	if agentURLTemplateConfigured {
 		agentURL := strings.ReplaceAll(cfg.PublicAgentURLTemplate, "{id}", gatewayID)
 		assignedHost = hostFromURL(agentURL)
 		if assignedHost == "" {
 			logger.Warn("GATEWAY_PUBLIC_AGENT_URL_TEMPLATE resolved to a URL with no host — bootstrap redirects disabled; check for unset env vars in the template",
 				"template", cfg.PublicAgentURLTemplate, "resolved", agentURL)
@@
-			if assignedHost == "" {
+			if assignedHost == "" && !agentURLTemplateConfigured {
 				assignedHost = terminalHost
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/gateway/main.go` around lines 198 - 226, The code currently treats
assignedHost == "" as “no agent template provided”, but that conflates an absent
template with an explicitly-set-but-malformed one and allows a terminal host
fallback when it should be disabled; fix by tracking whether an agent template
was provided/resolved separately (e.g. introduce a boolean like
agentTemplateProvided or agentTemplateResolvedOK), set it when
cfg.PublicAgentURLTemplate != "" and only set assignedHost when
hostFromURL(agentURL) returns a non-empty host, then change the terminal
fallback to only assign terminalHost when the agent template was not provided
(agentTemplateProvided == false) rather than when assignedHost == "" so
malformed agent templates don’t enable the fallback; reference the existing
symbols cfg.PublicAgentURLTemplate, hostFromURL, assignedHost,
cfg.PublicTerminalURLTemplate and terminalHost.
🧹 Nitpick comments (1)
cmd/gateway/main.go (1)

316-350: Bound Redis publish and refresh calls with shutdown-aware timeouts.

These fail-open paths still use context.Background(), so a slow Redis operation can ignore shutdown and delay cleanup. Match the bounded heartbeat pattern already used by registry.RegisterGateway.

♻️ Proposed fix
-			if err := gatewayReg.PublishTraefikRoute(
-				context.Background(), gatewayID, traefikCfg, registry.DefaultGatewayTTL,
-			); err != nil {
+			publishCtx, cancelPublish := context.WithTimeout(shutdownCtx, 5*time.Second)
+			err := gatewayReg.PublishTraefikRoute(
+				publishCtx, gatewayID, traefikCfg, registry.DefaultGatewayTTL,
+			)
+			cancelPublish()
+			if err != nil {
@@
 					select {
 					case <-ticker.C:
-						if err := gatewayReg.PublishTraefikRoute(
-							context.Background(), gatewayID, traefikCfg, registry.DefaultGatewayTTL,
-						); err != nil {
+						refreshCtx, cancelRefresh := context.WithTimeout(traefikRefreshCtx, 5*time.Second)
+						err := gatewayReg.PublishTraefikRoute(
+							refreshCtx, gatewayID, traefikCfg, registry.DefaultGatewayTTL,
+						)
+						cancelRefresh()
+						if err != nil {
 							logger.Warn("failed to refresh Traefik routing config", "error", err)
 						}
@@
 		if internalURL != "" {
-			if err := gatewayReg.RegisterGatewayInternal(
-				context.Background(), gatewayID, internalURL, registry.DefaultGatewayTTL,
-			); err != nil {
+			registerCtx, cancelRegister := context.WithTimeout(shutdownCtx, 5*time.Second)
+			err := gatewayReg.RegisterGatewayInternal(
+				registerCtx, gatewayID, internalURL, registry.DefaultGatewayTTL,
+			)
+			cancelRegister()
+			if err != nil {
 				logger.Warn("failed to register gateway internal URL", "error", err)
 			}
@@
 					select {
 					case <-ticker.C:
-						if err := gatewayReg.RegisterGatewayInternal(
-							context.Background(), gatewayID, internalURL, registry.DefaultGatewayTTL,
-						); err != nil {
+						refreshCtx, cancelRefresh := context.WithTimeout(internalRefreshCtx, 5*time.Second)
+						err := gatewayReg.RegisterGatewayInternal(
+							refreshCtx, gatewayID, internalURL, registry.DefaultGatewayTTL,
+						)
+						cancelRefresh()
+						if err != nil {
 							logger.Warn("failed to refresh gateway internal URL", "error", err)
 						}

Also applies to: 403-425

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/gateway/main.go` around lines 316 - 350, The PublishTraefikRoute calls
use context.Background() and can block shutdown; change them to use a
shutdown-aware bounded context derived from shutdownCtx (e.g., create ctx,
cancel := context.WithTimeout(shutdownCtx,
registry.DefaultGatewayRefreshInterval/2) or the same heartbeat timeout used in
registry.RegisterGateway) and pass that ctx into gatewayReg.PublishTraefikRoute
(both the initial publish and the periodic refresh inside the goroutine),
ensuring you call cancel() after each call so Redis ops respect shutdown and the
goroutine exits promptly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@cmd/gateway/main.go`:
- Around line 198-226: The code currently treats assignedHost == "" as “no agent
template provided”, but that conflates an absent template with an
explicitly-set-but-malformed one and allows a terminal host fallback when it
should be disabled; fix by tracking whether an agent template was
provided/resolved separately (e.g. introduce a boolean like
agentTemplateProvided or agentTemplateResolvedOK), set it when
cfg.PublicAgentURLTemplate != "" and only set assignedHost when
hostFromURL(agentURL) returns a non-empty host, then change the terminal
fallback to only assign terminalHost when the agent template was not provided
(agentTemplateProvided == false) rather than when assignedHost == "" so
malformed agent templates don’t enable the fallback; reference the existing
symbols cfg.PublicAgentURLTemplate, hostFromURL, assignedHost,
cfg.PublicTerminalURLTemplate and terminalHost.

---

Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 316-350: The PublishTraefikRoute calls use context.Background()
and can block shutdown; change them to use a shutdown-aware bounded context
derived from shutdownCtx (e.g., create ctx, cancel :=
context.WithTimeout(shutdownCtx, registry.DefaultGatewayRefreshInterval/2) or
the same heartbeat timeout used in registry.RegisterGateway) and pass that ctx
into gatewayReg.PublishTraefikRoute (both the initial publish and the periodic
refresh inside the goroutine), ensuring you call cancel() after each call so
Redis ops respect shutdown and the goroutine exits promptly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 03d850c8-4d49-4d13-9941-c6c1d4ff77fe

📥 Commits

Reviewing files that changed from the base of the PR and between 8c7aeda and fb4ab1f.

📒 Files selected for processing (4)
  • cmd/control/main.go
  • cmd/gateway/main.go
  • internal/config/config.go
  • internal/config/config_test.go
✅ Files skipped from review due to trivial changes (1)
  • internal/config/config.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/config/config_test.go
  • cmd/control/main.go

…edis contexts

CodeRabbit round 4 on fb4ab1f (run 03d850c8) surfaced one major
correctness issue and one maintainability nitpick. Both fixed.

1. cmd/gateway/main.go lines 198-226 — major, outside diff

   The fallback at line 224 treated assignedHost == "" as "no agent
   template provided", which conflates two distinct states:

     (a) GATEWAY_PUBLIC_AGENT_URL_TEMPLATE unset entirely
     (b) GATEWAY_PUBLIC_AGENT_URL_TEMPLATE set but resolved to a
         URL with no host (e.g. "https://${UNSET_VAR}" collapses
         to "https://")

   Case (b) already logs a warning and disables bootstrap redirects.
   But the subsequent fallback `if assignedHost == "" { assignedHost
   = terminalHost }` couldn't distinguish (a) from (b), so a
   malformed agent template silently re-enabled bootstrap redirects
   to the TTY host — directly contradicting the intent of disabling
   them when the operator's explicit template couldn't resolve.

   Introduce agentURLTemplateConfigured, set it when
   cfg.PublicAgentURLTemplate != "", and gate the terminal-host
   fallback on `assignedHost == "" && !agentURLTemplateConfigured`.
   An operator who wrote a broken template now gets exactly what
   the warning promises: feature disabled, not silently redirected
   to the wrong host.

2. cmd/gateway/main.go lines 316-350, 403-425 — nitpick

   PublishTraefikRoute + RegisterGatewayInternal (both the initial
   call and the periodic refresh) used context.Background(), so a
   slow or hung Redis would ignore SIGTERM and delay graceful
   shutdown. The RevokeTraefikRoute defer already uses a 5s bound
   for exactly this reason.

   Bound both calls with shutdownCtx + 5s timeout (refresh calls
   derive from the refresh's own cancellable context rather than
   shutdownCtx directly, so the timeout cancels cleanly when the
   refresh goroutine stops). Matches the heartbeat pattern used
   inside registry.RegisterGateway.

Targeted tests all pass: config Validate matrix, parity guard,
registration validator, terminal token single-use, login bypass
regression.

Part of #76 review round 4.
@PaulDotterer
Copy link
Copy Markdown
Contributor Author

@codserabbitai review

@PaulDotterer
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/gateway/main.go (1)

406-491: Dead gatewayReg != nil guard — registry is now unconditionally initialized.

Since ensureGatewayRegistry() is called unconditionally at line 406 (as the comment there explicitly states: "if terminal + Traefik are both off, this is where the registry gets built"), gatewayReg cannot be nil when execution reaches line 489. The guard is dead and slightly misleading — a reader now has to re-derive the invariant to convince themselves SetGatewayRouting always runs.

♻️ Proposed simplification
-	if gatewayReg != nil {
-		agentHandler.SetGatewayRouting(gatewayReg, gatewayID)
-	}
+	agentHandler.SetGatewayRouting(gatewayReg, gatewayID)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/gateway/main.go` around lines 406 - 491, The nil-check around gatewayReg
is dead because ensureGatewayRegistry() always returns a registry; remove the
conditional and call agentHandler.SetGatewayRouting(gatewayReg, gatewayID)
unconditionally (locate the ensureGatewayRegistry() call and the agentHandler
variable near the agent handler setup and replace the if gatewayReg != nil { ...
} block with a direct call), keeping behavior identical but avoiding misleading
guards.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/gateway/main.go`:
- Around line 241-264: The initial call to gatewayReg.RegisterGateway uses
context.Background() and can hang; change it to use a context derived from
shutdownCtx with a 5s timeout (e.g., ctx, cancel :=
context.WithTimeout(shutdownCtx, 5*time.Second); defer cancel()) when calling
gatewayReg.RegisterGateway so the startup register is bounded like
PublishTraefikRoute and RegisterGatewayInternal; keep the existing error
handling and defer stop() semantics (call cancel immediately after
RegisterGateway returns if needed).

---

Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 406-491: The nil-check around gatewayReg is dead because
ensureGatewayRegistry() always returns a registry; remove the conditional and
call agentHandler.SetGatewayRouting(gatewayReg, gatewayID) unconditionally
(locate the ensureGatewayRegistry() call and the agentHandler variable near the
agent handler setup and replace the if gatewayReg != nil { ... } block with a
direct call), keeping behavior identical but avoiding misleading guards.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1d41b4b7-c137-41c5-bf12-da9601e7949b

📥 Commits

Reviewing files that changed from the base of the PR and between fb4ab1f and 7f4179a.

📒 Files selected for processing (1)
  • cmd/gateway/main.go

Comment thread cmd/gateway/main.go
…eg nil guard

CodeRabbit round 5 on 7f4179a (run 1d41b4b7) posted one actionable
and one nitpick, both completing patterns the previous rounds
established.

1. cmd/gateway/main.go line 242 — actionable

   RegisterGateway's initial call still used context.Background()
   while its round-4 siblings PublishTraefikRoute and
   RegisterGatewayInternal now use shutdownCtx + 5s. Match the
   pattern here too — a slow/hung Valkey at startup otherwise
   ignores SIGTERM on the registry register path. The stop()
   return value keeps its own refresh-goroutine cancellation
   wiring, so the bound only affects the one-shot initial call.

2. cmd/gateway/main.go line 499 — nitpick

   After the round-3 refactor moved ensureGatewayRegistry() to
   fire unconditionally above the internal-URL block,
   gatewayReg cannot be nil by the time we reach
   SetGatewayRouting. The `if gatewayReg != nil` guard is dead
   and misleading — a reader has to re-derive the invariant.
   Drop the guard, leave a one-line comment recording why it's
   safe to call unconditionally.

Targeted go tests unchanged and still green (no test surface
touched this round).

Part of #76 review round 5.
@PaulDotterer
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cmd/gateway/main.go (1)

322-404: Skip-on-empty-backend guard and bounded publish/refresh/revoke all look right.

Gating publication on a non-empty mtlsBackend (L325) prevents poisoning the shared pm-mtls TCP router with an unusable entry, which matches internal/gateway/registry/traefik.go's PublishTraefikRoute semantics — it writes all keys under the same TTL, so a bad initial publish would be observable to every other replica. The refresh goroutine deriving from shutdownCtx (L363) and the revoke using a fresh context.Background()+5s (L389) correctly handles the "shutdownCtx already done" case at teardown.

One very small nit: the logger.Info("traefik self-registration enabled", …) at L396 fires even when the initial publish warned at L348-358, which could mislead an operator reading logs after a transient Valkey failure — the refresh loop will eventually reconcile it, so "enabled" is functionally correct, but something like an extra "initial_publish_ok": err == nil field would make the log self-diagnosing. Optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/gateway/main.go` around lines 322 - 404, The final logger.Info call logs
"traefik self-registration enabled" even if the initial PublishTraefikRoute
returned an error; include a diagnostic field so the log shows whether the
initial publish succeeded by adding something like "initial_publish_ok", err ==
nil to the logger.Info invocation (referencing the err from the initial publish
block and the logger.Info call).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 322-404: The final logger.Info call logs "traefik
self-registration enabled" even if the initial PublishTraefikRoute returned an
error; include a diagnostic field so the log shows whether the initial publish
succeeded by adding something like "initial_publish_ok", err == nil to the
logger.Info invocation (referencing the err from the initial publish block and
the logger.Info call).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b2171b1c-3992-48a7-aeaa-828c62552edc

📥 Commits

Reviewing files that changed from the base of the PR and between 7f4179a and e8294f6.

📒 Files selected for processing (1)
  • cmd/gateway/main.go

… status in log

CR review on e8294f6 posted one low-severity doc mismatch and one
nitpick. Practicing the sibling-sweep discipline from the pre-push
memory: grep for every IMAGE_TAG reference in the repo before
fixing, and grep for every "feature enabled" log site before
touching the one CR flagged.

1. deploy/.env.example IMAGE_TAG contract — CR flagged

   Comment said "default: latest" with a commented-out
   `IMAGE_TAG=latest`. compose.yml's `${IMAGE_TAG:?...}` enforces
   the opposite: operators who don't set IMAGE_TAG get a hard
   startup refusal, which is the intent. Rewrite the doc to match
   reality: IMAGE_TAG is required, uncommented by default with a
   sensible placeholder (v2026.05), with a comment explaining why
   the compose file enforces it.

2. deploy/deploy.sh IMAGE_TAG fallback — sibling sweep caught this

   Not in CR's findings but the same contract. deploy.sh reads
   IMAGE_TAG from the remote .env and falls back to "latest" when
   missing:
       IMAGE_TAG=$(grep -oP '(?<=^IMAGE_TAG=).*' .env || echo latest)
   But compose.yml refuses to start without IMAGE_TAG explicitly
   set. So the fallback retags images to :latest, then compose
   rejects them with its own less-specific error. Fail here
   instead with a pointer at .env.example so the operator's
   actual fix is visible in the first error.

3. cmd/gateway/main.go "traefik self-registration enabled" log —
   CR nit

   After a failed initial PublishTraefikRoute the Warn line above
   still emits. The later Info line says "enabled" unconditionally,
   which is true (refresh goroutine will retry) but slightly muddy
   — an operator skimming a Valkey-wobble startup can't tell
   "registered now" from "will retry in 45s" without
   cross-referencing the Warn. Add `initial_publish_ok` to the
   structured fields so the state is explicit.

   Sibling check: the only other "feature enabled" log is the
   "multi-gateway routing enabled" line at ~269, which is already
   correctly scoped inside the else branch of its register-error
   check — it only fires when RegisterGateway succeeded. No
   action needed there.

Part of #76 review round 6.
Reverts the "IMAGE_TAG required" part of the rc10 review round 6
commit (69abe32). That change was based on a wrong assumption about
the release-engineering workflow: I treated ${IMAGE_TAG:-latest} as a
silent-rolling-forward footgun, but the CI tag contract here is
deliberate:

  :latest     — latest stable release (CI promotes only after a
                stable version is cut; does not advance on every rc)
  :latest-rc  — latest pre-release / release candidate
  vYYYY.MM    — pinned-release specific tag

So :latest is a curated pointer, not an accidental one. Requiring
explicit pinning broke quickstart (docker compose up on a fresh
checkout) and CI-reference-stack workflows for zero real gain — the
supply-chain concern I was mitigating doesn't apply to a promoted-by-
CI floating tag.

Restores:
  * compose.yml: ${IMAGE_TAG:-latest} on all three image lines
  * .env.example: commented-out IMAGE_TAG=latest default, with a
    comment block that documents all three tag shapes (latest,
    latest-rc, vYYYY.MM) so operators pick the right one for their
    stance on rollbacks
  * deploy.sh: |-  || echo latest fallback restored; matches the
    compose default rather than failing hard on unset IMAGE_TAG

Memory saved under project_image_tag_contract.md so this doesn't
get re-litigated in a future review round.

The Traefik "initial_publish_ok" log field from 69abe32 stays —
that's an unrelated nit fix on its own merits.
@PaulDotterer PaulDotterer merged commit 88cd900 into main Apr 23, 2026
4 checks passed
@PaulDotterer PaulDotterer deleted the fix/rc10-consolidation branch April 23, 2026 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: gateway Traefik backend auto-derivation publishes unresolvable container ID

1 participant