fix: rc10 consolidation — auth bypass, fail-open gateway, deploy hardening, projections, data integrity#76
Conversation
…e-IP / fail-open WIP Security fix: `CONTROL_PASSWORD_AUTH_ENABLED=false` previously only gated the SSO handler's ListAuthMethods response, which is a UI hint for the login form. The Login RPC itself never checked the flag, so an attacker POSTing directly to /pm.v1.ControlService/Login with valid email+password still authenticated against any account whose HasPassword column was true. This is the bypass a penetration test would find immediately; SSO-only deployments were not SSO-only in the enforced sense. Plumb passwordAuthEnabled into AuthHandler and enforce it BEFORE the DB lookup. Burn the dummy bcrypt cycle on the disabled path too so the "is password auth disabled?" question doesn't leak via timing either. Regression test: TestLogin_GlobalPasswordAuthDisabled covers the exact bypass path (valid credentials, global switch off, expect CodeUnauthenticated + password_login_disabled). Also folds in the pre-rc10 WIP from the staging debugging session: - routableIP() helper replacing os.Hostname() for Traefik backend auto-derivation (container IDs are not in Docker's embedded DNS). - Fail-open on a malformed GATEWAY_PUBLIC_TERMINAL_URL_TEMPLATE so a broken optional template disables terminal registration with a warning instead of crash-looping the gateway. - GATEWAY_TRAEFIK_* defaults collapsed into the binary so compose.yml only needs GATEWAY_DOMAIN / GATEWAY_TTY_DOMAIN in the common case. Remaining rc10 scope continues in subsequent commits on this branch. Part of #74.
…KS stream-id coherence, deploy hardening
Gateway startup: convert five more os.Exit paths to Warn+feature-
disabled so a broken optional knob no longer takes agent mTLS down
with it.
- registry.RegisterGateway failure → terminal disabled for this
replica, agent mTLS keeps serving.
- routableIP() failure for GATEWAY_INTERNAL_URL → admin fan-out
disabled, mTLS keeps serving.
- routableIP() failure for Traefik backend auto-derive → skip
publish entirely rather than poison Redis KV with an empty
backend; operator can still set GATEWAY_TRAEFIK_{MTLS,TTY}_BACKEND
explicitly.
- PublishTraefikRoute failure at startup → rely on the refresh
goroutine's normal retry cadence; the previous crash loop caused
the pm-mtls TCP route to expire during restart backoff and fall
through to the HTTP router (wrong cert).
- BootstrapHost without a derivable assignedHost → disable the
bootstrap middleware instead of crashing.
The remaining os.Exit paths in gateway startup (TLS cert / key / CA
flag missing, ValkeyAddr / ControlURL missing, cert parse, tls.Config
build, HTTP/2 server configure) are truly fatal and stay fatal — the
gateway cannot serve its primary mTLS role without them.
Terminal tokens: enforce single-use via Valkey GETDEL (go-redis
GetDel) in the new SessionBackend.GetAndDelete primitive. Validate()
now atomically consumes the token on success, so a leaked bearer
token cannot replay a second WebSocket within the TTL. Forgery
attempts restore the unused entry with the remaining TTL so a
legitimate client is not locked out by a guess. Regression tests
TestTokenStore_Validate_IsSingleUse and
TestTokenStore_Validate_MismatchPreservesSession cover both paths.
LUKS revocation three-phase audit: the inbox worker now looks up
the luks_key stream ID minted at request time (Requested /
Dispatched phases in api/device_handler.go) via a new sqlc query
GetLuksRevocationStreamID, so the async Revoked / Failed event
lands on the SAME stream as the earlier phases. Previously the
worker generated a fresh ULID, splitting every revocation across
two streams and breaking the projection's stitch. Falls back to a
fresh ULID only if the Requested event can't be found — an orphan
Failed event is still better than silently dropping the agent's
report.
Config coherence:
- GATEWAY_TTY host fallback now goes GATEWAY_TRAEFIK_TTY_HOST →
GATEWAY_TTY_DOMAIN → GATEWAY_DOMAIN. .env.example has always
documented the third rung; config.go finally implements it.
- Empty CONTROL_GATEWAY_URL is fatal at startup (was only a warn),
and the registration handler refuses to hand an empty URL back
even if the startup guard ever regresses. Defence in depth on a
path that was otherwise silently broken.
- deploy/compose.yml: GATEWAY_VALKEY_DB passthrough added (the
code reads it; without this the env var was dead).
- IMAGE_TAG now required (fail with clear message if unset) rather
than silently defaulting to :latest, which was causing
unintended upgrades on every `docker compose pull`.
- redis-stack-server pinned to 7.4.0-v0 so upstream changes don't
silently land.
- initdb.d/01-indexer-user.sh now hard-fails on unset
INDEXER_POSTGRES_PASSWORD instead of returning 0, so the
misconfiguration is visible at init time rather than hours later
at indexer first-connect.
Auth: TestLogin_GlobalPasswordAuthDisabled regression test (from
prior commit) covers the rc10 bypass fix on the password-auth
disable path.
Part of #74.
…uard SecurityAlert projection: new migration 010 installs a security_alerts_projection table plus a sidecar AFTER INSERT trigger that routes SecurityAlert / SecurityAlertAcknowledged events into it. Previously the inbox worker appended the event to the device stream but no projection handled it, so the UI had no way to list alerts without scanning the append-only events log. Using a sidecar trigger (instead of rewriting project_event()) keeps the change additive and avoids the risk of dropping an existing case arm from the central dispatcher. OutputChunk events deliberately kept projection-less: they already have a dedicated retrieval path via LoadOutputChunks which scans events by (stream_id, event_type='OutputChunk') in order. That's the intended access pattern for streaming output — a projection table would just duplicate the events table. Docs: fix the stale :8080 example in cmd/control/README.md (compose routes via :443 with TCP passthrough; port was a leftover from before the SNI collapse). Correct the server README claim that GATEWAY_WEB_LISTEN_ADDR is a TLS listener — it's cleartext HTTP behind Traefik, which terminates TLS upstream. Error-code parity guard: new unit test TestErrorCodeParityWithTSSDK in internal/api walks errors.go and the sdk's ts/errors.ts via filesystem and asserts the two code sets match. Skips gracefully when run outside the multi-repo workspace (so `go test ./...` from a read-only mirror still passes). This is the test that would have caught the 4-code drift the rc10 audit found. Part of #74.
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 39 minutes and 33 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughTightened startup validation for gateway and control; added gateway config validation and safer Traefik self-registration; terminal session tokens are now single-use via atomic GetAndDelete; password-auth can be globally disabled; security-alerts projection and TypeScript error-code parity test added. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant GatewayAPI as Internal API
participant SessionStore as Terminal Store
participant Backend as SessionBackend
rect rgba(150,200,100,0.5)
Note over Client,Backend: Atomic consume-on-success flow
Client->>GatewayAPI: ProxyValidateTerminalToken(sessionID, bearer)
GatewayAPI->>SessionStore: Validate(sessionID, bearer)
SessionStore->>Backend: GetAndDelete(sessionID) (atomic)
Backend->>Backend: retrieve key and delete
alt payload present
SessionStore->>SessionStore: parse payload, compare hash
alt match
SessionStore-->>GatewayAPI: session metadata (success)
Note right of Backend: token consumed (single-use)
else mismatch
SessionStore->>Backend: Restore(sessionID, payload) with remaining TTL
SessionStore-->>GatewayAPI: ErrTokenMismatch (auth failure)
end
else missing/expired
SessionStore-->>GatewayAPI: ErrTokenNotFound / Unauthenticated
end
GatewayAPI-->>Client: success or Unauthenticated
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…rityAlert projection, parity-test CI hook Address review findings on #76 before merge: URL validation was too weak. The previous check was url.Parse + empty string guard, which lets obviously-invalid shapes through: bare hostnames like "gateway.example.com" parse without error but aren't absolute URLs; http:// downgrades silently; wss:// confuses the streaming client; userinfo in the URL leaks on every enrollment response. Introduce api.ValidateGatewayURL that requires scheme=https, non-empty host, no userinfo, no fragment. Wire it into cmd/control/main.go (fatal on violation) and into NewRegistrationHandler (panic on violation — programmer error by the time we reach the constructor) and Register (re-validation for defence in depth). New TestValidateGatewayURL covers all the footgun shapes the review flagged. Existing registration_handler_test.go fixtures used wss://gateway.test:443 which the new validator correctly rejects — updated to https://gateway.test:8080 to match the production CONTROL_GATEWAY_URL shape. SecurityAlert projection idempotency. Migration 010 was using a fresh UUID as the primary key, so event replay (backfill, rebuild, trigger re-fire) would duplicate alerts. Key the projection by the originating event_id (REFERENCES events(id)) and use INSERT ... ON CONFLICT (event_id) DO NOTHING in the projector. This is the event-sourcing idempotency pattern that keeps projections deterministic across replays. SecurityAlertAcknowledged payloads carry alert_id = the event_id of the originating SecurityAlert, so the UPDATE path is unchanged in spirit but keyed by the stable id. SecurityAlert is explicitly Phase 1 — the migration + sidecar trigger + sqlc queries (ListSecurityAlertsForDevice, ListUnacknowledgedSecurityAlerts, GetSecurityAlert, CountUnacknowledgedSecurityAlerts) make the projection queryable from Go. The RPC surface and web UI wiring remain Phase 2 because they need a proto change (ControlService.ListSecurityAlerts), which is better in its own PR to keep the scope tight. LUKS stream-id correctness assumption. The inbox worker looks up the luks_key stream ID by (device_id, action_id) and picks the latest matching Requested/Dispatched event via ORDER BY sequence_num DESC LIMIT 1. Documented the assumption that there is at most one outstanding revocation per (device, action) at a time, and what the behavior is when the assumption is violated (worst case: the latest request gets the terminal event, older abandoned streams miss their terminal — not a projection correctness issue, but worth flagging for audit export). Parity test CI hook. TestErrorCodeParityWithTSSDK previously silently skipped when the sibling SDK repo wasn't present, so standalone server CI couldn't rely on it. Now reads PM_SDK_TS_ERRORS env var as the primary source (CI sets this when it checks out the SDK beside server), falls back to the local dev-workspace path, and loudly fails when PM_SDK_PARITY_REQUIRED=1 is set but the file still isn't reachable. Skips with a clear log line otherwise so a plain `go test ./...` in a standalone server checkout still passes. Part of #76 review feedback.
|
@coderabbitai Full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (6)
internal/api/auth_handler_test.go (1)
77-91: Regression test correctly covers the bypass; consider also asserting the error code string.The test validates the critical behavior: valid credentials +
passwordAuthEnabled=false→CodeUnauthenticated. As an optional hardening, you could also assert the returned error carriesErrPasswordLoginDisabled(rather thanErrInvalidCredentials) so a future refactor that collapses the branch into the generic "invalid credentials" path would fail this test. Existing tests likeTestLogin_PasswordDisabledByProvideralready useassert.Contains(t, err.Error(), "password login is disabled")— the same check here would keep the error-surface contract explicit.🧪 Optional tightening
require.Error(t, err, "login must fail when global password auth is disabled") assert.Equal(t, connect.CodeUnauthenticated, connect.CodeOf(err)) + assert.Contains(t, err.Error(), "password login is disabled")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/auth_handler_test.go` around lines 77 - 91, The test TestLogin_GlobalPasswordAuthDisabled currently only checks the connect error code; tighten it by also asserting the returned error string indicates the specific disabled-password case (match existing pattern used in TestLogin_PasswordDisabledByProvider) — e.g., assert that err.Error() contains "password login is disabled" or otherwise contains the sentinel ErrPasswordLoginDisabled text so the test fails if the handler collapses this branch into a generic "invalid credentials" path; update the assertion in TestLogin_GlobalPasswordAuthDisabled accordingly.internal/store/queries/security_alerts.sql (2)
36-37: Avoid narrowingCOUNT(*)toint.
COUNT(*)in PostgreSQL returnsbigint; explicitly casting to::intsilently truncates at ~2.1B and also propagates anint32return type into the generated Go code, which is inconsistent withbuildNextPageTokenthat works withint64totals (seeinternal/api/helpers.go:77-83). Prefer returningbigint.♻️ Proposed change
-- name: CountUnacknowledgedSecurityAlerts :one -SELECT COUNT(*)::int FROM security_alerts_projection WHERE NOT acknowledged; +SELECT COUNT(*)::bigint FROM security_alerts_projection WHERE NOT acknowledged;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/store/queries/security_alerts.sql` around lines 36 - 37, The query CountUnacknowledgedSecurityAlerts currently casts COUNT(*) to ::int which truncates large counts and causes a mismatch with functions expecting int64; update the SQL in CountUnacknowledgedSecurityAlerts to return bigint (remove the ::int cast) so the result remains BIGINT and aligns with buildNextPageToken's int64 handling; ensure any generated Go types or usages of the security_alerts_projection count expect an int64 accordingly.
11-28: Missing companionCountSecurityAlertsForDevicefor paginated responses.
ListSecurityAlertsForDevicereturns a page but there is no count query scoped todevice_id(with the sameinclude_acknowledgedfilter).buildNextPageTokenneeds atotalCountto decide whether to emit a next token (internal/api/helpers.go:77-83), so the forthcomingListSecurityAlertshandler will either have to fabricate one or over-emit tokens. Consider adding a matching:onecount now while the schema is fresh in memory.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/store/queries/security_alerts.sql` around lines 11 - 28, Add a companion count query named CountSecurityAlertsForDevice (:one) that returns the total number of alerts scoped to a device_id and using the same include_acknowledged filter logic as ListSecurityAlertsForDevice (i.e., WHERE device_id = $1 AND (sqlc.arg(include_acknowledged)::bool OR NOT acknowledged)); this count will be used by buildNextPageToken/internal/api/helpers.go to compute totalCount for pagination so the ListSecurityAlerts handler can emit correct next-page tokens.internal/store/migrations/010_security_alerts_projection.sql (1)
67-75: Silent no-op when acknowledging an unknownalert_id.If a
SecurityAlertAcknowledgedevent carries analert_idthat does not match any row (e.g., ack arrives before the alert projection due to out-of-order replay, or an operator ack'd an alert that was garbage-collected), theUPDATEaffects 0 rows and the trigger completes successfully, so there is no trace inprojection_errors. Consider checkingFOUNDafter the UPDATE and raising so the error lands inprojection_errorsfor ops visibility.♻️ Proposed change
WHEN 'SecurityAlertAcknowledged' THEN UPDATE security_alerts_projection SET acknowledged = TRUE, acknowledged_at = event.occurred_at, acknowledged_by = event.data->>'acknowledged_by' - WHERE event_id::TEXT = event.data->>'alert_id'; + WHERE event_id = (event.data->>'alert_id')::uuid; + IF NOT FOUND THEN + RAISE EXCEPTION 'SecurityAlertAcknowledged references unknown alert_id=%', event.data->>'alert_id'; + END IF;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/store/migrations/010_security_alerts_projection.sql` around lines 67 - 75, The UPDATE for the 'SecurityAlertAcknowledged' CASE can silently affect 0 rows; after updating security_alerts_projection using event.data->>'alert_id' check FOUND and if false raise an error (include the alert_id and event_id in the message) so the trigger surfaces the problem into projection_errors; modify the CASE branch for SecurityAlertAcknowledged to perform the UPDATE then IF NOT FOUND THEN RAISE EXCEPTION ... END IF to ensure missing-ack targets are recorded.internal/api/errors_parity_test.go (1)
143-148: Dead sanity guard — remove.
strings.Count("Err", "E")is a compile-time-constant expression over string literals; it cannot guard against "string-builder typos" in the tests since the tests don't usestrings.Countor assemble these literals dynamically. It adds noise without value.♻️ Proposed removal
-// Guard against string-builder typos in the tests above. -func init() { - if strings.Count("Err", "E") != 1 { - panic("sanity") - } -}If
stringsbecomes unused after this, also drop the import.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/errors_parity_test.go` around lines 143 - 148, Remove the dead "sanity guard" init() block that calls strings.Count("Err", "E") — it's a compile-time constant check and adds no value; delete the entire init() function (the strings.Count call and panic) and, if that makes the strings import unused, remove the strings import as well. Locate the init() function with the strings.Count("Err", "E") call and cleanly delete it and any resultant unused-import references.cmd/gateway/main.go (1)
253-259: Redundant nil check —ensureGatewayRegistryis already idempotent.The closure at L154-166 returns early when
gatewayReg != nil, so the outer guard here just adds a layer. Drop it and callensureGatewayRegistry()unconditionally to keep the intent obvious.- if gatewayReg == nil { - // Need a Valkey-backed registry even when terminal URLs are - // disabled; Traefik self-registration is the agent mTLS - // routing layer. - gatewayReg = ensureGatewayRegistry() - } + // Need a Valkey-backed registry even when terminal URLs are + // disabled; Traefik self-registration is the agent mTLS + // routing layer. + gatewayReg = ensureGatewayRegistry()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/gateway/main.go` around lines 253 - 259, The nil check around gatewayReg is redundant because ensureGatewayRegistry() is idempotent; when cfg.TraefikSelfRegister is true, call ensureGatewayRegistry() unconditionally. Remove the inner if gatewayReg == nil block and invoke ensureGatewayRegistry() directly in the cfg.TraefikSelfRegister branch so the code simply calls ensureGatewayRegistry() when TraefikSelfRegister is enabled, relying on ensureGatewayRegistry() to no-op if gatewayReg is already set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/gateway/main.go`:
- Around line 366-414: The internal URL registration is skipped when gatewayReg
== nil, making admin fan-out unavailable even if GATEWAY_INTERNAL_URL is
configured; call ensureGatewayRegistry() to create or obtain a registry before
using gatewayReg. Update the block that currently checks "if gatewayReg != nil"
to first call ensureGatewayRegistry(ctx, ...) (or the existing function name
ensureGatewayRegistry) to initialize gatewayReg when needed, then proceed to
compute internalURL (using cfg.InternalURL, routableIP(), portOfListenAddr) and
call gatewayReg.RegisterGatewayInternal/gatewayReg.RegisterGatewayInternal
refresh goroutine with gatewayID and registry.DefaultGatewayTTL; ensure errors
from ensureGatewayRegistry are logged and handled so we don't silently skip
registration.
In `@internal/api/errors_parity_test.go`:
- Around line 113-118: The TS extraction regex
export\s+const\s+Err\w+\s*=\s*'([a-z][a-z0-9_]*)' is too strict; update the
regex used in the TS extraction (the variable/match logic around re and matches
in the block) to accept single, double, or backtick quotes and optionally allow
a TypeScript type annotation (e.g., : string) before the = so it robustly
captures the error identifier; make the same quote-tolerance change to the
server-side regex used earlier (the regex at the other extraction block
referenced in the file, originally around line 56) so both sides are symmetric;
and remove the pointless init() sanity check function call/logic (the init()
sanity block that does strings.Count("Err", "E") != 1) to clean up dead code.
- Line 92: The fallback path in the candidates list uses too many parent
traversals; update the filepath.Join call that appends
"../../../../sdk/ts/errors.ts" (the candidates append) to use
"../../../sdk/ts/errors.ts" so it correctly resolves the SDK from internal/api/,
or add an additional fallback entry for "../../../sdk/ts/errors.ts" alongside
the existing one to cover the documented repo layout; modify the append(s) that
reference the SDK file in the candidates construction to include the corrected
relative path.
In `@internal/api/registration_handler.go`:
- Around line 78-79: The panic/log currently includes the raw gatewayURL (seen
in NewRegistrationHandler when calling ValidateGatewayURL and similarly at the
other checks around lines 102-105), which can leak userinfo; before including
gatewayURL in any panic or structured log, parse it with net/url and remove User
(userinfo) and/or replace the authority with a redacted form (e.g., host only or
host with "[REDACTED]")—implement a small helper like RedactGatewayURL(urlStr
string) string and call it wherever the code panics or logs the invalid gateway
URL (e.g., in the NewRegistrationHandler validation error paths) so only the
redacted URL appears in error output.
- Around line 51-53: Replace the current host emptiness check that uses u.Host
with u.Hostname() so malformed URLs like "https://:8443" (empty hostname with
only a port) are rejected; specifically update the check in
registration_handler.go to use u.Hostname() == "" and return the same error
message, and add the regression test case {"empty hostname with port",
"https://:8443", true, "host"} to internal/api/validate_gateway_url_test.go so
validateGatewayURL (or the test suite for Host validation) verifies this
scenario.
In `@internal/config/config.go`:
- Around line 196-206: The TraefikTTYHost fallback to GATEWAY_DOMAIN can collide
with MTLSHost when terminal sessions are enabled; update the config load to
detect this and either warn or fail fast: after computing TraefikTTYHost (via
firstNonEmpty) and MTLSHost and reading GATEWAY_WEB_LISTEN_ADDR and TTYBackend,
add a startup check that if GATEWAY_WEB_LISTEN_ADDR is non-empty (terminal
enabled) and TTYBackend is non-empty and TraefikTTYHost == MTLSHost then log a
clear error (or return a config validation error) explaining the SNI/Host
collision and refuse startup, otherwise emit a warning; also update the existing
fallback comment near TraefikTTYHost to note that the third rung is only safe
when an explicit GATEWAY_TTY_DOMAIN is provided when terminals are enabled.
In `@internal/control/inbox_worker.go`:
- Around line 562-573: The current lookup treats any error from
GetLuksRevocationStreamID as "not found" and creates a fresh ULID, which can
fork streams on transient DB/context errors; change the logic in the inbox
worker around GetLuksRevocationStreamID so that you only assign a new ulid to
luksStreamID when errors.Is(err, sql.ErrNoRows) (or your store's sentinel "no
rows" error), and for any other err return or propagate the error (so Asynq can
retry) — ensure GetLuksRevocationStreamID, luksStreamID and the surrounding
function return the non-ErrNoRows error and do not swallow context cancellation.
In `@internal/store/migrations/010_security_alerts_projection.sql`:
- Around line 67-72: The UPDATE uses a cast on event_id which prevents the PK
index from being used; change the predicate in the WHEN
'SecurityAlertAcknowledged' branch to cast the right-hand side value to UUID
(cast event.data->>'alert_id' to UUID) instead of casting event_id to TEXT so
the WHERE compares event_id (UUID) to a UUID value and leverages the primary-key
index on security_alerts_projection; this also ensures malformed alert_id values
surface as projection errors via the sidecar trigger rather than causing a
full-table scan that silently matches nothing.
---
Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 253-259: The nil check around gatewayReg is redundant because
ensureGatewayRegistry() is idempotent; when cfg.TraefikSelfRegister is true,
call ensureGatewayRegistry() unconditionally. Remove the inner if gatewayReg ==
nil block and invoke ensureGatewayRegistry() directly in the
cfg.TraefikSelfRegister branch so the code simply calls ensureGatewayRegistry()
when TraefikSelfRegister is enabled, relying on ensureGatewayRegistry() to no-op
if gatewayReg is already set.
In `@internal/api/auth_handler_test.go`:
- Around line 77-91: The test TestLogin_GlobalPasswordAuthDisabled currently
only checks the connect error code; tighten it by also asserting the returned
error string indicates the specific disabled-password case (match existing
pattern used in TestLogin_PasswordDisabledByProvider) — e.g., assert that
err.Error() contains "password login is disabled" or otherwise contains the
sentinel ErrPasswordLoginDisabled text so the test fails if the handler
collapses this branch into a generic "invalid credentials" path; update the
assertion in TestLogin_GlobalPasswordAuthDisabled accordingly.
In `@internal/api/errors_parity_test.go`:
- Around line 143-148: Remove the dead "sanity guard" init() block that calls
strings.Count("Err", "E") — it's a compile-time constant check and adds no
value; delete the entire init() function (the strings.Count call and panic) and,
if that makes the strings import unused, remove the strings import as well.
Locate the init() function with the strings.Count("Err", "E") call and cleanly
delete it and any resultant unused-import references.
In `@internal/store/migrations/010_security_alerts_projection.sql`:
- Around line 67-75: The UPDATE for the 'SecurityAlertAcknowledged' CASE can
silently affect 0 rows; after updating security_alerts_projection using
event.data->>'alert_id' check FOUND and if false raise an error (include the
alert_id and event_id in the message) so the trigger surfaces the problem into
projection_errors; modify the CASE branch for SecurityAlertAcknowledged to
perform the UPDATE then IF NOT FOUND THEN RAISE EXCEPTION ... END IF to ensure
missing-ack targets are recorded.
In `@internal/store/queries/security_alerts.sql`:
- Around line 36-37: The query CountUnacknowledgedSecurityAlerts currently casts
COUNT(*) to ::int which truncates large counts and causes a mismatch with
functions expecting int64; update the SQL in CountUnacknowledgedSecurityAlerts
to return bigint (remove the ::int cast) so the result remains BIGINT and aligns
with buildNextPageToken's int64 handling; ensure any generated Go types or
usages of the security_alerts_projection count expect an int64 accordingly.
- Around line 11-28: Add a companion count query named
CountSecurityAlertsForDevice (:one) that returns the total number of alerts
scoped to a device_id and using the same include_acknowledged filter logic as
ListSecurityAlertsForDevice (i.e., WHERE device_id = $1 AND
(sqlc.arg(include_acknowledged)::bool OR NOT acknowledged)); this count will be
used by buildNextPageToken/internal/api/helpers.go to compute totalCount for
pagination so the ListSecurityAlerts handler can emit correct next-page tokens.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 352d924a-9903-49b2-add3-7dcbf6f89c2f
⛔ Files ignored due to path filters (3)
internal/store/generated/luks.sql.gois excluded by!**/generated/**internal/store/generated/models.gois excluded by!**/generated/**internal/store/generated/security_alerts.sql.gois excluded by!**/generated/**
📒 Files selected for processing (22)
README.mdcmd/control/README.mdcmd/control/main.gocmd/gateway/main.godeploy/compose.ymldeploy/initdb.d/01-indexer-user.shinternal/api/auth_handler.gointernal/api/auth_handler_test.gointernal/api/errors_parity_test.gointernal/api/registration_handler.gointernal/api/registration_handler_test.gointernal/api/service.gointernal/api/validate_gateway_url_test.gointernal/config/config.gointernal/control/inbox_worker.gointernal/store/migrations/010_security_alerts_projection.sqlinternal/store/queries/luks.sqlinternal/store/queries/security_alerts.sqlinternal/terminal/fake_backend.gointernal/terminal/store.gointernal/terminal/store_test.gointernal/terminal/valkey_backend.go
| payload, err := s.backend.GetAndDelete(ctx, sessionID) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| var session Session | ||
| if err := json.Unmarshal(payload, &session); err != nil { | ||
| return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err) | ||
| } | ||
| if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 { | ||
| // Forgery attempt — restore the real session so the | ||
| // legitimate client isn't locked out. Compute the remaining | ||
| // TTL from ExpiresAt; if already expired we just drop it. | ||
| remaining := session.ExpiresAt.Sub(s.now()) | ||
| if remaining > 0 { | ||
| if restoreErr := s.backend.Set(ctx, sessionID, payload, remaining); restoreErr != nil { | ||
| // Log via caller — we don't have a logger here. Returning | ||
| // mismatch is the priority; the caller surfaces it as | ||
| // Unauthenticated and the audit pipeline flags it. | ||
| return nil, ErrTokenMismatch | ||
| } | ||
| } | ||
| return nil, ErrTokenMismatch |
There was a problem hiding this comment.
Avoid deleting the session before proving the bearer matches.
GetAndDelete removes the key before the token hash comparison. A forged request with a known session_id can briefly erase the real session; a legitimate concurrent Validate in that window gets ErrTokenNotFound, and a restore failure drops the session entirely.
Prefer a conditional consume: first read and compare, then atomically consume only after the bearer matches, or move compare-and-delete into the backend as one atomic operation.
🐛 Safer minimal shape using read-before-consume
func (s *TokenStore) Validate(ctx context.Context, sessionID, bearerToken string) (*Session, error) {
- payload, err := s.backend.GetAndDelete(ctx, sessionID)
+ payload, err := s.backend.Get(ctx, sessionID)
if err != nil {
return nil, err
}
var session Session
if err := json.Unmarshal(payload, &session); err != nil {
return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
}
if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
- // Forgery attempt — restore the real session so the
- // legitimate client isn't locked out. Compute the remaining
- // TTL from ExpiresAt; if already expired we just drop it.
- remaining := session.ExpiresAt.Sub(s.now())
- if remaining > 0 {
- if restoreErr := s.backend.Set(ctx, sessionID, payload, remaining); restoreErr != nil {
- // Log via caller — we don't have a logger here. Returning
- // mismatch is the priority; the caller surfaces it as
- // Unauthenticated and the audit pipeline flags it.
- return nil, ErrTokenMismatch
- }
- }
return nil, ErrTokenMismatch
}
+
+ payload, err = s.backend.GetAndDelete(ctx, sessionID)
+ if err != nil {
+ return nil, err
+ }
+ if err := json.Unmarshal(payload, &session); err != nil {
+ return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err)
+ }
+ if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 {
+ return nil, ErrTokenMismatch
+ }
return &session, nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| payload, err := s.backend.GetAndDelete(ctx, sessionID) | |
| if err != nil { | |
| return nil, err | |
| } | |
| var session Session | |
| if err := json.Unmarshal(payload, &session); err != nil { | |
| return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err) | |
| } | |
| if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 { | |
| // Forgery attempt — restore the real session so the | |
| // legitimate client isn't locked out. Compute the remaining | |
| // TTL from ExpiresAt; if already expired we just drop it. | |
| remaining := session.ExpiresAt.Sub(s.now()) | |
| if remaining > 0 { | |
| if restoreErr := s.backend.Set(ctx, sessionID, payload, remaining); restoreErr != nil { | |
| // Log via caller — we don't have a logger here. Returning | |
| // mismatch is the priority; the caller surfaces it as | |
| // Unauthenticated and the audit pipeline flags it. | |
| return nil, ErrTokenMismatch | |
| } | |
| } | |
| return nil, ErrTokenMismatch | |
| payload, err := s.backend.Get(ctx, sessionID) | |
| if err != nil { | |
| return nil, err | |
| } | |
| var session Session | |
| if err := json.Unmarshal(payload, &session); err != nil { | |
| return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err) | |
| } | |
| if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 { | |
| return nil, ErrTokenMismatch | |
| } | |
| payload, err = s.backend.GetAndDelete(ctx, sessionID) | |
| if err != nil { | |
| return nil, err | |
| } | |
| if err := json.Unmarshal(payload, &session); err != nil { | |
| return nil, fmt.Errorf("terminal: decode session %s: %w", sessionID, err) | |
| } | |
| if subtle.ConstantTimeCompare([]byte(session.TokenHash), []byte(hashToken(bearerToken))) != 1 { | |
| return nil, ErrTokenMismatch | |
| } | |
| return &session, nil |
Address the 8 actionable + 6 nitpick comments CodeRabbit posted on #76 (run 352d924a) plus the reviewer's blocker finding on the terminal-token contract. Blocker — terminal token contract mismatch TestProxyValidateTerminalToken_DoesNotConsumeToken was the pre-rc10 contract ("validation does NOT consume the token") — which is exactly the replay vulnerability the rc10 audit flagged. The updated terminal.TokenStore.Validate consumes on success via GETDEL, so the test was asserting the old contract and failing under the new one. Replace with TestProxyValidateTerminalToken_IsSingleUse asserting the correct rc10 semantic: first validation succeeds, second returns Unauthenticated. Rewrite the RPC doc comment on ProxyValidateTerminalToken to document the single-use contract and the forgery-restore behaviour. terminal_bridge.go only calls the RPC once per WebSocket and uses the returned metadata for the connection's lifetime, so single-use is compatible with the real gateway flow. CodeRabbit actionables • internal/api/errors_parity_test.go - Path used 4 `..` which over-traverses from server/internal/api up to / instead of to the workspace root. Primary fallback is now 3 `..` (workspace layout); the 4-`..` candidate is kept as a secondary fallback for harnesses that run tests from one level higher. PM_SDK_TS_ERRORS env var still takes precedence. - Relaxed the TS extraction regex to accept single, double, or backtick quotes plus an optional `: string` type annotation so a future `export const ErrFoo: string = "…"` refactor still matches. Symmetric change not needed on the Go side — errors.go uses simple `ErrX = "…"` literals. - Removed the dead strings.Count("Err", "E") sanity-init stub; it was a compile-time constant and guarded nothing. • internal/api/registration_handler.go - ValidateGatewayURL now uses u.Hostname() rather than u.Host so pathological shapes like "https://:8443" (port only, no host) are rejected — u.Host would accept them. The validate_gateway_url_test.go suite gains a regression case for exactly that shape. - Exported RedactGatewayURL so cmd/control/main.go can strip userinfo from the URL before logging (the validator rejects URLs with credentials, but the failure log should not echo them back verbatim). Called on all three sites that surface the raw URL: main.go startup error, NewRegistrationHandler panic, and Register's defensive re-validation. • internal/store/migrations/010_security_alerts_projection.sql - SecurityAlertAcknowledged UPDATE now casts alert_id to UUID on the RHS instead of casting event_id to TEXT; lets the primary-key index satisfy the predicate and surfaces a malformed alert_id as a projection error rather than a silent full-scan no-match. - If the UPDATE hits zero rows (out-of-order replay, purged alert) RAISE EXCEPTION so the sidecar trigger's handler logs it into projection_errors. Previously silent. • internal/store/queries/security_alerts.sql - CountUnacknowledgedSecurityAlerts now returns bigint instead of ::int, matching PostgreSQL's native COUNT(*) return and keeping parity with buildNextPageToken's int64 handling. - Added CountSecurityAlertsForDevice companion so the forthcoming ListSecurityAlerts handler has the total it needs to compute next-page tokens correctly. • internal/control/inbox_worker.go - handleRevokeLuksDeviceKeyResult no longer masks every error from GetLuksRevocationStreamID as "not found". Only pgx.ErrNoRows triggers the fresh-ULID fallback (the genuinely- missing-Requested case). Any other error (context, transient DB, etc.) returns so Asynq retries rather than forking the audit stream on every DB flake. • internal/config/config.go - New Config.Validate() method catches the TTY/MTLS host collision when terminal + Traefik self-register are both enabled. Without this, Traefik's TCP passthrough router for the shared SNI wins over the TTY HTTP router and WebSocket handshakes silently break. cfg.Validate() is called at the top of cmd/gateway/main.go so the failure is visible at startup rather than on first terminal attempt. - TestValidate_TTYMTLSHostCollision covers the four shapes (collision+terminal-on, collision+terminal-off, distinct+ terminal-on, collision+self-register-off). • cmd/gateway/main.go - Dropped the redundant `if gatewayReg == nil` guard before ensureGatewayRegistry() inside the TraefikSelfRegister block — the closure is idempotent, the guard was noise. - Moved the InternalURL registration block out from behind `if gatewayReg != nil` and now unconditionally calls ensureGatewayRegistry() before it, so admin fan-out stays available even when Traefik self-register and terminal URL are both off (previously the entire block silently skipped if neither feature had already initialised the registry). CodeRabbit nitpick • internal/api/auth_handler_test.go TestLogin_GlobalPasswordAuthDisabled now asserts the error string contains "password login is disabled" in addition to the connect code check. A future refactor that collapses the branch into the generic invalid-credentials path would pass the code assertion but fail the substring — which is exactly the signal we want. Part of #76 review round 2.
Reviewer called out a coherence lie introduced in the rc10 review round 2: config.go's comment above the TraefikTTYHost fallback still promised single-domain terminal deployments, but the new Config.Validate() correctly rejects that shape when terminal + Traefik self-registration are both on. The code was right; the comment and .env.example weren't. config.go: rewrite the comment block so it states that the GATEWAY_DOMAIN rung only works when the terminal listener is disabled or when Traefik self-registration is off, and that a distinct GATEWAY_TTY_DOMAIN is required in the combined case. .env.example: same clarification in operator-facing terms, with the rationale (Traefik TCP passthrough shadows the TTY HTTP router on a shared SNI) and the fail-fast behavior at startup so an operator hitting this gets an immediate error instead of a silent WebSocket break. No code change; coherence fix only.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/config/config.go (1)
160-171:⚠️ Potential issue | 🟡 MinorUpdate the defaults summary to match the new fallback chain.
Line 169 still says TTYHost falls back to empty, but lines 211-215 now fall back to
GATEWAY_DOMAIN.Suggested wording
- // * TTYHost = GATEWAY_TTY_DOMAIN (falls back to empty -> TTY router disabled) + // * TTYHost = GATEWAY_TTY_DOMAIN, then GATEWAY_DOMAIN + // (Validate rejects unsafe shared-host terminal+self-register configs)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/config.go` around lines 160 - 171, Update the Traefik defaults summary comment so it matches the new fallback chain: change the TTYHost bullet (currently "falls back to empty -> TTY router disabled") to indicate it now falls back to GATEWAY_DOMAIN (so TTYHost -> GATEWAY_TTY_DOMAIN, fallback GATEWAY_DOMAIN) and ensure the wording for TTYEntryPoint/TTYCertResolver remains consistent; edit the comment block around the Traefik defaults (the lines describing SelfRegister, RootKey, MTLSHost, MTLSEntryPoint, TTYHost, TTYEntryPoint, TTYCertResolver) to reflect this updated fallback behavior.cmd/gateway/main.go (1)
198-220:⚠️ Potential issue | 🟡 MinorReject port-only URL hosts before enabling redirects or terminal registration.
When a template like
https://${UNSET_VAR}:8443/...has an unset env var, it collapses tohttps://:8443/.... Go's URL parser accepts this, settingu.Host = ":8443"(non-empty) butu.Hostname() = ""(empty). The currenthostFromURLcheck at line 706 (u.Host == "") passes, returning:8443as the host. This value is then used for bootstrap redirects and terminal registry registration—both invalid since there is no hostname.Update
hostFromURLto also reject the case where hostname is empty:Suggested fix
- if u.Host == "" { + if u.Host == "" || u.Hostname() == "" { return "" } return u.Host🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/gateway/main.go` around lines 198 - 220, The hostFromURL helper currently treats a value like ":8443" as a valid host because it only checks u.Host != "", which lets port-only hosts through and enables invalid bootstrap redirects/terminal registration; update hostFromURL to also return "" when u.Hostname() is empty (i.e., treat u.Hostname() == "" as invalid), so calls in main.go that rely on hostFromURL (the agent URL and terminal URL checks) will skip enabling redirects/registration when the template resolves to a port-only or hostname-less URL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/control/main.go`:
- Around line 94-108: The startup info log currently emits cfg.GatewayURL
verbatim via logger.Info("starting control server", ..., "gateway_url",
cfg.GatewayURL); change that to log a redacted value by calling
api.RedactGatewayURL(cfg.GatewayURL) so sensitive userinfo doesn't appear in
logs (i.e., replace the "gateway_url", cfg.GatewayURL argument with
"gateway_url", api.RedactGatewayURL(cfg.GatewayURL) in the startup logger.Info
call).
In `@internal/config/config.go`:
- Around line 243-247: The current collision check only triggers when
WebListenAddr is set, missing cases where an explicit TraefikTTYBackend is
provided and it also rejects valid setups where MTLS and TTY use different
entrypoints; update the condition in internal/config/config.go to require
c.TraefikSelfRegister, non-empty c.TraefikTTYHost and c.TraefikMTLSHost,
equality of those hosts, and then treat it as a collision only when either
c.WebListenAddr is non-empty or c.TraefikTTYBackend is non-empty AND the
entrypoints are the same (i.e. c.TraefikTTYEntryPoint ==
c.TraefikMTLSEntryPoint); reference these symbols (c.TraefikSelfRegister,
c.WebListenAddr, c.TraefikTTYBackend, c.TraefikTTYHost, c.TraefikMTLSHost,
c.TraefikTTYEntryPoint, c.TraefikMTLSEntryPoint) to implement the new boolean
logic so explicit TTY backends without WebListenAddr trigger an error while
differing entrypoints do not.
---
Outside diff comments:
In `@cmd/gateway/main.go`:
- Around line 198-220: The hostFromURL helper currently treats a value like
":8443" as a valid host because it only checks u.Host != "", which lets
port-only hosts through and enables invalid bootstrap redirects/terminal
registration; update hostFromURL to also return "" when u.Hostname() is empty
(i.e., treat u.Hostname() == "" as invalid), so calls in main.go that rely on
hostFromURL (the agent URL and terminal URL checks) will skip enabling
redirects/registration when the template resolves to a port-only or
hostname-less URL.
In `@internal/config/config.go`:
- Around line 160-171: Update the Traefik defaults summary comment so it matches
the new fallback chain: change the TTYHost bullet (currently "falls back to
empty -> TTY router disabled") to indicate it now falls back to GATEWAY_DOMAIN
(so TTYHost -> GATEWAY_TTY_DOMAIN, fallback GATEWAY_DOMAIN) and ensure the
wording for TTYEntryPoint/TTYCertResolver remains consistent; edit the comment
block around the Traefik defaults (the lines describing SelfRegister, RootKey,
MTLSHost, MTLSEntryPoint, TTYHost, TTYEntryPoint, TTYCertResolver) to reflect
this updated fallback behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 374e4ec0-98ab-4818-8a00-f639cb2a7b58
⛔ Files ignored due to path filters (1)
internal/store/generated/security_alerts.sql.gois excluded by!**/generated/**
📒 Files selected for processing (14)
cmd/control/main.gocmd/gateway/main.godeploy/.env.exampleinternal/api/auth_handler_test.gointernal/api/errors_parity_test.gointernal/api/internal_handler.gointernal/api/internal_handler_test.gointernal/api/registration_handler.gointernal/api/validate_gateway_url_test.gointernal/config/config.gointernal/config/config_test.gointernal/control/inbox_worker.gointernal/store/migrations/010_security_alerts_projection.sqlinternal/store/queries/security_alerts.sql
✅ Files skipped from review due to trivial changes (2)
- deploy/.env.example
- internal/api/internal_handler.go
🚧 Files skipped from review as they are similar to previous changes (5)
- internal/api/validate_gateway_url_test.go
- internal/api/errors_parity_test.go
- internal/control/inbox_worker.go
- internal/api/registration_handler.go
- internal/api/auth_handler_test.go
Address the four findings CodeRabbit posted on the 8c7aeda diff (run 374e4ec0). All four are the same class of issue — "the code now does X but some sibling code/comment/check didn't get the memo" — which is exactly what happens when a previous review round fixes one site but misses its twins. Sweeping the twins now. 1. internal/config/config.go (outside diff, line ~169) The Traefik defaults summary comment still said "TTYHost falls back to empty → TTY router disabled" while the actual code (and the matching comment I fixed in the previous push) now falls back through GATEWAY_TTY_DOMAIN to GATEWAY_DOMAIN, with Validate() rejecting the unsafe shared-host case. Rewrite the bullet so the summary matches the body. 2. cmd/gateway/main.go hostFromURL (outside diff, lines ~695-707) ValidateGatewayURL got tightened to u.Hostname() == "" in the previous round, catching shapes like "https://:8443". The sibling hostFromURL helper used for PublicAgentURLTemplate / PublicTerminalURLTemplate still only checked u.Host == "", which is non-empty for ":8443" — so a template like "https://${UNSET}:8443/..." collapsed into a port-only URL still passed hostFromURL, enabling invalid bootstrap redirects and terminal registry entries. Add the same u.Hostname() invariant here so both validators agree. 3. cmd/control/main.go startup log (inline, line ~94) The rc10 review round 2 redacted GatewayURL on the error path but left the startup Info log emitting cfg.GatewayURL verbatim. A bad shape that slipped past the validator (or an operator paste mistake that the validator rejects after the log line) would still dump any userinfo into every boot log. Wrap the startup log's gateway_url field in api.RedactGatewayURL too so the redaction is symmetric across both log sites. 4. internal/config/config.go Validate() (inline, lines ~243-252) The collision check triggered only on WebListenAddr != "", missing the case where the operator sets an explicit GATEWAY_TRAEFIK_TTY_BACKEND without WebListenAddr. It also flagged topologies where the mTLS and TTY routers bind DIFFERENT entrypoints — which don't actually collide because Traefik routes per-entrypoint before per-host, so a false positive would fail-fast a valid bring-your-own-Traefik split- port setup. Split the precondition into: terminalEnabled = WebListenAddr || TraefikTTYBackend entrypointsCollide = TTYEntryPoint == MTLSEntryPoint (and non-empty) and only flag the host collision when all three preconditions line up. TestValidate_TTYMTLSHostCollision now covers two new cases (explicit TTYBackend triggers error; different entry- points skips the check) alongside the original four. All targeted go tests green: validator suite, config collision matrix, login bypass regression, terminal token single-use, registration handler, error-code parity. Part of #76 review round 3.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/gateway/main.go (1)
198-226:⚠️ Potential issue | 🟠 MajorDon’t fall back to the terminal host after an invalid agent URL template.
Line 224 treats
assignedHost == ""as “agent template absent”, but it is also true whenGATEWAY_PUBLIC_AGENT_URL_TEMPLATEwas explicitly set and resolved to no host. That can re-enable bootstrap redirects to the TTY host instead of disabling the middleware as intended.🐛 Proposed fix
- if cfg.PublicAgentURLTemplate != "" { + agentURLTemplateConfigured := cfg.PublicAgentURLTemplate != "" + if agentURLTemplateConfigured { agentURL := strings.ReplaceAll(cfg.PublicAgentURLTemplate, "{id}", gatewayID) assignedHost = hostFromURL(agentURL) if assignedHost == "" { logger.Warn("GATEWAY_PUBLIC_AGENT_URL_TEMPLATE resolved to a URL with no host — bootstrap redirects disabled; check for unset env vars in the template", "template", cfg.PublicAgentURLTemplate, "resolved", agentURL) @@ - if assignedHost == "" { + if assignedHost == "" && !agentURLTemplateConfigured { assignedHost = terminalHost }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/gateway/main.go` around lines 198 - 226, The code currently treats assignedHost == "" as “no agent template provided”, but that conflates an absent template with an explicitly-set-but-malformed one and allows a terminal host fallback when it should be disabled; fix by tracking whether an agent template was provided/resolved separately (e.g. introduce a boolean like agentTemplateProvided or agentTemplateResolvedOK), set it when cfg.PublicAgentURLTemplate != "" and only set assignedHost when hostFromURL(agentURL) returns a non-empty host, then change the terminal fallback to only assign terminalHost when the agent template was not provided (agentTemplateProvided == false) rather than when assignedHost == "" so malformed agent templates don’t enable the fallback; reference the existing symbols cfg.PublicAgentURLTemplate, hostFromURL, assignedHost, cfg.PublicTerminalURLTemplate and terminalHost.
🧹 Nitpick comments (1)
cmd/gateway/main.go (1)
316-350: Bound Redis publish and refresh calls with shutdown-aware timeouts.These fail-open paths still use
context.Background(), so a slow Redis operation can ignore shutdown and delay cleanup. Match the bounded heartbeat pattern already used byregistry.RegisterGateway.♻️ Proposed fix
- if err := gatewayReg.PublishTraefikRoute( - context.Background(), gatewayID, traefikCfg, registry.DefaultGatewayTTL, - ); err != nil { + publishCtx, cancelPublish := context.WithTimeout(shutdownCtx, 5*time.Second) + err := gatewayReg.PublishTraefikRoute( + publishCtx, gatewayID, traefikCfg, registry.DefaultGatewayTTL, + ) + cancelPublish() + if err != nil { @@ select { case <-ticker.C: - if err := gatewayReg.PublishTraefikRoute( - context.Background(), gatewayID, traefikCfg, registry.DefaultGatewayTTL, - ); err != nil { + refreshCtx, cancelRefresh := context.WithTimeout(traefikRefreshCtx, 5*time.Second) + err := gatewayReg.PublishTraefikRoute( + refreshCtx, gatewayID, traefikCfg, registry.DefaultGatewayTTL, + ) + cancelRefresh() + if err != nil { logger.Warn("failed to refresh Traefik routing config", "error", err) } @@ if internalURL != "" { - if err := gatewayReg.RegisterGatewayInternal( - context.Background(), gatewayID, internalURL, registry.DefaultGatewayTTL, - ); err != nil { + registerCtx, cancelRegister := context.WithTimeout(shutdownCtx, 5*time.Second) + err := gatewayReg.RegisterGatewayInternal( + registerCtx, gatewayID, internalURL, registry.DefaultGatewayTTL, + ) + cancelRegister() + if err != nil { logger.Warn("failed to register gateway internal URL", "error", err) } @@ select { case <-ticker.C: - if err := gatewayReg.RegisterGatewayInternal( - context.Background(), gatewayID, internalURL, registry.DefaultGatewayTTL, - ); err != nil { + refreshCtx, cancelRefresh := context.WithTimeout(internalRefreshCtx, 5*time.Second) + err := gatewayReg.RegisterGatewayInternal( + refreshCtx, gatewayID, internalURL, registry.DefaultGatewayTTL, + ) + cancelRefresh() + if err != nil { logger.Warn("failed to refresh gateway internal URL", "error", err) }Also applies to: 403-425
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/gateway/main.go` around lines 316 - 350, The PublishTraefikRoute calls use context.Background() and can block shutdown; change them to use a shutdown-aware bounded context derived from shutdownCtx (e.g., create ctx, cancel := context.WithTimeout(shutdownCtx, registry.DefaultGatewayRefreshInterval/2) or the same heartbeat timeout used in registry.RegisterGateway) and pass that ctx into gatewayReg.PublishTraefikRoute (both the initial publish and the periodic refresh inside the goroutine), ensuring you call cancel() after each call so Redis ops respect shutdown and the goroutine exits promptly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cmd/gateway/main.go`:
- Around line 198-226: The code currently treats assignedHost == "" as “no agent
template provided”, but that conflates an absent template with an
explicitly-set-but-malformed one and allows a terminal host fallback when it
should be disabled; fix by tracking whether an agent template was
provided/resolved separately (e.g. introduce a boolean like
agentTemplateProvided or agentTemplateResolvedOK), set it when
cfg.PublicAgentURLTemplate != "" and only set assignedHost when
hostFromURL(agentURL) returns a non-empty host, then change the terminal
fallback to only assign terminalHost when the agent template was not provided
(agentTemplateProvided == false) rather than when assignedHost == "" so
malformed agent templates don’t enable the fallback; reference the existing
symbols cfg.PublicAgentURLTemplate, hostFromURL, assignedHost,
cfg.PublicTerminalURLTemplate and terminalHost.
---
Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 316-350: The PublishTraefikRoute calls use context.Background()
and can block shutdown; change them to use a shutdown-aware bounded context
derived from shutdownCtx (e.g., create ctx, cancel :=
context.WithTimeout(shutdownCtx, registry.DefaultGatewayRefreshInterval/2) or
the same heartbeat timeout used in registry.RegisterGateway) and pass that ctx
into gatewayReg.PublishTraefikRoute (both the initial publish and the periodic
refresh inside the goroutine), ensuring you call cancel() after each call so
Redis ops respect shutdown and the goroutine exits promptly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 03d850c8-4d49-4d13-9941-c6c1d4ff77fe
📒 Files selected for processing (4)
cmd/control/main.gocmd/gateway/main.gointernal/config/config.gointernal/config/config_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/config/config.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/config/config_test.go
- cmd/control/main.go
…edis contexts CodeRabbit round 4 on fb4ab1f (run 03d850c8) surfaced one major correctness issue and one maintainability nitpick. Both fixed. 1. cmd/gateway/main.go lines 198-226 — major, outside diff The fallback at line 224 treated assignedHost == "" as "no agent template provided", which conflates two distinct states: (a) GATEWAY_PUBLIC_AGENT_URL_TEMPLATE unset entirely (b) GATEWAY_PUBLIC_AGENT_URL_TEMPLATE set but resolved to a URL with no host (e.g. "https://${UNSET_VAR}" collapses to "https://") Case (b) already logs a warning and disables bootstrap redirects. But the subsequent fallback `if assignedHost == "" { assignedHost = terminalHost }` couldn't distinguish (a) from (b), so a malformed agent template silently re-enabled bootstrap redirects to the TTY host — directly contradicting the intent of disabling them when the operator's explicit template couldn't resolve. Introduce agentURLTemplateConfigured, set it when cfg.PublicAgentURLTemplate != "", and gate the terminal-host fallback on `assignedHost == "" && !agentURLTemplateConfigured`. An operator who wrote a broken template now gets exactly what the warning promises: feature disabled, not silently redirected to the wrong host. 2. cmd/gateway/main.go lines 316-350, 403-425 — nitpick PublishTraefikRoute + RegisterGatewayInternal (both the initial call and the periodic refresh) used context.Background(), so a slow or hung Redis would ignore SIGTERM and delay graceful shutdown. The RevokeTraefikRoute defer already uses a 5s bound for exactly this reason. Bound both calls with shutdownCtx + 5s timeout (refresh calls derive from the refresh's own cancellable context rather than shutdownCtx directly, so the timeout cancels cleanly when the refresh goroutine stops). Matches the heartbeat pattern used inside registry.RegisterGateway. Targeted tests all pass: config Validate matrix, parity guard, registration validator, terminal token single-use, login bypass regression. Part of #76 review round 4.
|
@codserabbitai review |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cmd/gateway/main.go (1)
406-491: DeadgatewayReg != nilguard — registry is now unconditionally initialized.Since
ensureGatewayRegistry()is called unconditionally at line 406 (as the comment there explicitly states: "if terminal + Traefik are both off, this is where the registry gets built"),gatewayRegcannot benilwhen execution reaches line 489. The guard is dead and slightly misleading — a reader now has to re-derive the invariant to convince themselvesSetGatewayRoutingalways runs.♻️ Proposed simplification
- if gatewayReg != nil { - agentHandler.SetGatewayRouting(gatewayReg, gatewayID) - } + agentHandler.SetGatewayRouting(gatewayReg, gatewayID)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/gateway/main.go` around lines 406 - 491, The nil-check around gatewayReg is dead because ensureGatewayRegistry() always returns a registry; remove the conditional and call agentHandler.SetGatewayRouting(gatewayReg, gatewayID) unconditionally (locate the ensureGatewayRegistry() call and the agentHandler variable near the agent handler setup and replace the if gatewayReg != nil { ... } block with a direct call), keeping behavior identical but avoiding misleading guards.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/gateway/main.go`:
- Around line 241-264: The initial call to gatewayReg.RegisterGateway uses
context.Background() and can hang; change it to use a context derived from
shutdownCtx with a 5s timeout (e.g., ctx, cancel :=
context.WithTimeout(shutdownCtx, 5*time.Second); defer cancel()) when calling
gatewayReg.RegisterGateway so the startup register is bounded like
PublishTraefikRoute and RegisterGatewayInternal; keep the existing error
handling and defer stop() semantics (call cancel immediately after
RegisterGateway returns if needed).
---
Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 406-491: The nil-check around gatewayReg is dead because
ensureGatewayRegistry() always returns a registry; remove the conditional and
call agentHandler.SetGatewayRouting(gatewayReg, gatewayID) unconditionally
(locate the ensureGatewayRegistry() call and the agentHandler variable near the
agent handler setup and replace the if gatewayReg != nil { ... } block with a
direct call), keeping behavior identical but avoiding misleading guards.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1d41b4b7-c137-41c5-bf12-da9601e7949b
📒 Files selected for processing (1)
cmd/gateway/main.go
…eg nil guard CodeRabbit round 5 on 7f4179a (run 1d41b4b7) posted one actionable and one nitpick, both completing patterns the previous rounds established. 1. cmd/gateway/main.go line 242 — actionable RegisterGateway's initial call still used context.Background() while its round-4 siblings PublishTraefikRoute and RegisterGatewayInternal now use shutdownCtx + 5s. Match the pattern here too — a slow/hung Valkey at startup otherwise ignores SIGTERM on the registry register path. The stop() return value keeps its own refresh-goroutine cancellation wiring, so the bound only affects the one-shot initial call. 2. cmd/gateway/main.go line 499 — nitpick After the round-3 refactor moved ensureGatewayRegistry() to fire unconditionally above the internal-URL block, gatewayReg cannot be nil by the time we reach SetGatewayRouting. The `if gatewayReg != nil` guard is dead and misleading — a reader has to re-derive the invariant. Drop the guard, leave a one-line comment recording why it's safe to call unconditionally. Targeted go tests unchanged and still green (no test surface touched this round). Part of #76 review round 5.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cmd/gateway/main.go (1)
322-404: Skip-on-empty-backend guard and bounded publish/refresh/revoke all look right.Gating publication on a non-empty
mtlsBackend(L325) prevents poisoning the shared pm-mtls TCP router with an unusable entry, which matchesinternal/gateway/registry/traefik.go'sPublishTraefikRoutesemantics — it writes all keys under the same TTL, so a bad initial publish would be observable to every other replica. The refresh goroutine deriving fromshutdownCtx(L363) and the revoke using a freshcontext.Background()+5s (L389) correctly handles the "shutdownCtx already done" case at teardown.One very small nit: the
logger.Info("traefik self-registration enabled", …)at L396 fires even when the initial publish warned at L348-358, which could mislead an operator reading logs after a transient Valkey failure — the refresh loop will eventually reconcile it, so "enabled" is functionally correct, but something like an extra"initial_publish_ok": err == nilfield would make the log self-diagnosing. Optional.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/gateway/main.go` around lines 322 - 404, The final logger.Info call logs "traefik self-registration enabled" even if the initial PublishTraefikRoute returned an error; include a diagnostic field so the log shows whether the initial publish succeeded by adding something like "initial_publish_ok", err == nil to the logger.Info invocation (referencing the err from the initial publish block and the logger.Info call).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cmd/gateway/main.go`:
- Around line 322-404: The final logger.Info call logs "traefik
self-registration enabled" even if the initial PublishTraefikRoute returned an
error; include a diagnostic field so the log shows whether the initial publish
succeeded by adding something like "initial_publish_ok", err == nil to the
logger.Info invocation (referencing the err from the initial publish block and
the logger.Info call).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b2171b1c-3992-48a7-aeaa-828c62552edc
📒 Files selected for processing (1)
cmd/gateway/main.go
… status in log CR review on e8294f6 posted one low-severity doc mismatch and one nitpick. Practicing the sibling-sweep discipline from the pre-push memory: grep for every IMAGE_TAG reference in the repo before fixing, and grep for every "feature enabled" log site before touching the one CR flagged. 1. deploy/.env.example IMAGE_TAG contract — CR flagged Comment said "default: latest" with a commented-out `IMAGE_TAG=latest`. compose.yml's `${IMAGE_TAG:?...}` enforces the opposite: operators who don't set IMAGE_TAG get a hard startup refusal, which is the intent. Rewrite the doc to match reality: IMAGE_TAG is required, uncommented by default with a sensible placeholder (v2026.05), with a comment explaining why the compose file enforces it. 2. deploy/deploy.sh IMAGE_TAG fallback — sibling sweep caught this Not in CR's findings but the same contract. deploy.sh reads IMAGE_TAG from the remote .env and falls back to "latest" when missing: IMAGE_TAG=$(grep -oP '(?<=^IMAGE_TAG=).*' .env || echo latest) But compose.yml refuses to start without IMAGE_TAG explicitly set. So the fallback retags images to :latest, then compose rejects them with its own less-specific error. Fail here instead with a pointer at .env.example so the operator's actual fix is visible in the first error. 3. cmd/gateway/main.go "traefik self-registration enabled" log — CR nit After a failed initial PublishTraefikRoute the Warn line above still emits. The later Info line says "enabled" unconditionally, which is true (refresh goroutine will retry) but slightly muddy — an operator skimming a Valkey-wobble startup can't tell "registered now" from "will retry in 45s" without cross-referencing the Warn. Add `initial_publish_ok` to the structured fields so the state is explicit. Sibling check: the only other "feature enabled" log is the "multi-gateway routing enabled" line at ~269, which is already correctly scoped inside the else branch of its register-error check — it only fires when RegisterGateway succeeded. No action needed there. Part of #76 review round 6.
Reverts the "IMAGE_TAG required" part of the rc10 review round 6 commit (69abe32). That change was based on a wrong assumption about the release-engineering workflow: I treated ${IMAGE_TAG:-latest} as a silent-rolling-forward footgun, but the CI tag contract here is deliberate: :latest — latest stable release (CI promotes only after a stable version is cut; does not advance on every rc) :latest-rc — latest pre-release / release candidate vYYYY.MM — pinned-release specific tag So :latest is a curated pointer, not an accidental one. Requiring explicit pinning broke quickstart (docker compose up on a fresh checkout) and CI-reference-stack workflows for zero real gain — the supply-chain concern I was mitigating doesn't apply to a promoted-by- CI floating tag. Restores: * compose.yml: ${IMAGE_TAG:-latest} on all three image lines * .env.example: commented-out IMAGE_TAG=latest default, with a comment block that documents all three tag shapes (latest, latest-rc, vYYYY.MM) so operators pick the right one for their stance on rollbacks * deploy.sh: |- || echo latest fallback restored; matches the compose default rather than failing hard on unset IMAGE_TAG Memory saved under project_image_tag_contract.md so this doesn't get re-litigated in a future review round. The Traefik "initial_publish_ok" log field from 69abe32 stays — that's an unrelated nit fix on its own merits.
Summary
Large consolidation PR addressing findings from the rc10 audit sweep. Breaking changes allowed per scope decision. Organized by theme:
Security-critical fixes
CONTROL_PASSWORD_AUTH_ENABLEDbypass. The global flag previously only gated the SSO handler's `ListAuthMethods` response (a UI hint); the Login RPC itself never checked it, so direct POST-based password login succeeded against any account with `has_password=true` even in SSO-only deployments. Now enforced in `AuthHandler.Login` before the DB lookup, burning the dummy bcrypt cycle on the disabled path so timing doesn't leak whether password login is enabled. Regression test: `TestLogin_GlobalPasswordAuthDisabled`.Fail-open audit (gateway startup)
Six `os.Exit(1)` paths converted to `Warn`+feature-disabled so a malformed optional knob no longer takes agent mTLS down:
Remaining `os.Exit` paths (TLS cert/key/CA flags, Valkey/Control URL missing, cert parse, `tls.Config`, `http2.ConfigureServer`) stay fatal — gateway cannot serve primary mTLS role without them.
Data integrity
Config coherence
Deploy hardening
Docs
Error-code parity guard
New unit test `TestErrorCodeParityWithTSSDK` walks `errors.go` and the sdk's `ts/errors.ts` via filesystem, asserts the two code sets match. Catches the drift the rc10 audit found (4 codes missing on TS side).
Deferred to follow-ups (explicit non-goals for this PR)
The rc10 plan also called for:
Each of these has been captured on the rc10 audit notes and can land as focused follow-ups without the scope creep risk of piling them all into one PR.
Audit claims verified and rejected
Publishing these explicitly so future devs don't chase ghosts:
Fixes #74.
Test plan
Summary by CodeRabbit
New Features
Improvements
Documentation
Tests