fix: lockbox RPC migration + automation hardening#113
Conversation
Supabase's PostgREST rejects schema-qualified table access for non-public schemas when using service_role. Switch oauth-core LockboxAdapter to call SECURITY DEFINER RPC functions (lockbox_get/save/delete/list/exists) in public schema, which access the lockbox.user_secrets table internally. Includes migration SQL, grant hardening (service_role only), and tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdds job status tracking, file-backed storage for automation run messages, RPC-hardened lockbox functions, and an execution gate that prevents automation actions outside the expected environment; updates engine, worker, API routes, DB schema/migrations, types, and tests to support these behaviors. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant API as Web API
participant Auth
participant Gate as ExecutionGate
participant Service as ServiceClient/DB
participant Engine
Client->>API: POST /api/automations/:id/trigger
API->>Auth: validate session
Auth-->>API: user
API->>Gate: getAutomationExecutionGate()
Gate-->>API: {allowed, reason}
alt allowed
API->>Service: createServiceAppClient / insert run / claim job
Service-->>Engine: claim/enqueue job (status -> running)
Engine->>Engine: write messages to file -> returns messages_uri
Engine->>Service: update run with messages_uri and status
Service-->>API: success
API-->>Client: 200 OK
else denied
API-->>Client: 403 Forbidden (reason)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@apps/web/app/api/automations/`[id]/runs/[runId]/route.ts:
- Around line 68-74: The code silently asserts run.messages as unknown[];
instead perform a runtime type check and only assign when it's actually an array
(or safely parse JSON if run.messages can be a JSON string). Update the block
around messages, run.messages_uri and readMessagesFromUri to: if
run.messages_uri use readMessagesFromUri, else if Array.isArray(run.messages)
assign it to messages, and if run.messages is a string try JSON.parse in a
try/catch and validate the result is an array before assigning; do not use "as"
assertions—use these runtime guards to narrow types safely.
In `@apps/web/app/api/automations/`[id]/trigger/__tests__/route.test.ts:
- Around line 25-30: The test currently mocks structuredErrorResponse which
returns the wrong shape and bypasses NextResponse.json(); remove the mock block
for "@/lib/api/responses" in apps/web/.../route.test.ts so the real
structuredErrorResponse is used, or replace the mock with
vi.importActual/vi.requireActual to delegate to the real implementation; ensure
tests import/use structuredErrorResponse from the actual module and not a
handcrafted vi.fn so the returned object includes { ok:false, error, message,
details? } and uses NextResponse.json() (preserving CORS headers).
In `@apps/web/migrations/automation-hardening.sql`:
- Around line 40-60: The CHECK constraints on app.automation_jobs
(chk_cron_requires_schedule, chk_non_cron_no_schedule, chk_timeout_range,
chk_status_running_consistent) are not idempotent; wrap each ADD CONSTRAINT in a
DO block that first checks pg_catalog.pg_constraint (or pg_constraint) for
conname and only executes ALTER TABLE ... ADD CONSTRAINT if that name doesn't
exist so re-running the migration won't raise duplicate_object errors.
- Around line 27-30: The UPDATE backfill for app.automation_jobs is
unconditional and will overwrite existing statuses on re-run; modify each UPDATE
(the three statements that set status = 'running' / 'disabled' / 'idle' using
running_at and is_active) to include an additional guard that only updates rows
that are still at the default/unset state (e.g., WHERE status = 'idle' or status
IS NULL) so the migration only affects rows that haven't been explicitly set by
the application.
In `@apps/web/migrations/lockbox-rpc-hardening.sql`:
- Around line 258-281: The REVOKE statements for functions like lockbox_get,
lockbox_save, lockbox_delete, lockbox_list, and lockbox_exists unconditionally
revoke from roles anon and authenticated and will error if those roles don't
exist; wrap each REVOKE ... FROM anon and REVOKE ... FROM authenticated in a DO
block that checks pg_roles (IF EXISTS (SELECT 1 FROM pg_roles WHERE rolname =
'anon') THEN REVOKE ... END IF) and similarly for 'authenticated' so the revoke
is conditional and won't abort the transaction (mirror the role-check pattern
used in the init migration).
In `@packages/automation-engine/src/engine.ts`:
- Around line 271-275: When writeMessagesToFile(ctx.runId, result.messages)
returns null while result.messages exists, the messages are silently lost;
update the logic that prepares the run record (the code using messagesUri and
result.messages) to handle this failure by either (A) falling back to storing
the messages inline in the DB: set messages = result.messages and messages_uri =
null when messagesUri is null, or (B) mark the run metadata with a failure flag:
add messages_write_failed: true (and keep messages or messages_uri as-is).
Locate the callsite that assigns messagesUri and the subsequent run record
creation (references: writeMessagesToFile, messagesUri, result.messages,
ctx.runId) and implement one of these two behaviors so the run record reflects
the write failure instead of silently dropping messages.
In `@packages/database/migrations/0001_init.sql`:
- Around line 786-819: The hardening migration performs bare REVOKE/GRANT for
roles like anon and authenticated which will error on Postgres installs without
those roles; update lockbox-rpc-hardening.sql to mirror the pattern used in the
init migration by wrapping each REVOKE/GRANT for anon and authenticated (and any
other non-guaranteed roles) inside IF EXISTS checks (e.g., IF EXISTS (SELECT 1
FROM pg_roles WHERE rolname = 'anon') THEN REVOKE ... FROM anon; END IF;) so
role operations only run when the role exists, and consider guarding
service_role the same way for consistency.
- Around line 626-670: The current version calc (SELECT
COALESCE(MAX(us.version), 0) + 1 INTO v_next_version) has a race with concurrent
lockbox_save calls — acquire a lightweight per-key advisory lock before
computing v_next_version and performing the UPDATE/INSERT to make the sequence
atomic; specifically, call pg_advisory_xact_lock (or pg_try_advisory_xact_lock)
using a hash of (p_user_id, p_instance_id, p_namespace, p_name) at the start of
the procedure, then perform the SELECT MAX(...), UPDATE lockbox.user_secrets
(is_current -> false), and INSERT returning user_secret_id as you already do, so
duplicates against user_secrets_instance_version_idx are prevented. Ensure the
lock acquisition is inside the function/transaction and use the same locking key
derivation wherever this sequence runs.
In `@packages/oauth-core/src/storage.ts`:
- Around line 209-211: The return uses a double cast "(data || []) as unknown as
UserSecret[]" which bypasses type safety; fix by aligning types or explicitly
mapping: either change LockboxRpcFunctions.lockbox_list.Returns to UserSecret[]
if LockboxListRow already matches UserSecret, or replace the cast with a
transformation that maps each LockboxListRow to a UserSecret (implement a mapper
function that converts fields and returns UserSecret[]), and return the mapped
array instead of casting; update any related type aliases (LockboxListRow,
UserSecret) so the compiler no longer requires an assertion.
🧹 Nitpick comments (11)
apps/web/lib/automation/execution-guard.ts (1)
8-8: Hardcoded domain violates coding guideline.
AUTOMATION_PRIMARY_MAIN_DOMAIN = "alive.best"is a hardcoded domain. The coding guidelines require all domain configuration to come fromserver-config.jsonat runtime. If this is an intentional security pin (only the primary production server should run automations), consider adding an inline comment documenting why the guideline is intentionally overridden, so future maintainers don't accidentally relax it.As per coding guidelines: "Do not hardcode domains; all domain configuration must come from
server-config.jsonat runtime."apps/web/app/api/automations/[id]/trigger/__tests__/route.test.ts (1)
60-94: Missing happy path test.The test suite covers 401 (no session) and 403 (gate denied) but lacks a happy path test. Per the coding guidelines and retrieved learnings, all new API routes must have tests covering authentication, happy path, and at least one error case.
A happy path test would mock a valid session, allowed gate, a found job with matching ownership/server, successful claim, and verify the 202 response.
Based on learnings: "For all new API routes in app/api/, write mandatory tests covering: authentication (401 without session), happy path, and at least one error case."
apps/web/lib/automation/__tests__/execution-guard.test.ts (1)
4-32: Missing test for themainDomainnot configured branch.The gate has four code paths, but only three are tested. The missing case is when
mainDomainis falsy (empty string or undefined), which triggers the "Main domain is not configured" denial at Line 29 ofexecution-guard.ts.💡 Suggested additional test case
+ it("blocks when mainDomain is not configured", () => { + const gate = getAutomationExecutionGate({ + streamEnv: "production", + mainDomain: "", + }) + + expect(gate.allowed).toBe(false) + expect(gate.reason).toContain("not configured") + })apps/worker/src/execution-guard.ts (1)
1-36: Duplicated execution guard logic — consider extracting to@webalive/shared.This file is essentially identical to
apps/web/lib/automation/execution-guard.ts(same interface, same constant, same function, same logic). The only difference is that the web version exportsAUTOMATION_PRIMARY_MAIN_DOMAINwhile this version doesn't.Extracting
AutomationExecutionGate,AUTOMATION_PRIMARY_MAIN_DOMAIN, andgetAutomationExecutionGateinto@webalive/sharedwould eliminate the duplication and ensure both apps stay in sync.apps/web/app/api/internal/automation/trigger/__tests__/route.test.ts (1)
49-85: Consider adding a happy-path test for the gate-allowed scenario.The two tests effectively cover the new gate logic (401 + 403), but per coding guidelines, tests should also include a happy path. A test verifying that when the gate allows execution, the route proceeds to parse the body and interact with the DB would strengthen coverage and catch regressions in the gate integration.
Based on learnings: "For all new API routes in app/api/, write mandatory tests covering: authentication (401 without session), happy path, and at least one error case."
packages/automation-engine/src/engine.ts (1)
33-34: Hardcoded absolute path for message storage.
MESSAGES_DIRis hardcoded to/var/log/automation-runs/messages. This works for a known deployment environment, but consider making it configurable (e.g., via env var with this as default) for portability and testing.♻️ Optional: make configurable
-const MESSAGES_DIR = "/var/log/automation-runs/messages" +const MESSAGES_DIR = process.env.AUTOMATION_MESSAGES_DIR ?? "/var/log/automation-runs/messages"packages/oauth-core/src/__tests__/storage.test.ts (1)
148-159: Missing test fordeleteRPC.The
lockbox_deleteRPC path is not covered. While it's a simple delegation, adding a test would complete the coverage for all five RPC operations and catch regressions if the RPC name or args change.🧪 Suggested test for delete
+ it("deletes secrets via lockbox_delete RPC", async () => { + rpcMock.mockResolvedValueOnce({ data: null, error: null }) + const adapter = new LockboxAdapter({ instanceId: "google:prod" }) + + await adapter.delete("user_123", "oauth_connections", "google") + + expect(rpcMock).toHaveBeenCalledWith("lockbox_delete", { + p_user_id: "user_123", + p_instance_id: "google:prod", + p_namespace: "oauth_connections", + p_name: "google", + }) + })apps/web/app/api/automations/[id]/__tests__/route.test.ts (1)
62-95: Mock doesn't verify which job ID the update targets.The
updateMock→.eq()chain doesn't assert the job ID value. The test verifies the update payload but not that it's applied to the correct record.♻️ Capture and assert the eq parameter
- const updateMock = vi.fn((updates: Record<string, unknown>) => ({ - eq: vi.fn(() => ({ + const eqMock = vi.fn() + const updateMock = vi.fn((updates: Record<string, unknown>) => ({ + eq: eqMock.mockReturnValue({ select: vi.fn(() => ({ single: vi.fn(() => Promise.resolve({ data: { id: "job_1", trigger_type: "one-time", cron_schedule: null, cron_timezone: null, ...updated, ...updates, }, error: null, }), ), })), - })), + }), }))Then in the test assertion:
expect(eqMock).toHaveBeenCalledWith("id", "job_1")apps/web/lib/automation/engine.ts (2)
1-12: Significant code duplication withpackages/automation-engine/src/engine.ts.Nearly all shared logic (
writeMessagesToFile,readMessagesFromUri,isWithinDirectory,finishJob,claimJob,claimDueJobs, constants) is duplicated between this file and the package version. This creates a maintenance risk — a bug fix or behavior change in one copy may not propagate to the other.Consider having the web layer import from
@webalive/automation-engineand extend with web-specific logic (executeJob, SSE broadcast) rather than maintaining two nearly-identical copies.Also applies to: 75-76
117-117:as AutomationJob[]type assertion.This violates the coding guideline against TypeScript
astype assertions. The RPC return type should ideally be typed correctly so the cast isn't needed. If the Supabase client types don't match, consider a type guard or narrowing function.As per coding guidelines: "Never use TypeScript
astype assertions oranytype; fix types properly at the source instead of silencing the compiler."apps/web/migrations/automation-hardening.sql (1)
66-68: Partial index could be tighter by includingis_active = true.The
claim_due_jobsfunction filters on bothstatus = 'idle'andis_active = true, but the partial index only filters onstatus = 'idle'. Addingis_active = trueto the index predicate would make it a more precise covering index for the query.♻️ Tighter partial index
CREATE INDEX IF NOT EXISTS idx_automation_jobs_due ON app.automation_jobs (next_run_at ASC) - WHERE status = 'idle' AND run_id IS NULL AND next_run_at IS NOT NULL; + WHERE status = 'idle' AND is_active = true AND run_id IS NULL AND next_run_at IS NOT NULL;
apps/web/app/api/automations/[id]/trigger/__tests__/route.test.ts
Outdated
Show resolved
Hide resolved
- Replace `as unknown[]` with `Array.isArray` runtime check in run details route - Remove mock of `structuredErrorResponse` in trigger test — use real function - Add DB fallback when file write fails for automation run messages - Eliminate `as unknown as UserSecret[]` double cast by aligning LockboxListRow type Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename middleware.ts → proxy.ts (Next.js 16 file convention) - Replace Sentry disableLogger with webpack.treeshake.removeDebugLogging - Export onRouterTransitionStart for Sentry navigation instrumentation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/automation-engine/src/engine.ts (1)
237-269:⚠️ Potential issue | 🔴 CriticalHandle
"skipped"status in addition to"success"and"failure".The
FinishOptionstype allows three possible status values:"success","failure", and"skipped". Currently, the code only handles the first two. If a job finishes with status"skipped", neitherconsecutiveFailuresnornextRunAtwill be updated, causing the job to silently stop scheduling without being disabled.
🤖 Fix all issues with AI agents
In `@apps/web/app/api/automations/`[id]/trigger/__tests__/route.test.ts:
- Around line 57-91: Add a "happy path" test that stubs a valid session and a
successful job claim: mock getSessionUserMock to resolve a user, ensure
getAutomationExecutionGateMock returns {allowed:true}, stub claimJob to return a
valid context (e.g., { jobId: "job_1", ... }) and have
createServiceAppClientMock return the service client, then call
POST(makeRequest(), { params: Promise.resolve({ id: "job_1" }) }) and assert
response.status === 202 and payload equals { ok: true, status: "queued" }; also
assert claimJob (and any service client methods invoked) were called with the
expected job id to verify the happy path flow.
In `@packages/automation-engine/src/engine.ts`:
- Around line 384-405: Remove the duplicated readMessagesFromUri from the web
app's local engine module and update tests to import the function from the
shared package: delete the readMessagesFromUri function declaration in
apps/web/lib/automation/engine.ts (leave executeJob and all other local exports
intact), then in apps/web/lib/automation/__tests__/engine.test.ts replace any
local import of readMessagesFromUri with an import from
`@webalive/automation-engine` (keeping the test usage unchanged) so tests use the
package implementation instead of the removed local copy.
🧹 Nitpick comments (3)
apps/web/vitest.config.ts (1)
91-91: Add a comment explaining why this test is excluded.The other exclusions (lines 87–90) are grouped under the TODO on line 86 explaining the root cause. This new entry has no context — is it the same
jsx-dev-runtimeresolution issue, or something else (e.g., a flaky timeout)? A brief inline comment will help someone later decide when it's safe to re-enable.apps/web/app/api/automations/[id]/trigger/__tests__/route.test.ts (1)
25-27: No-op mock can be removed.This
vi.mockwithimportOriginaljust re-exports the real module unchanged — it has no effect. It was likely left over from addressing the previous review comment. You can safely delete these three lines.♻️ Proposed cleanup
-vi.mock("@/lib/api/responses", async importOriginal => { - return await importOriginal<typeof import("@/lib/api/responses")>() -}) -packages/automation-engine/src/engine.ts (1)
272-287: Add a runtimeArray.isArrayguard on theJSON.parseresult.Line 282 assigns the
JSON.parseresult (typedany) directly tomessagesFallback: Json[] | nullwithout runtime validation. While the logic guarantees an array (since the source isresult.messages), adding anArray.isArraycheck aligns with the codebase guideline to avoid implicitanypropagation and is consistent with thereadMessagesFromUrifunction (line 401) which does validate withArray.isArray.Proposed fix
try { - messagesFallback = JSON.parse(JSON.stringify(result.messages)) + const parsed: unknown = JSON.parse(JSON.stringify(result.messages)) + messagesFallback = Array.isArray(parsed) ? parsed : null } catch {As per coding guidelines: "Never use TypeScript
astype assertions oranytype; fix types properly at the source instead of silencing the compiler."
- Add happy-path test (202 queued) for trigger route - Remove duplicate readMessagesFromUri from web app engine - Move readMessagesFromUri tests to automation-engine package - Add vitest to automation-engine package Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ssages
Two bugs fixed:
1. Integration endpoints (GET/DELETE/POST) used raw MCP provider key
(e.g. "gmail") instead of mapping to OAuth key ("google") via
getOAuthKeyForProvider(), causing INTEGRATION_NOT_CONNECTED errors
when the token exists under the correct key.
2. Alrighty API client used raw error code strings as ApiError.message
instead of the user-friendly message field from createErrorResponse().
Now prefers err.message over err.error for all API error responses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/web/lib/automation/engine.ts (1)
326-358:⚠️ Potential issue | 🟠 Major
skippedstatus does not computenext_run_atfor cron jobs — they will stop scheduling.When
finishJobis called withstatus: "skipped", neither thesuccessnor thefailurebranch executes, sonextRunAtremainsnull. For a cron job, the DB update at Line 380 setsnext_run_at: null, which means the scheduler will never pick the job up again.If "skipped" is a valid non-terminal outcome, the cron reschedule logic should apply here too.
Proposed fix
if (result.status === "success") { consecutiveFailures = 0 if (ctx.job.trigger_type === "one-time") { isActive = false jobStatus = "disabled" } else if (ctx.job.trigger_type === "cron" && ctx.job.cron_schedule) { const nextMs = computeNextRunAtMs( { kind: "cron", expr: ctx.job.cron_schedule, tz: ctx.job.cron_timezone ?? undefined }, now, ) if (nextMs) { nextRunAt = new Date(nextMs).toISOString() } } } else if (result.status === "failure") { // ... existing failure logic ... + } else if (result.status === "skipped") { + // Skipped is non-terminal — still schedule the next cron run + if (ctx.job.trigger_type === "cron" && ctx.job.cron_schedule) { + const nextMs = computeNextRunAtMs( + { kind: "cron", expr: ctx.job.cron_schedule, tz: ctx.job.cron_timezone ?? undefined }, + now, + ) + if (nextMs) { + nextRunAt = new Date(nextMs).toISOString() + } + } }
🧹 Nitpick comments (4)
apps/web/app/api/automations/[id]/trigger/__tests__/route.test.ts (1)
25-27: No-op mock can be removed.This
vi.mockimports and re-exports the original module unchanged — it has no effect. It appears to be a leftover from the previous iteration where a faulty mock was replaced. Removing it reduces noise.apps/web/lib/automation/engine.ts (1)
75-76: Unbounded growth of message files in/var/log/automation-runs/messages.Each run writes a new JSON file that is never cleaned up. Over time this directory will consume significant disk. Consider adding a retention/rotation policy (e.g., a periodic cleanup of files older than N days, or integrating with logrotate).
apps/web/lib/automation/__tests__/engine.test.ts (1)
13-50: Filesystem mocks are wired up but never exercised —writeMessagesToFilehas no test coverage.
mkdirMock,writeFileMock,renameMock, andrmMockare declared and registered viavi.mock, but no test callsfinishJobwithmessagesto verify the file-write path. Consider adding at least:
- A happy-path test:
finishJobwith messages → verifywriteFile+renamecalled,messages_uristored in the run insert.- An error-path test:
writeFilerejects → verifyrmcleanup attempted,messages_uriisnull, andfinishJobstill completes.packages/automation-engine/src/__tests__/readMessagesFromUri.test.ts (1)
67-78: Consider adding a test for non-file://URI schemes.The non-array JSON test is good. One gap: there's no test for URIs with unexpected schemes (e.g.,
http://...or a bare path withoutfile://). IfreadMessagesFromUriis expected to reject those, a test would prevent regressions.
Summary
oauth-coreLockboxAdapter from directlockbox.user_secretstable access toSECURITY DEFINERRPC functions (lockbox_get/save/delete/list/exists) in public schema. Fixes PostgREST schema rejection that broke Google OAuth connect/disconnect flows.Verification (post-deploy, all clean)
lockbox_*RPCs executable byservice_roleonly (notauthenticated/anon)node_modules/@webalive/oauth-core/dist/storage.jsgmail OAuth fetch failed/ noschema must be one oferrorsfetch-oauth-tokens.tsorfetch-user-env-keys.tsTest plan
packages/oauth-core/src/__tests__/storage.test.ts)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Database
Tests