
Conversation

@amikofalvy
Collaborator

Summary

This PRD defines a new feature for time-based agent triggers, enabling:

  • Recurring schedules via standard 5-field cron expressions
  • One-time delayed execution for deferred tasks
  • Exactly-once execution guarantees using Postgres advisory locks
  • Retry with exponential backoff for transient failures
  • Full observability with trace linkage, execution history, and debugging

Key Design Decisions

  1. Internal scheduler over Vercel Cron - Vercel cron requires static vercel.json config, unsuitable for dynamic multi-tenant schedules
  2. Postgres-backed durable queues using existing @workflow/* infrastructure
  3. UTC-only scheduling for v1 (no timezone/DST complexity)

Architecture Diagrams Included

The PRD contains 13 Mermaid diagrams covering:

  • High-level system overview
  • Scheduler worker architecture
  • Invocation state machine
  • Database entity relationships
  • Exactly-once execution guarantees
  • Retry flow with exponential backoff
  • Leader election
  • User journey flows

Surfaces Impacted

  • API (manage + run domains)
  • TypeScript SDK (scheduledTrigger() builder)
  • CLI (pull/push support)
  • Manage UI (trigger forms, invocation history)
  • Observability (new trace spans)

Open Questions for Review

  1. Missed execution handling (catch-up vs skip)
  2. Execution overlap policy (queue vs skip)
  3. Alert webhook retry logic
  4. Natural language parsing approach (LLM vs regex)

🤖 Generated with Claude Code

This PRD defines time-based agent triggers with:
- Recurring schedules via cron expressions
- One-time delayed execution
- Exactly-once execution guarantees
- Retry with exponential backoff
- Full observability integration
- Multi-surface support (API, UI, SDK, CLI)

Includes comprehensive architecture diagrams for review.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@changeset-bot

changeset-bot bot commented Jan 27, 2026

⚠️ No Changeset found

Latest commit: cc8a5c3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@vercel

vercel bot commented Jan 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Ready | Preview, Comment | Jan 27, 2026 10:00pm |
| agents-docs | Ready | Preview, Comment | Jan 27, 2026 10:00pm |
| agents-manage-ui | Ready | Preview, Comment | Jan 27, 2026 10:00pm |


@claude

claude bot commented Jan 27, 2026

Claude finished @amikofalvy's task


Technical Review: PRD - Scheduled Agent Triggers (Cron Jobs)

As a TypeScript Staff Engineer and System Architect, I've conducted a thorough technical review of this PRD. The design is well-thought-out and follows existing patterns, but there are several architectural considerations and potential improvements to address.

🏗️ Excellent Design Decisions

Proper Architecture Separation

  • Two-database approach correctly separates configuration (manage DB) from runtime state (runtime DB)
  • Exactly-once semantics using Postgres advisory locks is the right choice for reliability
  • Reusing ExecutionHandler maintains consistency with existing webhook trigger execution

Strong Observability

  • Comprehensive tracing attributes following existing invocation.type patterns
  • Trace linkage via traceId enables debugging across the stack
  • OpenTelemetry spans align with current architecture

Sound Multi-tenancy

  • Proper scoping patterns consistent with existing triggers
  • Advisory lock keys include tenant context to prevent cross-tenant interference

⚠️ Critical Technical Concerns

🔴 Database Schema Issues

Problem 1: Advisory Lock Collision Risk

-- PRD suggests using just job ID for advisory locks
pg_try_advisory_xact_lock(hashtext(id))

Risk: Hash collisions across tenants could cause lock contention between unrelated jobs.

Recommendation:

-- Include tenant/project context in lock key
pg_try_advisory_xact_lock(hashtext(tenant_id || ':' || project_id || ':' || id))
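One way to make the tenant scoping explicit is the two-argument form `pg_try_advisory_xact_lock(key1 int, key2 int)`, with one key derived from the tenant/project scope and one from the job ID. The sketch below is illustrative only: `fnv1a32` and `advisoryLockKey` are hypothetical helpers, not part of the PRD, and any 32-bit hash can still collide, so the lock should be treated as a coordination aid rather than the sole uniqueness guarantee.

```typescript
// Hypothetical helper: 32-bit FNV-1a over UTF-16 code units.
// Returns a signed 32-bit integer, which fits a Postgres int4 argument.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash | 0;
}

interface AdvisoryLockKey {
  scopeKey: number; // derived from tenant + project
  jobKey: number;   // derived from the job/invocation ID
}

// Hypothetical: builds the (key1, key2) pair for
// pg_try_advisory_xact_lock(key1 int, key2 int).
function advisoryLockKey(
  tenantId: string,
  projectId: string,
  jobId: string
): AdvisoryLockKey {
  return {
    scopeKey: fnv1a32(`${tenantId}:${projectId}`),
    jobKey: fnv1a32(jobId),
  };
}
```

The pair would then be passed as bind parameters, for example `pg_try_advisory_xact_lock($1, $2)`.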

Problem 2: Missing Critical Indexes
The runtime schema is missing indexes for scheduler polling patterns:

Current in PRD:

index('sched_invocations_pending_idx').on(table.status, table.scheduledFor)

Missing indexes needed:

// For claim + lock operations in a single query
index('sched_invocations_claim_idx').on(table.status, table.scheduledFor, table.tenantId),
// For an efficient leader election pattern
index('sched_invocations_leader_idx').on(table.tenantId, table.projectId, table.status)

Problem 3: Idempotency Key Design

// PRD suggests: triggerId + scheduledFor
const idempotencyKey = `${triggerId}-${scheduledFor}`

Issue: This doesn't account for manual re-runs or retry attempts, potentially causing duplicates.

Recommendation:

// Include attempt context for better deduplication
const idempotencyKey = `${triggerId}-${scheduledFor.toISOString()}-${contextId || 'auto'}`

🟡 System Design Concerns

Scheduler Scalability Pattern
The leader election approach using pg_try_advisory_lock(scheduler_leader_key) is simple but has limitations:

Current Issues:

  • Single leader bottleneck for job creation across ALL tenants
  • No graceful failover mechanism described
  • Potential thundering herd on leader restart

Recommendation: Consider tenant-sharded leader election:

-- Per-tenant leader locks enable horizontal scaling
pg_try_advisory_lock(hashtext('scheduler_leader:' || tenant_id))

Polling Interval Trade-offs
A fixed 10-second polling interval has costs in both directions:

  • High database load with hundreds of active schedules
  • 10s latency is significant for time-sensitive triggers

Suggestion: Implement adaptive polling with exponential backoff when no jobs are ready.
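A minimal sketch of what adaptive polling could look like, assuming the worker tracks its current delay between polls. The names `PollConfig` and `nextPollDelay` and the example bounds are hypothetical, not taken from the PRD.

```typescript
interface PollConfig {
  minDelayMs: number; // delay used while work is being found
  maxDelayMs: number; // upper bound while the queue is idle
  factor: number;     // backoff multiplier applied after an empty poll
}

// Returns the delay to wait before the next poll.
// Any poll that finds work resets to the minimum delay;
// an empty poll grows the delay geometrically up to the maximum.
function nextPollDelay(
  currentDelayMs: number,
  jobsFound: number,
  config: PollConfig
): number {
  if (jobsFound > 0) {
    return config.minDelayMs;
  }
  const grown = Math.max(config.minDelayMs, currentDelayMs * config.factor);
  return Math.min(config.maxDelayMs, grown);
}
```

With `{ minDelayMs: 1000, maxDelayMs: 30000, factor: 2 }`, an idle worker would back off 1s, 2s, 4s, and so on up to 30s, then drop back to 1s as soon as a poll returns work.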


🔧 Implementation Quality Issues

Missing Error Classifications

The PRD doesn't distinguish between error types for retry logic:

Current: Generic retry with exponential backoff
Needed: Error-specific retry strategies:

enum SchedulerError {
  TRANSIENT = 'transient',     // Retry with backoff
  AGENT_ERROR = 'agent_error', // Retry with limit
  CONFIG_ERROR = 'config',     // No retry
  RATE_LIMIT = 'rate_limit'    // Custom backoff
}
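To show how such a classification might drive retries, here is a hedged sketch. It uses a string-literal union in place of the enum above, and every constant (base delay, cap, attempt limits) and the `decideRetry` helper are assumptions chosen for illustration.

```typescript
type ErrorClass = 'transient' | 'agent_error' | 'config' | 'rate_limit';

interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

// Illustrative values only; the PRD does not specify them.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60000;
const TRANSIENT_MAX_ATTEMPTS = 5;
const AGENT_ERROR_MAX_ATTEMPTS = 3;

const NO_RETRY: RetryDecision = { retry: false, delayMs: 0 };

// `attempt` is the 1-based number of the attempt that just failed.
// `retryAfterMs` is an optional hint from a rate-limited dependency.
function decideRetry(
  errorClass: ErrorClass,
  attempt: number,
  retryAfterMs?: number
): RetryDecision {
  // Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY_MS.
  const backoff = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt - 1));

  if (errorClass === 'config') {
    return NO_RETRY; // retrying cannot fix a bad configuration
  }
  if (errorClass === 'rate_limit') {
    return { retry: true, delayMs: retryAfterMs === undefined ? backoff : retryAfterMs };
  }
  const limit = errorClass === 'agent_error'
    ? AGENT_ERROR_MAX_ATTEMPTS
    : TRANSIENT_MAX_ATTEMPTS;
  return attempt < limit ? { retry: true, delayMs: backoff } : NO_RETRY;
}
```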

Template Variable Security

The template interpolation in messageTemplate and payload lacks input sanitization:

Risk: Template injection if user input gets into template variables
Recommendation: Implement sandboxed template execution with whitelisted variables only.
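A rough sketch of whitelist-only interpolation follows. The `{{name}}` placeholder syntax, the variable names, and `renderTemplate` are all assumptions, since the PRD's actual template syntax is not quoted in this review.

```typescript
// Hypothetical whitelist; the real variable set would come from the PRD.
const ALLOWED_VARIABLES = new Set<string>(['triggerId', 'triggerName', 'scheduledFor']);

// Replaces {{name}} placeholders with values, rejecting any name that is
// not whitelisted. Substitution is a single pass, so placeholder-like text
// inside a substituted value is never expanded again.
function renderTemplate(template: string, values: Record<string, string>): string {
  const placeholder = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;
  return template.replace(placeholder, (_match: string, name: string) => {
    if (!ALLOWED_VARIABLES.has(name)) {
      throw new Error(`Template variable "${name}" is not allowed`);
    }
    const value = values[name];
    if (value === undefined) {
      throw new Error(`Template variable "${name}" has no value`);
    }
    return value;
  });
}
```

Because there is no expression evaluation at all, only lookup of known names, user-supplied values cannot reach anything beyond the provided `values` map.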

Job Claiming Race Conditions

The current claiming algorithm has a subtle race condition:

PRD Algorithm:

-- Step 1: SELECT with lock
SELECT * FROM invocations WHERE ... AND pg_try_advisory_xact_lock(...)
-- Step 2: UPDATE status
UPDATE invocations SET status = 'running'

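Issue: pg_try_advisory_xact_lock is released when its transaction ends, so if the SELECT and UPDATE do not run in the same transaction, another worker could claim the same row between the two statements.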

Fix: Use single atomic operation:

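-- Claim up to 100 due invocations in a single statement.
-- Caveat: Postgres may evaluate the lock function on more rows than LIMIT
-- returns; any extra advisory locks are released when the transaction ends.
UPDATE invocations
SET status = 'running', started_at = NOW()
WHERE id IN (
  SELECT id FROM invocations
  WHERE status = 'pending'
    AND scheduled_for <= NOW()
    AND pg_try_advisory_xact_lock(hashtext(tenant_id || ':' || id))
  LIMIT 100
)
RETURNING *;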

📊 Surface Area & Integration Analysis

API Design Quality

The REST API endpoints follow existing patterns well, but some key features are missing:

Missing Endpoints:

  • GET /scheduled-triggers/{id}/next-runs - Preview next N execution times
  • POST /scheduled-triggers/{id}/test - Dry-run execution for debugging
  • GET /health/scheduler - Scheduler health check endpoint

SDK Builder Pattern

The scheduledTrigger() builder needs refinement:

Current PRD:

export function scheduledTrigger(config: ScheduledTriggerConfig): ScheduledTrigger

Enhanced Design:

// Assumes the cron-parser v4 API (parseExpression, iterate).
import * as cronParser from 'cron-parser';

export function scheduledTrigger(config: ScheduledTriggerConfig): ScheduledTrigger {
  // Runtime validation for better DX: parseExpression throws on an invalid expression
  if (config.cronExpression) {
    const nextRuns = cronParser.parseExpression(config.cronExpression)
      .iterate(5).map(d => d.toDate());
    console.log(`Next 5 runs: ${nextRuns.map(d => d.toISOString()).join(', ')}`);
  }
  return new ScheduledTrigger(config);
}

CLI Integration Gaps

The PRD mentions pull/push support but doesn't address:

  • Validation feedback: How does CLI show cron validation errors?
  • Status display: Should agents status show pending scheduled executions?
  • Time zone handling: How does CLI handle user's local time vs UTC?
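For the validation-feedback question, one option is a dependency-free check that returns human-readable messages the CLI can print. The sketch below is an assumption about how that might look: `validateCron` is hypothetical and handles only numeric 5-field syntax (`*`, values, ranges, lists, steps), not month or weekday names.

```typescript
interface CronField {
  name: string;
  min: number;
  max: number;
}

const CRON_FIELDS: CronField[] = [
  { name: 'minute', min: 0, max: 59 },
  { name: 'hour', min: 0, max: 23 },
  { name: 'day of month', min: 1, max: 31 },
  { name: 'month', min: 1, max: 12 },
  { name: 'day of week', min: 0, max: 7 }, // 0 and 7 both commonly mean Sunday
];

// Matches "*", "5", or "1-5", each optionally followed by "/step".
const ITEM_PATTERN = /^(?:\*|(\d+)(?:-(\d+))?)(?:\/(\d+))?$/;

// Returns a list of error messages; an empty list means the expression passed.
function validateCron(expression: string): string[] {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== CRON_FIELDS.length) {
    return [`expected 5 fields, got ${parts.length}`];
  }
  const errors: string[] = [];
  parts.forEach((part, index) => {
    const field = CRON_FIELDS[index];
    for (const item of part.split(',')) {
      const match = ITEM_PATTERN.exec(item);
      if (match === null) {
        errors.push(`${field.name}: cannot parse "${item}"`);
        continue;
      }
      const bounds = [match[1], match[2]]
        .filter((value) => value !== undefined)
        .map(Number);
      const step = match[3] === undefined ? 1 : Number(match[3]);
      const outOfRange = bounds.some((n) => n < field.min || n > field.max);
      const reversed = bounds.length === 2 && bounds[0] > bounds[1];
      if (outOfRange || reversed || step < 1) {
        errors.push(
          `${field.name}: "${item}" is invalid (allowed ${field.min}-${field.max}, step >= 1)`
        );
      }
    }
  });
  return errors;
}
```

A CLI `push` command could print each returned message next to the offending trigger and exit non-zero when the list is non-empty.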

🚀 Open Questions - Architectural Decisions

1. Execution Overlap Policy

Question: What should happen when a cron job is still running at its next scheduled time?

Options Analysis:

  • A. Queue: Simple but can cause resource exhaustion
  • B. Skip: Safe but users lose executions
  • C. Configurable: Best UX but adds complexity

Recommendation: Start with Skip (option B) for safety, add configuration later.
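A small sketch of how a skip-by-default policy could be expressed while leaving room for configuration later. `OverlapPolicy`, `DueRun`, and `planDueRun` are hypothetical names used only for illustration.

```typescript
type OverlapPolicy = 'skip' | 'queue';

interface DueRun {
  triggerId: string;
  scheduledFor: Date;
}

// Decides the initial status of a run that has just become due.
// `runningTriggerIds` holds triggers that still have an invocation in flight.
function planDueRun(
  run: DueRun,
  runningTriggerIds: Set<string>,
  policy: OverlapPolicy = 'skip'
): 'pending' | 'skipped' {
  const overlaps = runningTriggerIds.has(run.triggerId);
  if (overlaps && policy === 'skip') {
    return 'skipped'; // recorded for history, but never executed
  }
  return 'pending';
}
```

Recording skipped runs as their own status, rather than silently dropping them, would keep the invocation history honest about what did not execute.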

2. Missed Execution Handling

Question: How should executions missed during scheduler downtime be handled?

Technical Analysis:

  • Catch-up mode: Complex to implement correctly, risk of execution storms
  • Skip mode: Simple but users might miss important scheduled work

Recommendation: Implement limited catch-up - only catch up if downtime < 1 hour, skip older executions.
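The limited catch-up rule could be sketched as a simple partition of missed run times. This reads the one-hour limit as applying to each missed execution's age, which is one possible interpretation of the recommendation; `partitionMissedRuns` and the constant name are hypothetical.

```typescript
const CATCH_UP_WINDOW_MS = 60 * 60 * 1000; // 1 hour, per the recommendation above

interface CatchUpPlan {
  catchUp: Date[]; // missed runs recent enough to execute now
  skipped: Date[]; // missed runs older than the window
}

// Splits missed scheduled times into those inside and outside the window.
function partitionMissedRuns(
  missedRuns: Date[],
  now: Date,
  windowMs: number = CATCH_UP_WINDOW_MS
): CatchUpPlan {
  const cutoff = now.getTime() - windowMs;
  const plan: CatchUpPlan = { catchUp: [], skipped: [] };
  for (const run of missedRuns) {
    if (run.getTime() >= cutoff) {
      plan.catchUp.push(run);
    } else {
      plan.skipped.push(run);
    }
  }
  return plan;
}
```

Bounding the window also bounds the worst-case execution storm: a schedule that fires every minute can produce at most 60 catch-up runs after an outage.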

3. Natural Language Parsing

Question: LLM vs regex for "every Monday at 9am" parsing?

Implementation Trade-offs:

  • LLM approach: Better UX, higher accuracy, but adds latency/cost
  • Regex approach: Fast, predictable, but limited patterns

Recommendation: Hybrid approach - regex for common patterns, LLM fallback for complex expressions.
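To make the hybrid idea concrete, here is a hedged sketch: a regex handles one common phrasing ("every Monday at 9am", "every day at 5:30pm") and anything else is handed to a fallback. `parseCommonSchedule`, `toCron`, and the `llmFallback` callback are hypothetical; the fallback's output would still need cron validation before use.

```typescript
const DAY_OF_WEEK: Record<string, number> = {
  sunday: 0, monday: 1, tuesday: 2, wednesday: 3,
  thursday: 4, friday: 5, saturday: 6,
};

const COMMON_PATTERN =
  /^every\s+(day|sunday|monday|tuesday|wednesday|thursday|friday|saturday)\s+at\s+(\d{1,2})(?::(\d{2}))?\s*(am|pm)$/i;

// Returns a 5-field cron expression (UTC), or null if the text
// does not match the supported pattern.
function parseCommonSchedule(text: string): string | null {
  const match = COMMON_PATTERN.exec(text.trim());
  if (match === null) {
    return null;
  }
  const dayWord = match[1].toLowerCase();
  let hour = Number(match[2]);
  const minute = match[3] === undefined ? 0 : Number(match[3]);
  const meridiem = match[4].toLowerCase();
  if (hour < 1 || hour > 12 || minute > 59) {
    return null;
  }
  // Convert 12-hour clock to 24-hour clock.
  if (meridiem === 'pm' && hour !== 12) hour += 12;
  if (meridiem === 'am' && hour === 12) hour = 0;
  const dayField = dayWord === 'day' ? '*' : String(DAY_OF_WEEK[dayWord]);
  return `${minute} ${hour} * * ${dayField}`;
}

// Regex first; the (hypothetical) LLM call is only made on a miss.
async function toCron(
  text: string,
  llmFallback: (input: string) => Promise<string>
): Promise<string> {
  const fast = parseCommonSchedule(text);
  return fast !== null ? fast : llmFallback(text);
}
```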


🎯 High-Priority Recommendations

1. Database Schema Fixes (Critical)

  • Fix advisory lock key collision risk
  • Add missing indexes for scheduler queries
  • Improve idempotency key design

2. Error Handling Enhancement (High)

  • Implement error classification for smart retries
  • Add circuit breaker pattern for external dependencies
  • Sanitize template variable interpolation

3. Scalability Preparation (Medium)

  • Design tenant-sharded leader election
  • Implement adaptive polling intervals
  • Add scheduler health check endpoints

4. Developer Experience (Medium)

  • Add cron validation with next-run preview in SDK
  • Enhance CLI with time zone conversion helpers
  • Implement dry-run/test execution endpoints

PRD Quality Assessment

Strengths:

  • ✅ Comprehensive user stories with clear acceptance criteria
  • ✅ Detailed Mermaid diagrams for system flows
  • ✅ Proper consideration of exactly-once semantics
  • ✅ Well-thought-out integration across all surfaces
  • ✅ Good separation of concerns (manage vs runtime)

Areas for Improvement:

  • ⚠️ Missing critical database index specifications
  • ⚠️ Insufficient error classification for retry logic
  • ⚠️ Limited scalability analysis beyond single-worker leader election
  • ⚠️ Security considerations for template interpolation not addressed

Overall Assessment: This is a well-architected PRD that demonstrates strong system design thinking. The technical approach is sound and builds appropriately on existing patterns. With the critical database schema fixes and error handling improvements, this would be a robust foundation for scheduled triggers.

Recommendation: Approve with revisions - address the database schema issues and error classification before implementation.


Remove quotes and special characters that were causing parse errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

## Introduction

Add time-based scheduling capabilities to agent triggers, enabling both recurring schedules (cron-style) and one-time delayed execution. Currently, agents can only be triggered via webhooks or direct API calls. Users need the ability to schedule agents to run automatically—daily reports, periodic data syncs, scheduled reminders, or deferred task execution.
Collaborator Author

@amikofalvy amikofalvy Jan 28, 2026


Calling out the requirement I added for deferred task execution. We might want to defer this work item to a later date, but it is worth considering when designing the API for other types of scheduled workflows.


2. **Exactly-Once Execution**: Using Postgres advisory locks and transactional job claiming to prevent duplicate executions, with idempotency keys for downstream operations.

3. **UTC-Only Scheduling**: All schedules stored and executed in UTC. Users convert their local times; no DST complexity in v1.
Collaborator Author


We may want to rethink this requirement. I worry that if we don't handle this now, we will neglect to do it later. It shouldn't be too hard to add in the initial implementation.


2 participants