Skip to content

Conversation

@abcxff
Copy link
Contributor

@abcxff abcxff commented Nov 8, 2025

Fixes RVT-5269

@vercel
Copy link

vercel bot commented Nov 8, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 12, 2025 8:46pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 12, 2025 8:46pm
rivet-inspector Ignored Ignored Preview Nov 12, 2025 8:46pm
rivet-site Ignored Ignored Preview Nov 12, 2025 8:46pm

Copy link
Contributor Author

abcxff commented Nov 8, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@abcxff abcxff changed the title feat: pegboard serverless pool feat(pegboard): move pegboard serverless to wf Nov 8, 2025
@abcxff abcxff requested a review from MasterPtato November 8, 2025 01:25
@abcxff abcxff marked this pull request as ready for review November 8, 2025 01:25
@linear
Copy link

linear bot commented Nov 8, 2025

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 8, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3416

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3416

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3416

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3416

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3416

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3416

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3416

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3416

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3416

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3416

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3416

commit: 9db0849

@claude
Copy link

claude bot commented Nov 8, 2025

PR Review: Move Pegboard Serverless to Workflow

Summary

This PR migrates pegboard serverless functionality from a separate package into the main pegboard package as workflows, and removes the unused internal package. The serverless runner lifecycle is now managed via the gasoline workflow engine.

Critical Issues

1. Corrupted Error Artifact Generation (BLOCKER)

The PR includes 88 malformed error artifact files with corrupted names containing autocomplete markers like raCompletionMarker and partial error codes.

Examples:

  • serverless_runner.invalid_configraCompletionMarker.json
  • seraCompletionMarkerrunner_config.invalid.json
  • serverless_runner.cnot_found.json

Root Cause: The error definitions in errors.rs are correct, but the artifact generation script has a bug creating incremental artifacts.

Action Required:

  • Delete all 86 malformed error artifact files
  • Fix the error artifact generation script
  • Regenerate only valid error artifacts

Code Quality Issues

2. Unnecessary field name in pool.rs:35
Line has redundant field initialization: config: config
Should be: config

3. Debug comment in runtime.rs:1
Leftover comment: runner wf see how signal fail handling

4. Activity Retry Strategy (runner.rs:115-116)
Using MAX values for timeout and retries is concerning. Infinite retries could cause zombie workflows. Is this intentional? Please add a comment explaining the rationale or set reasonable limits.

5. TODO about lifecycle state (runner.rs:229-234)
Known issue about lifecycle state management when connections close unexpectedly. Consider creating a GitHub issue to track this.

Positive Changes

Architecture Simplification

  • Consolidates serverless into main pegboard package
  • Reduces workspace complexity (removed 2 packages)
  • Improves code cohesion

Workflow-Based State Management

  • Provides durable state persistence
  • Better observability
  • Automatic retry and recovery

Security

  • Headers properly validated preventing injection attacks
  • Tokens handled securely

ActivityCtx Enhancements
Added useful methods: find_workflow, get_workflows, state accessor, update_workflow_tags

Test Coverage

No test files found for new serverless workflows. Consider adding:

  • Unit tests for serverless configuration parsing
  • Integration tests for pool scaling logic
  • Tests for graceful draining behavior

Recommendations

  1. CRITICAL: Fix error artifact generation bug
  2. HIGH: Document or fix infinite timeout/retry strategy
  3. MEDIUM: Add tests for serverless workflows
  4. LOW: Clean up debug comment and use shortened field init

Final Verdict

NEEDS CHANGES - Good architecture and code quality, but corrupted error artifacts must be fixed before merging. Once resolved, this is a solid refactoring.

@MasterPtato MasterPtato force-pushed the 11-05-feat_pegboard_serverless_pool branch from ddb0414 to be0a1d2 Compare November 11, 2025 23:41
Comment on lines 532 to 538
ctx.v(2)
.signal(crate::workflows::serverless::pool::Bump {})
.to_workflow::<crate::workflows::serverless::pool::Workflow>()
.tag("namespace_id", input.namespace_id)
.tag("runner_name", input.runner_name_selector.clone())
.send()
.await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to handle WorkflowNotFound gracefully

Comment on lines 55 to 61
ctx.v(2)
.signal(crate::workflows::serverless::pool::Bump {})
.to_workflow::<crate::workflows::serverless::pool::Workflow>()
.tag("namespace_id", input.namespace_id)
.tag("runner_name", res.runner_name_selector.clone())
.send()
.await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to handle WorkflowNotFound gracefully

Comment on lines 619 to 627
if allocate_res.serverless {
ctx.v(2)
.signal(crate::workflows::serverless::pool::Bump {})
.to_workflow::<crate::workflows::serverless::pool::Workflow>()
.tag("namespace_id", input.namespace_id)
.tag("runner_name", input.runner_name_selector.clone())
.send()
.await?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to handle WorkflowNotFound gracefully

Comment on lines 576 to 584
if allocate_res.serverless {
ctx.v(2)
.signal(crate::workflows::serverless::pool::Bump {})
.to_workflow::<crate::workflows::serverless::pool::Workflow>()
.tag("namespace_id", input.namespace_id)
.tag("runner_name", input.runner_name_selector.clone())
.send()
.await?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to handle WorkflowNotFound gracefully

Comment on lines 20 to 25
// Service::new(
// "pegboard_serverless",
// // There should only be one of these, since it's auto-scaling requests
// ServiceKind::Singleton,
// |config, pools| Box::pin(pegboard_serverless::start(config, pools)),
// ),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comment

Comment on lines 48 to 53
ctx.signal(pegboard::workflows::serverless::pool::BumpConfig {})
.to_workflow::<pegboard::workflows::serverless::pool::Workflow>()
.tag("runner_name", body.runner_name)
.tag("namespace_id", body.namespace_id)
.send()
.await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to handle WorkflowNotFound gracefully

Comment on lines +100 to +101
#[timeout = u64::MAX]
#[max_retries = usize::MAX]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong about using this for outbound req retries, the issue is the backoff will never reset and will get longer and longer after each request.

Ideally we only have an increasing backoff for actual errors and not the request stopping normally or failing after drain.

This will require a loop with state tracking last retry and retry count and a resetting mechanism (see pb actor reschedule_actor fn for reference)

Tracking with https://linear.app/rivet-gg/issue/RVT-5274/add-backoff-to-pb-serverless

@claude
Copy link

claude bot commented Nov 12, 2025

PR Review: Move Pegboard Serverless to Workflow

Summary

This PR successfully migrates the pegboard serverless autoscaler from a standalone service to a workflow-based architecture. The refactoring removes ~524 lines from pegboard-serverless package and integrates the functionality into the pegboard workflows using the Gasoline workflow engine.

Code Quality & Architecture

Strengths

  1. Clean Separation of Concerns: The new implementation splits responsibilities across three well-defined workflows (pool.rs, runner.rs, connection.rs)

  2. Improved Reliability: Moving to workflows provides durable state management, automatic retries, and better observability

  3. Package Consolidation: Removes two unused packages

  4. Error Handling: Proper structured errors using the custom RivetError system

  5. API Modernization: The bump autoscaler endpoint now uses proper signal-based workflow communication

Areas for Improvement

1. Critical: Unclosed Event Source on Early Exit (connection.rs:196-233)

When the stream ends early, the source EventSource is not properly closed before returning. This can lead to resource leaks.
Recommendation: Call source.close() before returning in the StreamEnded error case.

2. Potential Race Condition (connection.rs:241-262)

The code attempts to send RunnerDrainStarted signal inline with bypass_signal_from_workflow_I_KNOW_WHAT_IM_DOING(), but if it fails, relies on the workflow to send it durably.
Question: Is there a race where both the activity retry and workflow durable send could succeed?

3. TODO Comment About Lifecycle State (connection.rs:225-232)

Known issue where unexpected connection closures before receiving runner_id can cause lifecycle state problems.
Recommendation: Create tracking issue or implement timeout-based cleanup.

4. Gasoline API Extension

The PR adds signal(), find_workflow(), and get_workflows() methods to ActivityCtx. Activities sending signals breaks typical workflow orchestration patterns.
Question: Is this a deliberate architectural decision? What are the implications for workflow determinism?

5. Infinite Retry Configuration (connection.rs:72-73)

Activity has max_retries = usize::MAX and timeout = u64::MAX without documentation.
Recommendation: Add comment explaining why infinite retries are safe.

Security Considerations

Good: Comprehensive header validation, URL validation, proper base64 error handling

Concern: Admin token added to headers - ensure TLS-only and trusted endpoints. Response bodies in error messages (though properly limited to 512 bytes).

Performance

Good: Efficient streaming queries, concurrent workflow operations, graceful draining

Opportunity: Consider aggressive caching for high-frequency scaling, batch signaling for runner draining

Test Coverage

Missing: No new test files for workflow implementations.
Recommendation: Add unit tests for autoscaling logic, integration tests for state transitions, error handling tests, lifecycle tests

Summary

This is a well-architected refactoring that improves maintainability and reliability. Main concerns are resource cleanup and activities sending signals.

Recommendation: Request changes to address EventSource leak, then approve after review.

High Priority

  • Fix EventSource resource leak on early exit
  • Document lifecycle state TODO
  • Clarify design decision on activities sending signals

Medium Priority

  • Add test coverage
  • Document infinite retry strategy
  • Add runner count validation

Generated with Claude Code

@abcxff abcxff requested a review from MasterPtato November 12, 2025 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants