-
Notifications
You must be signed in to change notification settings - Fork 131
feat(pegboard): move pegboard serverless to wf #3416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
How to use the Graphite Merge QueueAdd the label merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
More templates
@rivetkit/actor
@rivetkit/cloudflare-workers
@rivetkit/core
@rivetkit/db
@rivetkit/framework-base
@rivetkit/next-js
@rivetkit/react
rivetkit
@rivetkit/sql-loader
@rivetkit/engine-runner
@rivetkit/engine-runner-protocol
commit: |
PR Review: Move Pegboard Serverless to WorkflowSummaryThis PR migrates pegboard serverless functionality from a separate package into the main pegboard package as workflows, and removes the unused internal package. The serverless runner lifecycle is now managed via the gasoline workflow engine. Critical Issues1. Corrupted Error Artifact Generation (BLOCKER) The PR includes 88 malformed error artifact files with corrupted names containing autocomplete markers like raCompletionMarker and partial error codes. Examples:
Root Cause: The error definitions in errors.rs are correct, but the artifact generation script has a bug creating incremental artifacts. Action Required:
Code Quality Issues2. Unnecessary field name in pool.rs:35 3. Debug comment in runtime.rs:1 4. Activity Retry Strategy (runner.rs:115-116) 5. TODO about lifecycle state (runner.rs:229-234) Positive ChangesArchitecture Simplification
Workflow-Based State Management
Security
ActivityCtx Enhancements Test CoverageNo test files found for new serverless workflows. Consider adding:
Recommendations
Final VerdictNEEDS CHANGES - Good architecture and code quality, but corrupted error artifacts must be fixed before merging. Once resolved, this is a solid refactoring. |
250de22 to
ddb0414
Compare
ddb0414 to
be0a1d2
Compare
| ctx.v(2) | ||
| .signal(crate::workflows::serverless::pool::Bump {}) | ||
| .to_workflow::<crate::workflows::serverless::pool::Workflow>() | ||
| .tag("namespace_id", input.namespace_id) | ||
| .tag("runner_name", input.runner_name_selector.clone()) | ||
| .send() | ||
| .await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to handle WorkflowNotFound gracefully
| ctx.v(2) | ||
| .signal(crate::workflows::serverless::pool::Bump {}) | ||
| .to_workflow::<crate::workflows::serverless::pool::Workflow>() | ||
| .tag("namespace_id", input.namespace_id) | ||
| .tag("runner_name", res.runner_name_selector.clone()) | ||
| .send() | ||
| .await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to handle WorkflowNotFound gracefully
| if allocate_res.serverless { | ||
| ctx.v(2) | ||
| .signal(crate::workflows::serverless::pool::Bump {}) | ||
| .to_workflow::<crate::workflows::serverless::pool::Workflow>() | ||
| .tag("namespace_id", input.namespace_id) | ||
| .tag("runner_name", input.runner_name_selector.clone()) | ||
| .send() | ||
| .await?; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to handle WorkflowNotFound gracefully
| if allocate_res.serverless { | ||
| ctx.v(2) | ||
| .signal(crate::workflows::serverless::pool::Bump {}) | ||
| .to_workflow::<crate::workflows::serverless::pool::Workflow>() | ||
| .tag("namespace_id", input.namespace_id) | ||
| .tag("runner_name", input.runner_name_selector.clone()) | ||
| .send() | ||
| .await?; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to handle WorkflowNotFound gracefully
| // Service::new( | ||
| // "pegboard_serverless", | ||
| // // There should only be one of these, since it's auto-scaling requests | ||
| // ServiceKind::Singleton, | ||
| // |config, pools| Box::pin(pegboard_serverless::start(config, pools)), | ||
| // ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove comment
| ctx.signal(pegboard::workflows::serverless::pool::BumpConfig {}) | ||
| .to_workflow::<pegboard::workflows::serverless::pool::Workflow>() | ||
| .tag("runner_name", body.runner_name) | ||
| .tag("namespace_id", body.namespace_id) | ||
| .send() | ||
| .await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to handle WorkflowNotFound gracefully
| #[timeout = u64::MAX] | ||
| #[max_retries = usize::MAX] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong about using this for outbound req retries, the issue is the backoff will never reset and will get longer and longer after each request.
Ideally we only have an increasing backoff for actual errors and not the request stopping normally or failing after drain.
This will require a loop with state tracking last retry and retry count and a resetting mechanism (see pb actor reschedule_actor fn for reference)
Tracking with https://linear.app/rivet-gg/issue/RVT-5274/add-backoff-to-pb-serverless
PR Review: Move Pegboard Serverless to WorkflowSummaryThis PR successfully migrates the pegboard serverless autoscaler from a standalone service to a workflow-based architecture. The refactoring removes ~524 lines from pegboard-serverless package and integrates the functionality into the pegboard workflows using the Gasoline workflow engine. Code Quality & ArchitectureStrengths
Areas for Improvement1. Critical: Unclosed Event Source on Early Exit (connection.rs:196-233)When the stream ends early, the source EventSource is not properly closed before returning. This can lead to resource leaks. 2. Potential Race Condition (connection.rs:241-262)The code attempts to send RunnerDrainStarted signal inline with bypass_signal_from_workflow_I_KNOW_WHAT_IM_DOING(), but if it fails, relies on the workflow to send it durably. 3. TODO Comment About Lifecycle State (connection.rs:225-232)Known issue where unexpected connection closures before receiving runner_id can cause lifecycle state problems. 4. Gasoline API ExtensionThe PR adds signal(), find_workflow(), and get_workflows() methods to ActivityCtx. Activities sending signals breaks typical workflow orchestration patterns. 5. Infinite Retry Configuration (connection.rs:72-73)Activity has max_retries = usize::MAX and timeout = u64::MAX without documentation. Security ConsiderationsGood: Comprehensive header validation, URL validation, proper base64 error handling Concern: Admin token added to headers - ensure TLS-only and trusted endpoints. Response bodies in error messages (though properly limited to 512 bytes). PerformanceGood: Efficient streaming queries, concurrent workflow operations, graceful draining Opportunity: Consider aggressive caching for high-frequency scaling, batch signaling for runner draining Test CoverageMissing: No new test files for workflow implementations. SummaryThis is a well-architected refactoring that improves maintainability and reliability. Main concerns are resource cleanup and activities sending signals. Recommendation: Request changes to address EventSource leak, then approve after review. High Priority
Medium Priority
Generated with Claude Code |

Fixes RVT-5269