
Commit e4fc88f

Fix to prevent workflow & step retries repeating infinitely
1 parent c8e2025 commit e4fc88f

10 files changed, +252 -148 lines


ARCHITECTURE.md

Lines changed: 12 additions & 3 deletions
```diff
@@ -234,12 +234,21 @@ execution, replay reaches the failed step and re-executes its function.
 ### 4.2. Workflow Failures & Retries
 
 If an error is unhandled by the workflow code, the entire workflow run fails.
-The workflow run is rescheduled with backoff according to its **retry policy**.
-By default, retries continue until canceled or until a configured deadline is
-reached. If the run can no longer be retried (for example, because the next
+Workflow-level retries are **disabled by default** (`maximumAttempts: 1`): an
+unhandled error immediately marks the run as `failed`. To enable automatic
+workflow-level retries, supply a `retryPolicy` when defining the workflow.
+Set `maximumAttempts: 0` for unlimited retries.
+If the run can no longer be retried (for example, because the next
 retry would exceed `deadlineAt` or `maximumAttempts` has been reached), its
 status is set to `failed` permanently.
 
+When a worker claims a run but does not have the matching workflow definition
+in its registry, this is treated as a deployment concern rather than an
+application failure. The run is rescheduled with its own generous backoff
+policy (5s initial, 5min cap, unlimited attempts) so it remains available
+for a worker that does have the definition — for example during a rolling
+deploy.
+
 ### 4.3. Retry Policy
 
 OpenWorkflow uses the same `RetryPolicy` shape for two separate concerns:
```
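The backoff arithmetic implied by the `RetryPolicy` fields can be sketched as follows. This is a hypothetical model of the documented behavior, not OpenWorkflow's actual implementation; the field names mirror the docs, but intervals are taken as plain milliseconds here for simplicity.

```typescript
interface RetryPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number; // total attempts including the first; 0 = unlimited
}

// Delay before the next attempt, or null when retries are exhausted and the
// run should be marked terminally failed.
function nextRetryDelayMs(
  policy: RetryPolicy,
  failedAttempt: number, // 1-based number of the attempt that just failed
): number | null {
  if (policy.maximumAttempts !== 0 && failedAttempt >= policy.maximumAttempts) {
    return null;
  }
  const raw =
    policy.initialIntervalMs *
    Math.pow(policy.backoffCoefficient, failedAttempt - 1);
  return Math.min(raw, policy.maximumIntervalMs);
}

// The documented step defaults: 1s initial, 2x growth, 100s cap, 10 attempts.
const stepDefaults: RetryPolicy = {
  initialIntervalMs: 1_000,
  backoffCoefficient: 2,
  maximumIntervalMs: 100_000,
  maximumAttempts: 10,
};

console.log(nextRetryDelayMs(stepDefaults, 1)); // 1000: 1s before attempt 2
console.log(nextRetryDelayMs(stepDefaults, 4)); // 8000: ~8s before attempt 5
console.log(nextRetryDelayMs(stepDefaults, 10)); // null: attempts exhausted
```

Note how the delays reproduce the docs' retry table (1s, 2s, 4s, 8s, ...) and how `maximumAttempts: 0` would make the first branch unreachable, giving unlimited retries.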

packages/docs/docs/retries.mdx

Lines changed: 65 additions & 114 deletions
````diff
@@ -3,28 +3,29 @@ title: Retries
 description: Automatic retry behavior for failed steps and workflows
 ---
 
-In any application, things fail sometimes - a third-party API returns a
-500 error, a database connection times out, or a network blip drops a request.
-These are transient failures: they go away on their own if you try again.
+In any app, things fail sometimes - a third-party API returns a 500, a database
+connection times out, or a network blip drops a request. These are transient
+failures that go away if you try again.
 
-OpenWorkflow handles this automatically. When a step throws an error, the
-workflow is rescheduled with an exponential backoff (increasing delays between
-retries). Previously completed steps aren't re-run - only the failed step is
-retried.
+OpenWorkflow handles this automatically by retrying failed steps. When a step
+throws, the workflow is rescheduled with an exponential backoff. Previously
+completed steps aren't re-run, only the failed step re-executes.
 
 ## How Retries Work
 
 When a step throws an error:
 
-1. The step attempt is marked as `failed`
-2. The error is recorded in the database
-3. The workflow is rescheduled with exponential backoff
-4. When the workflow resumes, it replays to the failed step
-5. The step function executes again (not the cached result)
+1. The step attempt is marked as `failed` and the error is recorded
+2. The workflow run is rescheduled with exponential backoff
+3. When the workflow resumes, it replays to the failed step
+4. The step function executes again
 
-## Automatic Retries in Steps
+Steps retry up to 10 times by default. If the step still fails after all
+attempts, the workflow is permanently marked as `failed`.
 
-Steps that throw are automatically retried:
+## Step Retries
+
+Steps that throw are retried automatically:
 
 ```ts
 await step.run({ name: "call-api" }, async () => {
@@ -39,19 +40,18 @@ await step.run({ name: "call-api" }, async () => {
 });
 ```
 
-Each retry:
-
-- Replays the workflow from the beginning
-- Returns cached results for completed steps
-- Re-executes the failed step
+### Step Retry Policy
 
-## Retry Policy
+Each step can define its own retry policy. If omitted, steps use these defaults:
 
-Both steps and workflows use the same retry policy shape. A retry policy
-controls exponential backoff — how long to wait between retries, how fast delays
-grow, and when to stop retrying.
+| Field                | Default  | Description                                                |
+| -------------------- | -------- | ---------------------------------------------------------- |
+| `initialInterval`    | `"1s"`   | Delay before the first retry                               |
+| `backoffCoefficient` | `2`      | Multiplier applied to each subsequent retry delay          |
+| `maximumInterval`    | `"100s"` | Upper bound for retry delay                                |
+| `maximumAttempts`    | `10`     | Total attempts including the initial one (`0` = unlimited) |
 
-With the defaults, retry delays look like this:
+With these defaults, retry delays look like this:
 
 | Attempt | Delay     |
 | ------- | --------- |
@@ -62,23 +62,7 @@ With the defaults, retry delays look like this:
 | 5       | ~8s       |
 | ...     | ...       |
 
-This prevents overwhelming external services during outages. Retries continue
-until canceled, until `deadlineAt` is reached (or the next retry would pass it),
-or until `maximumAttempts` is exhausted.
-
-Retry policies have the following fields:
-
-| Field                | Default    | Description                                         |
-| -------------------- | ---------- | --------------------------------------------------- |
-| `initialInterval`    | `"1s"`     | Delay before the first retry after a failed attempt |
-| `backoffCoefficient` | `2`        | Multiplier applied to each subsequent retry delay   |
-| `maximumInterval`    | `"100s"`   | Upper bound for retry delay                         |
-| `maximumAttempts`    | `Infinity` | Maximum attempts, including the initial one         |
-
-### Step Retry Policy
-
-Each `step.run(...)` can define its own retry policy. If you omit `retryPolicy`,
-OpenWorkflow uses the defaults shown above.
+Override the defaults per step:
 
 ```ts
 await step.run(
@@ -97,11 +81,16 @@ await step.run(
 );
 ```
 
-### Workflow Retry Policy
+Retries also stop early if the workflow has a `deadlineAt` and the next retry
+would exceed it.
+
+## Workflow Retries
 
-Workflow-level `retryPolicy` applies to non-step failures — for example, missing
-workflow definitions or errors thrown outside `step.run`. If you omit
-`retryPolicy` (or individual fields), OpenWorkflow uses the same defaults.
+Errors thrown outside of `step.run(...)` are workflow-level failures.
+**Workflow-level failures are not retried by default** — the workflow is
+marked as `failed`.
+
+To enable workflow-level retries, set a `retryPolicy` on the workflow spec:
 
 ```ts
 import { defineWorkflow } from "openworkflow";
@@ -122,23 +111,38 @@ defineWorkflow(
 );
 ```
 
+<Note>
+  Step retries and workflow retries are independent. Step failures use the
+  step's own retry policy. The workflow retry policy only applies to errors
+  thrown outside steps.
+</Note>
+
+## Missing Workflow Definitions
+
+If a worker claims a run but doesn't have the matching workflow registered, it
+reschedules the run with exponential backoff (starting at 5s, capped at 5min).
+This keeps the run alive during rolling deploys or multi-worker setups where the
+right worker hasn't started yet.
+
+Once a worker with the correct definition comes online, it claims the run and
+executes normally.
+
 ## What Triggers a Retry
 
 Retries happen when:
 
-- A step function throws an exception
-- A step function returns a rejected promise
-- The worker crashes during step execution
+- A step function throws an error or returns a rejected promise
+- A worker crashes during step execution (the step is re-executed on recovery)
 
 Retries do **not** happen for:
 
-- Completed steps (they return cached results)
+- Completed steps (cached results are returned)
 - Explicitly canceled workflows
-- Workflows that complete successfully
+- Workflow-level errors (unless a workflow `retryPolicy` is configured)
 
 ## Error Handling
 
-You can catch and handle errors within your workflow:
+You can catch step errors inside a workflow to run fallback logic:
 
 ```ts
 defineWorkflow({ name: "with-error-handling" }, async ({ input, step }) => {
@@ -147,82 +151,29 @@ defineWorkflow({ name: "with-error-handling" }, async ({ input, step }) => {
       await externalApi.call();
     });
   } catch (error) {
-    // Log the error and continue with fallback
-    console.error("API call failed:", error);
-
-    await step.run({ name: "fallback-operation" }, async () => {
+    await step.run({ name: "fallback" }, async () => {
       await fallbackApi.call();
     });
   }
 });
 ```
 
 <Note>
-  When you catch an error, the workflow continues normally. The step is still
-  marked as failed in the database, but the workflow doesn't retry from that
-  point.
+  When you catch an error the workflow continues normally. The step is still
+  recorded as failed, but no retry is triggered.
 </Note>
 
-## Permanent Failures
-
-A workflow is marked as `failed` permanently when it can no longer be retried
-(for example, because `deadlineAt` is reached, the next retry would exceed that
-deadline, or `maximumAttempts` has been reached):
-
-- The error is stored in the workflow run record
-- No more automatic retries occur
-- You can view failed workflows in the dashboard
-- Failed workflows can be manually retried or investigated
-
-## Transient vs. Permanent Errors
-
-Design your steps to distinguish between transient and permanent errors:
+## Terminal Failures
 
-```ts
-await step.run({ name: "call-api" }, async () => {
-  const response = await fetch("https://api.example.com/data");
-
-  if (response.status === 503) {
-    // Transient - throw to trigger retry
-    throw new Error("Service temporarily unavailable");
-  }
-
-  if (response.status === 400) {
-    // Permanent - bad request won't succeed on retry
-    // Handle differently (return error result, cancel workflow, etc.)
-    return { success: false, error: "Invalid request" };
-  }
-
-  return await response.json();
-});
-```
-
-## Best Practices
-
-### Use Meaningful Error Messages
-
-Include context in errors for debugging:
-
-```ts
-await step.run({ name: "fetch-user" }, async () => {
-  const user = await db.users.findOne({ id: input.userId });
-
-  if (!user) {
-    throw new Error(`User not found: ${input.userId}`);
-  }
-
-  return user;
-});
-```
+A workflow is permanently marked `failed` when step retries are exhausted
+(`maximumAttempts` reached) or `deadlineAt` expires.
 
-## Monitoring Retries
+Once terminal, no more automatic retries occur. You can inspect and manually
+retry failed workflows from the [dashboard](/docs/dashboard).
 
-Use the dashboard to monitor workflow health:
+## Monitoring
 
-- View failed workflow runs
-- Inspect step attempt errors
-- See retry history for a workflow
-- Identify patterns in failures
+Use the [dashboard](/docs/dashboard) to monitor retry health:
 
 <CodeGroup>
 ```bash npm
````
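The missing-definition rescheduling described in this diff (5s initial delay, 5min cap, unlimited attempts) can be sketched as a backoff schedule. The growth factor of 2 below is an assumption for illustration; the source only states the initial delay and the cap.

```typescript
// Hypothetical schedule for reclaim attempts on a run whose workflow
// definition is not yet registered on any worker. Assumes 2x growth.
function missingDefinitionDelayMs(failedClaims: number): number {
  const initialMs = 5_000; // 5s, per the docs
  const capMs = 300_000; // 5min cap, per the docs
  return Math.min(initialMs * 2 ** (failedClaims - 1), capMs);
}

// The run is retried at 5s, 10s, 20s, ... and plateaus at 5min forever,
// so it stays claimable for however long a rolling deploy takes.
const schedule = [1, 2, 3, 4, 5, 6, 7, 8].map(missingDefinitionDelayMs);
console.log(schedule);
```

Because attempts are unlimited here, there is no terminal-failure branch: the run simply waits at the cap until a worker with the definition comes online.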

packages/openworkflow/CHANGELOG.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -1,5 +1,13 @@
 # openworkflow
 
+## Unreleased
+
+- Fix to prevent workflows retrying indefinitely on default policies
+- Unbounded retries are still supported by setting `retryPolicy.maximumAttempts`
+  to `Infinity` or 0
+- Unregistered workflows are still rescheduled infinitely with backoff instead
+  of failing terminally so runs survive long rolling deploys
+
 ## 0.7.0
 
 - Add configurable workflow and step retry policies (#279, #294)
```
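The changelog note that both `Infinity` and `0` mean "retry forever" suggests a small normalization check. This helper is illustrative only; the name and placement are not from the codebase.

```typescript
// Treat maximumAttempts of 0 or Infinity as "unlimited retries", matching the
// changelog entry above. Any positive finite value is a hard attempt budget.
function isUnlimited(maximumAttempts: number): boolean {
  return maximumAttempts === 0 || maximumAttempts === Infinity;
}

console.log(isUnlimited(0)); // true
console.log(isUnlimited(Infinity)); // true
console.log(isUnlimited(1)); // false: the new workflow default, no retries
```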

packages/openworkflow/backend.testsuite.ts

Lines changed: 10 additions & 5 deletions
```diff
@@ -26,6 +26,11 @@ export interface TestBackendOptions {
  */
 export function testBackend(options: TestBackendOptions): void {
   const { setup, teardown } = options;
+  const RESCHEDULING_RETRY_POLICY = {
+    ...DEFAULT_WORKFLOW_RETRY_POLICY,
+    maximumAttempts: 3,
+  } as const;
+
   describe("Backend", () => {
     let backend: Backend;
 
@@ -837,7 +842,7 @@ export function testBackend(options: TestBackendOptions): void {
         workflowRunId: claimed.id,
         workerId,
         error,
-        retryPolicy: DEFAULT_WORKFLOW_RETRY_POLICY,
+        retryPolicy: RESCHEDULING_RETRY_POLICY,
       });
 
       // rescheduled, not permanently failed
@@ -874,7 +879,7 @@ export function testBackend(options: TestBackendOptions): void {
         workflowRunId: claimed.id,
         workerId,
         error: { message: "first failure" },
-        retryPolicy: DEFAULT_WORKFLOW_RETRY_POLICY,
+        retryPolicy: RESCHEDULING_RETRY_POLICY,
       });
 
       expect(firstFailed.status).toBe("pending");
@@ -895,7 +900,7 @@ export function testBackend(options: TestBackendOptions): void {
         workflowRunId: claimed.id,
         workerId,
         error: { message: "second failure" },
-        retryPolicy: DEFAULT_WORKFLOW_RETRY_POLICY,
+        retryPolicy: RESCHEDULING_RETRY_POLICY,
       });
 
       expect(secondFailed.status).toBe("pending");
@@ -1435,7 +1440,7 @@ export function testBackend(options: TestBackendOptions): void {
         workflowRunId: created.id,
         workerId,
         error: { message: "test error" },
-        retryPolicy: DEFAULT_WORKFLOW_RETRY_POLICY,
+        retryPolicy: RESCHEDULING_RETRY_POLICY,
       });
 
       expect(failed.status).toBe("failed");
@@ -1473,7 +1478,7 @@ export function testBackend(options: TestBackendOptions): void {
         workflowRunId: created.id,
         workerId,
         error: { message: "test error" },
-        retryPolicy: DEFAULT_WORKFLOW_RETRY_POLICY,
+        retryPolicy: RESCHEDULING_RETRY_POLICY,
       });
 
       expect(failed.status).toBe("pending");
```
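The status transitions these tests assert can be modeled as a small pure function. This is an illustrative sketch of the decision, with hypothetical names; the real backend tracks attempt counts in the database rather than taking them as parameters.

```typescript
type RunStatus = "pending" | "failed";

// After a failure, the run is rescheduled ("pending") while attempts remain
// under the supplied retry policy, and becomes terminal ("failed") once the
// attempt budget is spent. A maximumAttempts of 0 means unlimited retries.
function statusAfterFailure(
  attemptsUsed: number, // attempts consumed so far, including the failed one
  maximumAttempts: number,
): RunStatus {
  if (maximumAttempts === 0) return "pending";
  return attemptsUsed < maximumAttempts ? "pending" : "failed";
}

// With the test suite's RESCHEDULING_RETRY_POLICY (maximumAttempts: 3):
console.log(statusAfterFailure(1, 3)); // "pending": first failure reschedules
console.log(statusAfterFailure(3, 3)); // "failed": third failure is terminal
// With the new workflow default (maximumAttempts: 1):
console.log(statusAfterFailure(1, 1)); // "failed": no retries by default
```

This also explains why the tests switched from `DEFAULT_WORKFLOW_RETRY_POLICY` to a local three-attempt policy: under the new default of one attempt, a rescheduling test would never observe `"pending"`.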

packages/openworkflow/client.test.ts

Lines changed: 4 additions & 4 deletions
```diff
@@ -227,19 +227,19 @@ describe("OpenWorkflow", () => {
     expect(claimed).not.toBeNull();
     if (!claimed) throw new Error("workflow run was not claimed");
 
-    // mark as failed (should reschedule))
+    // mark as failed (terminal by default)
     await backend.failWorkflowRun({
       workflowRunId: claimed.id,
       workerId,
       error: { message: "boom" },
       retryPolicy: DEFAULT_WORKFLOW_RETRY_POLICY,
     });
 
-    const rescheduled = await backend.getWorkflowRun({
+    const failedRun = await backend.getWorkflowRun({
       workflowRunId: claimed.id,
     });
-    expect(rescheduled?.status).toBe("pending");
-    expect(rescheduled?.error).toEqual({ message: "boom" });
+    expect(failedRun?.status).toBe("failed");
+    expect(failedRun?.error).toEqual({ message: "boom" });
   });
 
   test("creates workflow run with deadline", async () => {
```
