Skip to content

Commit 6c2b0a9

Browse files
Fix descriptor closed error caused by workers restart while request is in progress (#1970)
A `descriptor closed` error happens when the node renderer restarts while it's still handling an in-progress request (especially if it's a streaming request that may take more time to handle). Implemented a fix that master doesn't kill the worker directly; instead, it sends a graceful shutdown message to the worker. When a worker receives a shutdown message, it disconnects from the cluster, preventing it from receiving further requests. Then, the worker waits until all active requests are handled and then shuts itself down. The master initiates a new worker. When a worker receives the shutdown message, if it doesn't shut down during the `gracefulWorkerRestartTimeout`, the master forcibly kills it.
1 parent 40dab84 commit 6c2b0a9

File tree

11 files changed

+141
-38
lines changed

11 files changed

+141
-38
lines changed

react_on_rails_pro/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@ You can find the **package** version numbers from this repo's tags and below in
4040

4141
### Fixed
4242
- Fixed an issue where, when React Server Components (RSC) support was disabled, the Node Renderer unnecessarily requested bundles on every render. Now, bundles are only requested when actually needed, improving performance and reducing redundant network traffic. [PR 545](https://github.com/shakacode/react_on_rails_pro/pull/545) by [AbanoubGhadban](https://github.com/AbanoubGhadban).
43+
- **Fix `descriptor closed` error**: The errors happens when the node renderer restarts while it's still handling an in-progress request (especially if it's a streaming request that may take more time to handle). Implemented a fix that makes worker shutdown gracefully after it finishes all active requests. When a worker receives the shutdown message, if it doesn't shut down during the `gracefulWorkerRestartTimeout`, the master forcibly kills it. [PR #1970][https://github.com/shakacode/react_on_rails/pull/1970] by [AbanoubGhadban](https://github.com/AbanoubGhadban).
4344

4445
### Changed
4546
- Upgraded HTTPX dependency from 1.3.4 to ~> 1.5 (currently 1.5.1). [PR 520](https://github.com/shakacode/react_on_rails_pro/pull/520) by [AbanoubGhadban](https://github.com/AbanoubGhadban).

react_on_rails_pro/docs/installation.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -129,6 +129,9 @@ const config = {
129129
// allWorkersRestartInterval: 15,
130130
// time in minutes between each worker restarting when restarting all workers
131131
// delayBetweenIndividualWorkerRestarts: 2,
132+
// Also, you can set he parameter gracefulWorkerRestartTimeout to force the worker to restart
133+
// If it's the time for the worker to restart, the worker waits until it serves all active requests before restarting
134+
// If a worker stuck because of a memory leakage or an infinite loop, you can set a timeout that master waits for it before killing the worker
132135
}
133136
134137
// Renderer detects a total number of CPUs on virtual hostings like Heroku

react_on_rails_pro/docs/node-renderer/js-configuration.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ Here are the options available for the JavaScript renderer configuration object,
2222
If no password is set, no authentication will be required.
2323
1. **allWorkersRestartInterval** (default: `env.RENDERER_ALL_WORKERS_RESTART_INTERVAL`) - Interval in minutes between scheduled restarts of all workers. By default restarts are not enabled. If restarts are enabled, `delayBetweenIndividualWorkerRestarts` should also be set.
2424
1. **delayBetweenIndividualWorkerRestarts** (default: `env.RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS`) - Interval in minutes between individual worker restarts (when cluster restart is triggered). By default restarts are not enabled. If restarts are enabled, `allWorkersRestartInterval` should also be set.
25+
1. **gracefulWorkerRestartTimeout**: (default: `env.GRACEFUL_WORKER_RESTART_TIMEOUT`) - Time in seconds that the master waits for a worker to gracefully restart (after serving all active requests) before killing it. Use this when you want to avoid situations where a worker gets stuck in an infinite loop and never restarts. This config is only usable if worker restart is enabled. The timeout starts when the worker should restart; if it elapses without a restart, the worker is killed.
2526
1. **maxDebugSnippetLength** (default: 1000) - If the rendering request is longer than this, it will be truncated in exception and logging messages.
2627
1. **supportModules** - (default: `env.RENDERER_SUPPORT_MODULES || null`) - If set to true, `supportModules` enables the server-bundle code to call a default set of NodeJS global objects and functions that get added to the VM context:
2728
`{ Buffer, TextDecoder, TextEncoder, URLSearchParams, ReadableStream, process, setTimeout, setInterval, setImmediate, clearTimeout, clearInterval, clearImmediate, queueMicrotask }`.
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
# Node Renderer Troubleshooting
22

33
- If you enabled restarts (having `allWorkersRestartInterval` and `delayBetweenIndividualWorkerRestarts`), you should set it with a high number to avoid the app from crashing because all Node renderer workers are stopped/killed.
4+
5+
- If your app contains streamed pages that take too much time to be streamed to the client, ensure to not set the `gracefulWorkerRestartTimeout` parameter or set to a high number, so the worker is not killed while it's still serving an active request.

react_on_rails_pro/packages/node-renderer/src/master.ts

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,12 @@ export = function masterRun(runningConfig?: Partial<Config>) {
1919

2020
// Store config in app state. From now it can be loaded by any module using getConfig():
2121
const config = buildConfig(runningConfig);
22-
const { workersCount, allWorkersRestartInterval, delayBetweenIndividualWorkerRestarts } = config;
22+
const {
23+
workersCount,
24+
allWorkersRestartInterval,
25+
delayBetweenIndividualWorkerRestarts,
26+
gracefulWorkerRestartTimeout,
27+
} = config;
2328

2429
logSanitizedConfig();
2530

@@ -48,9 +53,15 @@ export = function masterRun(runningConfig?: Partial<Config>) {
4853
allWorkersRestartInterval,
4954
delayBetweenIndividualWorkerRestarts,
5055
);
51-
setInterval(() => {
52-
restartWorkers(delayBetweenIndividualWorkerRestarts);
53-
}, allWorkersRestartInterval * MILLISECONDS_IN_MINUTE);
56+
57+
const allWorkersRestartIntervalMS = allWorkersRestartInterval * MILLISECONDS_IN_MINUTE;
58+
const scheduleWorkersRestart = () => {
59+
void restartWorkers(delayBetweenIndividualWorkerRestarts, gracefulWorkerRestartTimeout).finally(() => {
60+
setTimeout(scheduleWorkersRestart, allWorkersRestartIntervalMS);
61+
});
62+
};
63+
64+
setTimeout(scheduleWorkersRestart, allWorkersRestartIntervalMS);
5465
} else if (allWorkersRestartInterval || delayBetweenIndividualWorkerRestarts) {
5566
log.error(
5667
"Misconfiguration, please provide both 'allWorkersRestartInterval' and " +

react_on_rails_pro/packages/node-renderer/src/master/restartWorkers.ts

Lines changed: 39 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@
55

66
import cluster from 'cluster';
77
import log from '../shared/log';
8+
import { SHUTDOWN_WORKER_MESSAGE } from '../shared/utils';
89

910
const MILLISECONDS_IN_MINUTE = 60000;
1011

@@ -14,26 +15,47 @@ declare module 'cluster' {
1415
}
1516
}
1617

17-
export = function restartWorkers(delayBetweenIndividualWorkerRestarts: number) {
18+
export = async function restartWorkers(
19+
delayBetweenIndividualWorkerRestarts: number,
20+
gracefulWorkerRestartTimeout: number | undefined,
21+
) {
1822
log.info('Started scheduled restart of workers');
1923

20-
let delay = 0;
2124
if (!cluster.workers) {
2225
throw new Error('No workers to restart');
2326
}
24-
Object.values(cluster.workers).forEach((worker) => {
25-
const killWorker = () => {
26-
if (!worker) return;
27-
log.debug('Kill worker #%d', worker.id);
28-
// eslint-disable-next-line no-param-reassign -- necessary change
29-
worker.isScheduledRestart = true;
30-
worker.destroy();
31-
};
32-
setTimeout(killWorker, delay);
33-
delay += delayBetweenIndividualWorkerRestarts * MILLISECONDS_IN_MINUTE;
34-
});
35-
36-
setTimeout(() => {
37-
log.info('Finished scheduled restart of workers');
38-
}, delay);
27+
for (const worker of Object.values(cluster.workers).filter((w) => !!w)) {
28+
log.debug('Kill worker #%d', worker.id);
29+
worker.isScheduledRestart = true;
30+
31+
worker.send(SHUTDOWN_WORKER_MESSAGE);
32+
33+
// It's inteded to restart worker in sequence, it shouldn't happens in parallel
34+
// eslint-disable-next-line no-await-in-loop
35+
await new Promise<void>((resolve) => {
36+
let timeout: NodeJS.Timeout;
37+
38+
const onExit = () => {
39+
clearTimeout(timeout);
40+
resolve();
41+
};
42+
worker.on('exit', onExit);
43+
44+
// Zero means no timeout
45+
if (gracefulWorkerRestartTimeout) {
46+
timeout = setTimeout(() => {
47+
log.debug('Worker #%d timed out, forcing kill it', worker.id);
48+
worker.destroy();
49+
worker.off('exit', onExit);
50+
resolve();
51+
}, gracefulWorkerRestartTimeout);
52+
}
53+
});
54+
// eslint-disable-next-line no-await-in-loop
55+
await new Promise((resolve) => {
56+
setTimeout(resolve, delayBetweenIndividualWorkerRestarts * MILLISECONDS_IN_MINUTE);
57+
});
58+
}
59+
60+
log.info('Finished scheduled restart of workers');
3961
};

react_on_rails_pro/packages/node-renderer/src/shared/configBuilder.ts

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,9 @@ export interface Config {
5858
allWorkersRestartInterval: number | undefined;
5959
// Time in minutes between each worker restarting when restarting all workers
6060
delayBetweenIndividualWorkerRestarts: number | undefined;
61+
// Time in seconds to wait for worker to restart before killing it
62+
// Set it to 0 or undefined to never kill the worker
63+
gracefulWorkerRestartTimeout: number | undefined;
6164
// If the rendering request is longer than this, it will be truncated in exception and logging messages
6265
maxDebugSnippetLength: number;
6366
// @deprecated See https://www.shakacode.com/react-on-rails-pro/docs/node-renderer/error-reporting-and-tracing.
@@ -165,6 +168,10 @@ const defaultConfig: Config = {
165168
? parseInt(env.RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS, 10)
166169
: undefined,
167170

171+
gracefulWorkerRestartTimeout: env.GRACEFUL_WORKER_RESTART_TIMEOUT
172+
? parseInt(env.GRACEFUL_WORKER_RESTART_TIMEOUT, 10)
173+
: undefined,
174+
168175
maxDebugSnippetLength: MAX_DEBUG_SNIPPET_LENGTH,
169176

170177
// default to true if empty, otherwise it is set to false
@@ -195,6 +202,8 @@ function envValuesUsed() {
195202
RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS:
196203
!userConfig.delayBetweenIndividualWorkerRestarts &&
197204
env.RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS,
205+
GRACEFUL_WORKER_RESTART_TIMEOUT:
206+
!userConfig.gracefulWorkerRestartTimeout && env.GRACEFUL_WORKER_RESTART_TIMEOUT,
198207
INCLUDE_TIMER_POLYFILLS: !('includeTimerPolyfills' in userConfig) && env.INCLUDE_TIMER_POLYFILLS,
199208
REPLAY_SERVER_ASYNC_OPERATION_LOGS:
200209
!userConfig.replayServerAsyncOperationLogs && env.REPLAY_SERVER_ASYNC_OPERATION_LOGS,
@@ -209,6 +218,7 @@ function sanitizedSettings(aConfig: Partial<Config> | undefined, defaultValue?:
209218
password: aConfig.password != null ? '<MASKED>' : defaultValue,
210219
allWorkersRestartInterval: aConfig.allWorkersRestartInterval || defaultValue,
211220
delayBetweenIndividualWorkerRestarts: aConfig.delayBetweenIndividualWorkerRestarts || defaultValue,
221+
gracefulWorkerRestartTimeout: aConfig.gracefulWorkerRestartTimeout || defaultValue,
212222
}
213223
: {};
214224
}

react_on_rails_pro/packages/node-renderer/src/shared/utils.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,8 @@ import type { RenderResult } from '../worker/vm';
1111

1212
export const TRUNCATION_FILLER = '\n... TRUNCATED ...\n';
1313

14+
export const SHUTDOWN_WORKER_MESSAGE = 'NODE_RENDERER_SHUTDOWN_WORKER';
15+
1416
export function workerIdLabel() {
1517
// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition -- worker is nullable in the primary process
1618
return cluster?.worker?.id || 'NO WORKER ID';

react_on_rails_pro/packages/node-renderer/src/worker.ts

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ import type { FastifyInstance, FastifyReply, FastifyRequest } from './worker/typ
1717
import checkProtocolVersion from './worker/checkProtocolVersionHandler';
1818
import authenticate from './worker/authHandler';
1919
import { handleRenderRequest, type ProvidedNewBundle } from './worker/handleRenderRequest';
20+
import handleGracefulShutdown from './worker/handleGracefulShutdown';
2021
import {
2122
errorResponseResult,
2223
formatExceptionMessage,
@@ -127,6 +128,8 @@ export default function run(config: Partial<Config>) {
127128
...fastifyServerOptions,
128129
});
129130

131+
handleGracefulShutdown(app);
132+
130133
// We shouldn't have unhandled errors here, but just in case
131134
app.addHook('onError', (req, res, err, done) => {
132135
// Not errorReporter.error so that integrations can decide how to log the errors.
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
import cluster from 'cluster';
2+
import { FastifyInstance } from './types';
3+
import { SHUTDOWN_WORKER_MESSAGE } from '../shared/utils';
4+
import log from '../shared/log';
5+
6+
const handleGracefulShutdown = (app: FastifyInstance) => {
7+
const { worker } = cluster;
8+
if (!worker) {
9+
log.error('handleGracefulShutdown is called on master, expected to call it on worker only');
10+
return;
11+
}
12+
13+
let activeRequestsCount = 0;
14+
let isShuttingDown = false;
15+
16+
process.on('message', (msg) => {
17+
if (msg === SHUTDOWN_WORKER_MESSAGE) {
18+
log.debug('Worker #%d received graceful shutdown message', worker.id);
19+
isShuttingDown = true;
20+
if (activeRequestsCount === 0) {
21+
log.debug('Worker #%d has no active requests, killing the worker', worker.id);
22+
worker.destroy();
23+
} else {
24+
log.debug(
25+
'Worker #%d has "%d" active requests, disconnecting the worker',
26+
worker.id,
27+
activeRequestsCount,
28+
);
29+
worker.disconnect();
30+
}
31+
}
32+
});
33+
34+
app.addHook('onRequest', (_req, _reply, done) => {
35+
activeRequestsCount += 1;
36+
done();
37+
});
38+
39+
app.addHook('onResponse', (_req, _reply, done) => {
40+
activeRequestsCount -= 1;
41+
if (isShuttingDown && activeRequestsCount === 0) {
42+
log.debug('Worker #%d served all active requests and going to be killed', worker.id);
43+
worker.destroy();
44+
}
45+
done();
46+
});
47+
};
48+
49+
export default handleGracefulShutdown;

0 commit comments

Comments
 (0)