Fix descriptor closed error caused by workers restart while request is in progress (#1970)

AbanoubGhadban · web-flow · commit 6c2b0a92607e · 2025-11-12T12:50:02.000+02:00
A `descriptor closed` error happens when the node renderer restarts while it's still handling an in-progress request (especially if it's a streaming request that may take more time to handle).
Implemented a fix that master doesn't kill the worker directly; instead, it sends a graceful shutdown message to the worker.
When a worker receives a shutdown message, it disconnects from the cluster, preventing it from receiving further requests.
Then, the worker waits until all active requests are handled and then shuts itself down. The master initiates a new worker.
When a worker receives the shutdown message, if it doesn't shut down during the `gracefulWorkerRestartTimeout`, the master forcibly kills it.
diff --git a/react_on_rails_pro/CHANGELOG.md b/react_on_rails_pro/CHANGELOG.md
@@ -40,6 +40,7 @@ You can find the **package** version numbers from this repo's tags and below in
 
 ### Fixed
 - Fixed an issue where, when React Server Components (RSC) support was disabled, the Node Renderer unnecessarily requested bundles on every render. Now, bundles are only requested when actually needed, improving performance and reducing redundant network traffic. [PR 545](https://github.com/shakacode/react_on_rails_pro/pull/545) by [AbanoubGhadban](https://github.com/AbanoubGhadban).
+- **Fix `descriptor closed` error**: The errors happens when the node renderer restarts while it's still handling an in-progress request (especially if it's a streaming request that may take more time to handle). Implemented a fix that makes worker shutdown gracefully after it finishes all active requests. When a worker receives the shutdown message, if it doesn't shut down during the `gracefulWorkerRestartTimeout`, the master forcibly kills it. [PR #1970][https://github.com/shakacode/react_on_rails/pull/1970] by [AbanoubGhadban](https://github.com/AbanoubGhadban).
 
 ### Changed
 - Upgraded HTTPX dependency from 1.3.4 to ~> 1.5 (currently 1.5.1). [PR 520](https://github.com/shakacode/react_on_rails_pro/pull/520) by [AbanoubGhadban](https://github.com/AbanoubGhadban).
diff --git a/react_on_rails_pro/docs/installation.md b/react_on_rails_pro/docs/installation.md
@@ -129,6 +129,9 @@ const config = {
   // allWorkersRestartInterval: 15,
   // time in minutes between each worker restarting when restarting all workers
   // delayBetweenIndividualWorkerRestarts: 2,
+  // Also, you can set he parameter gracefulWorkerRestartTimeout to force the worker to restart
+  // If it's the time for the worker to restart, the worker waits until it serves all active requests before restarting
+  // If a worker stuck because of a memory leakage or an infinite loop, you can set a timeout that master waits for it before killing the worker
 }
 
 // Renderer detects a total number of CPUs on virtual hostings like Heroku
diff --git a/react_on_rails_pro/docs/node-renderer/js-configuration.md b/react_on_rails_pro/docs/node-renderer/js-configuration.md
@@ -22,6 +22,7 @@ Here are the options available for the JavaScript renderer configuration object,
 If no password is set, no authentication will be required.
 1. **allWorkersRestartInterval** (default: `env.RENDERER_ALL_WORKERS_RESTART_INTERVAL`) - Interval in minutes between scheduled restarts of all workers. By default restarts are not enabled. If restarts are enabled, `delayBetweenIndividualWorkerRestarts` should also be set.
 1. **delayBetweenIndividualWorkerRestarts** (default: `env.RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS`) - Interval in minutes between individual worker restarts (when cluster restart is triggered). By default restarts are not enabled. If restarts are enabled, `allWorkersRestartInterval` should also be set.
+1. **gracefulWorkerRestartTimeout**: (default: `env.GRACEFUL_WORKER_RESTART_TIMEOUT`) - Time in seconds that the master waits for a worker to gracefully restart (after serving all active requests) before killing it. Use this when you want to avoid situations where a worker gets stuck in an infinite loop and never restarts. This config is only usable if worker restart is enabled. The timeout starts when the worker should restart; if it elapses without a restart, the worker is killed.
 1. **maxDebugSnippetLength** (default: 1000) - If the rendering request is longer than this, it will be truncated in exception and logging messages.
 1. **supportModules** - (default: `env.RENDERER_SUPPORT_MODULES || null`) - If set to true, `supportModules` enables the server-bundle code to call a default set of NodeJS global objects and functions that get added to the VM context: 
 `{ Buffer, TextDecoder, TextEncoder, URLSearchParams, ReadableStream, process, setTimeout, setInterval, setImmediate, clearTimeout, clearInterval, clearImmediate, queueMicrotask }`.
diff --git a/react_on_rails_pro/docs/node-renderer/troubleshooting.md b/react_on_rails_pro/docs/node-renderer/troubleshooting.md
@@ -1,3 +1,5 @@
 # Node Renderer Troubleshooting
 
 - If you enabled restarts (having `allWorkersRestartInterval` and `delayBetweenIndividualWorkerRestarts`), you should set it with a high number to avoid the app from crashing because all Node renderer workers are stopped/killed.
+
+- If your app contains streamed pages that take too much time to be streamed to the client, ensure to not set the `gracefulWorkerRestartTimeout` parameter or set to a high number, so the worker is not killed while it's still serving an active request.
diff --git a/react_on_rails_pro/packages/node-renderer/src/master.ts b/react_on_rails_pro/packages/node-renderer/src/master.ts
@@ -19,7 +19,12 @@ export = function masterRun(runningConfig?: Partial<Config>) {
 
   // Store config in app state. From now it can be loaded by any module using getConfig():
   const config = buildConfig(runningConfig);
-  const { workersCount, allWorkersRestartInterval, delayBetweenIndividualWorkerRestarts } = config;
+  const {
+    workersCount,
+    allWorkersRestartInterval,
+    delayBetweenIndividualWorkerRestarts,
+    gracefulWorkerRestartTimeout,
+  } = config;
 
   logSanitizedConfig();
 
@@ -48,9 +53,15 @@ export = function masterRun(runningConfig?: Partial<Config>) {
       allWorkersRestartInterval,
       delayBetweenIndividualWorkerRestarts,
     );
-    setInterval(() => {
-      restartWorkers(delayBetweenIndividualWorkerRestarts);
-    }, allWorkersRestartInterval * MILLISECONDS_IN_MINUTE);
+
+    const allWorkersRestartIntervalMS = allWorkersRestartInterval * MILLISECONDS_IN_MINUTE;
+    const scheduleWorkersRestart = () => {
+      void restartWorkers(delayBetweenIndividualWorkerRestarts, gracefulWorkerRestartTimeout).finally(() => {
+        setTimeout(scheduleWorkersRestart, allWorkersRestartIntervalMS);
+      });
+    };
+
+    setTimeout(scheduleWorkersRestart, allWorkersRestartIntervalMS);
   } else if (allWorkersRestartInterval || delayBetweenIndividualWorkerRestarts) {
     log.error(
       "Misconfiguration, please provide both 'allWorkersRestartInterval' and " +
diff --git a/react_on_rails_pro/packages/node-renderer/src/master/restartWorkers.ts b/react_on_rails_pro/packages/node-renderer/src/master/restartWorkers.ts
@@ -5,6 +5,7 @@
 
 import cluster from 'cluster';
 import log from '../shared/log';
+import { SHUTDOWN_WORKER_MESSAGE } from '../shared/utils';
 
 const MILLISECONDS_IN_MINUTE = 60000;
 
@@ -14,26 +15,47 @@ declare module 'cluster' {
   }
 }
 
-export = function restartWorkers(delayBetweenIndividualWorkerRestarts: number) {
+export = async function restartWorkers(
+  delayBetweenIndividualWorkerRestarts: number,
+  gracefulWorkerRestartTimeout: number | undefined,
+) {
   log.info('Started scheduled restart of workers');
 
-  let delay = 0;
   if (!cluster.workers) {
     throw new Error('No workers to restart');
   }
-  Object.values(cluster.workers).forEach((worker) => {
-    const killWorker = () => {
-      if (!worker) return;
-      log.debug('Kill worker #%d', worker.id);
-      // eslint-disable-next-line no-param-reassign -- necessary change
-      worker.isScheduledRestart = true;
-      worker.destroy();
-    };
-    setTimeout(killWorker, delay);
-    delay += delayBetweenIndividualWorkerRestarts * MILLISECONDS_IN_MINUTE;
-  });
-
-  setTimeout(() => {
-    log.info('Finished scheduled restart of workers');
-  }, delay);
+  for (const worker of Object.values(cluster.workers).filter((w) => !!w)) {
+    log.debug('Kill worker #%d', worker.id);
+    worker.isScheduledRestart = true;
+
+    worker.send(SHUTDOWN_WORKER_MESSAGE);
+
+    // It's inteded to restart worker in sequence, it shouldn't happens in parallel
+    // eslint-disable-next-line no-await-in-loop
+    await new Promise<void>((resolve) => {
+      let timeout: NodeJS.Timeout;
+
+      const onExit = () => {
+        clearTimeout(timeout);
+        resolve();
+      };
+      worker.on('exit', onExit);
+
+      // Zero means no timeout
+      if (gracefulWorkerRestartTimeout) {
+        timeout = setTimeout(() => {
+          log.debug('Worker #%d timed out, forcing kill it', worker.id);
+          worker.destroy();
+          worker.off('exit', onExit);
+          resolve();
+        }, gracefulWorkerRestartTimeout);
+      }
+    });
+    // eslint-disable-next-line no-await-in-loop
+    await new Promise((resolve) => {
+      setTimeout(resolve, delayBetweenIndividualWorkerRestarts * MILLISECONDS_IN_MINUTE);
+    });
+  }
+
+  log.info('Finished scheduled restart of workers');
 };
diff --git a/react_on_rails_pro/packages/node-renderer/src/shared/configBuilder.ts b/react_on_rails_pro/packages/node-renderer/src/shared/configBuilder.ts
@@ -58,6 +58,9 @@ export interface Config {
   allWorkersRestartInterval: number | undefined;
   // Time in minutes between each worker restarting when restarting all workers
   delayBetweenIndividualWorkerRestarts: number | undefined;
+  // Time in seconds to wait for worker to restart before killing it
+  // Set it to 0 or undefined to never kill the worker
+  gracefulWorkerRestartTimeout: number | undefined;
   // If the rendering request is longer than this, it will be truncated in exception and logging messages
   maxDebugSnippetLength: number;
   // @deprecated See https://www.shakacode.com/react-on-rails-pro/docs/node-renderer/error-reporting-and-tracing.
@@ -165,6 +168,10 @@ const defaultConfig: Config = {
     ? parseInt(env.RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS, 10)
     : undefined,
 
+  gracefulWorkerRestartTimeout: env.GRACEFUL_WORKER_RESTART_TIMEOUT
+    ? parseInt(env.GRACEFUL_WORKER_RESTART_TIMEOUT, 10)
+    : undefined,
+
   maxDebugSnippetLength: MAX_DEBUG_SNIPPET_LENGTH,
 
   // default to true if empty, otherwise it is set to false
@@ -195,6 +202,8 @@ function envValuesUsed() {
     RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS:
       !userConfig.delayBetweenIndividualWorkerRestarts &&
       env.RENDERER_DELAY_BETWEEN_INDIVIDUAL_WORKER_RESTARTS,
+    GRACEFUL_WORKER_RESTART_TIMEOUT:
+      !userConfig.gracefulWorkerRestartTimeout && env.GRACEFUL_WORKER_RESTART_TIMEOUT,
     INCLUDE_TIMER_POLYFILLS: !('includeTimerPolyfills' in userConfig) && env.INCLUDE_TIMER_POLYFILLS,
     REPLAY_SERVER_ASYNC_OPERATION_LOGS:
       !userConfig.replayServerAsyncOperationLogs && env.REPLAY_SERVER_ASYNC_OPERATION_LOGS,
@@ -209,6 +218,7 @@ function sanitizedSettings(aConfig: Partial<Config> | undefined, defaultValue?:
         password: aConfig.password != null ? '<MASKED>' : defaultValue,
         allWorkersRestartInterval: aConfig.allWorkersRestartInterval || defaultValue,
         delayBetweenIndividualWorkerRestarts: aConfig.delayBetweenIndividualWorkerRestarts || defaultValue,
+        gracefulWorkerRestartTimeout: aConfig.gracefulWorkerRestartTimeout || defaultValue,
       }
     : {};
 }
diff --git a/react_on_rails_pro/packages/node-renderer/src/shared/utils.ts b/react_on_rails_pro/packages/node-renderer/src/shared/utils.ts
@@ -11,6 +11,8 @@ import type { RenderResult } from '../worker/vm';
 
 export const TRUNCATION_FILLER = '\n... TRUNCATED ...\n';
 
+export const SHUTDOWN_WORKER_MESSAGE = 'NODE_RENDERER_SHUTDOWN_WORKER';
+
 export function workerIdLabel() {
   // eslint-disable-next-line @typescript-eslint/no-unnecessary-condition -- worker is nullable in the primary process
   return cluster?.worker?.id || 'NO WORKER ID';
diff --git a/react_on_rails_pro/packages/node-renderer/src/worker.ts b/react_on_rails_pro/packages/node-renderer/src/worker.ts
@@ -17,6 +17,7 @@ import type { FastifyInstance, FastifyReply, FastifyRequest } from './worker/typ
 import checkProtocolVersion from './worker/checkProtocolVersionHandler';
 import authenticate from './worker/authHandler';
 import { handleRenderRequest, type ProvidedNewBundle } from './worker/handleRenderRequest';
+import handleGracefulShutdown from './worker/handleGracefulShutdown';
 import {
   errorResponseResult,
   formatExceptionMessage,
@@ -127,6 +128,8 @@ export default function run(config: Partial<Config>) {
     ...fastifyServerOptions,
   });
 
+  handleGracefulShutdown(app);
+
   // We shouldn't have unhandled errors here, but just in case
   app.addHook('onError', (req, res, err, done) => {
     // Not errorReporter.error so that integrations can decide how to log the errors.
diff --git a/react_on_rails_pro/packages/node-renderer/src/worker/handleGracefulShutdown.ts b/react_on_rails_pro/packages/node-renderer/src/worker/handleGracefulShutdown.ts
@@ -0,0 +1,49 @@
+import cluster from 'cluster';
+import { FastifyInstance } from './types';
+import { SHUTDOWN_WORKER_MESSAGE } from '../shared/utils';
+import log from '../shared/log';
+
+const handleGracefulShutdown = (app: FastifyInstance) => {
+  const { worker } = cluster;
+  if (!worker) {
+    log.error('handleGracefulShutdown is called on master, expected to call it on worker only');
+    return;
+  }
+
+  let activeRequestsCount = 0;
+  let isShuttingDown = false;
+
+  process.on('message', (msg) => {
+    if (msg === SHUTDOWN_WORKER_MESSAGE) {
+      log.debug('Worker #%d received graceful shutdown message', worker.id);
+      isShuttingDown = true;
+      if (activeRequestsCount === 0) {
+        log.debug('Worker #%d has no active requests, killing the worker', worker.id);
+        worker.destroy();
+      } else {
+        log.debug(
+          'Worker #%d has "%d" active requests, disconnecting the worker',
+          worker.id,
+          activeRequestsCount,
+        );
+        worker.disconnect();
+      }
+    }
+  });
+
+  app.addHook('onRequest', (_req, _reply, done) => {
+    activeRequestsCount += 1;
+    done();
+  });
+
+  app.addHook('onResponse', (_req, _reply, done) => {
+    activeRequestsCount -= 1;
+    if (isShuttingDown && activeRequestsCount === 0) {
+      log.debug('Worker #%d served all active requests and going to be killed', worker.id);
+      worker.destroy();
+    }
+    done();
+  });
+};
+
+export default handleGracefulShutdown;
diff --git a/react_on_rails_pro/spec/dummy/spec/system/integration_spec.rb b/react_on_rails_pro/spec/dummy/spec/system/integration_spec.rb
@@ -11,7 +11,7 @@ def change_text_expect_dom_selector(dom_selector, expect_no_change: false)
     find("input").set new_text
     within("h3") do
       if expect_no_change
-        expect(subject).to have_no_content new_text
+        expect(subject).not_to have_content new_text
       else
         expect(subject).to have_content new_text
       end
@@ -110,7 +110,7 @@ def change_text_expect_dom_selector(dom_selector, expect_no_change: false)
   it "changes name in message according to input" do
     visit "/client_side_hello_world"
     change_text_expect_dom_selector("#HelloWorld-react-component-0")
-    click_link "Hello World Component Server Rendered, with extra options" # rubocop:disable Capybara/ClickLinkOrButtonStyle
+    click_link "Hello World Component Server Rendered, with extra options"
     change_text_expect_dom_selector("#my-hello-world-id")
   end
 end
@@ -174,19 +174,19 @@ def change_text_expect_dom_selector(dom_selector, expect_no_change: false)
 
   before do
     visit "/"
-    click_link "React Router" # rubocop:disable Capybara/ClickLinkOrButtonStyle
+    click_link "React Router"
   end
 
   context "when rendering /react_router" do
     it { is_expected.to have_text("Woohoo, we can use react-router here!") }
 
     it "clicking links correctly renders other pages" do
-      click_link "Router First Page" # rubocop:disable Capybara/ClickLinkOrButtonStyle
+      click_link "Router First Page"
       expect(page).to have_current_path("/react_router/first_page")
       first_page_header_text = page.find(:css, "h2#first-page").text
       expect(first_page_header_text).to eq("React Router First Page")
 
-      click_link "Router Second Page" # rubocop:disable Capybara/ClickLinkOrButtonStyle
+      click_link "Router Second Page"
       expect(page).to have_current_path("/react_router/second_page")
       second_page_header_text = page.find(:css, "h2#second-page").text
       expect(second_page_header_text).to eq("React Router Second Page")
@@ -239,12 +239,12 @@ def change_text_expect_dom_selector(dom_selector, expect_no_change: false)
   end
 end
 
-describe "Manual client hydration", :js do
+describe "Manual client hydration", :js, type: :system do
   before { visit "/xhr_refresh" }
 
   it "HelloWorldRehydratable onChange should trigger" do
     within("form") do
-      click_button "refresh" # rubocop:disable Capybara/ClickLinkOrButtonStyle
+      click_button "refresh"
     end
     within("#HelloWorldRehydratable-react-component-1") do
       find("input").set "Should update"
@@ -396,39 +396,38 @@ def change_text_expect_dom_selector(dom_selector, expect_no_change: false)
 
   it "hydrates the component" do
     visit path
-    expect(page.html).to include("client-bundle.js")
+    expect(page.html).to match(/client-bundle[^\"]*.js/)
     change_text_expect_dom_selector(selector)
   end
 
   it "renders the page completely on server and displays content on client even without JavaScript" do
     # Don't add client-bundle.js to the page to ensure that the app is not hydrated
     visit "#{path}?skip_js_packs=true"
-    expect(page.html).not_to include("client-bundle.js")
+    expect(page.html).not_to match(/client-bundle[^\"]*.js/)
     # Ensure that the component state is not updated
     change_text_expect_dom_selector(selector, expect_no_change: true)
 
-    expect(page).to have_no_text "Loading branch1"
-    expect(page).to have_no_text "Loading branch2"
-    expect(page).to have_no_text(/Loading branch1 at level \d+/)
+    expect(page).not_to have_text "Loading branch1"
+    expect(page).not_to have_text "Loading branch2"
+    expect(page).not_to have_text(/Loading branch1 at level \d+/)
     expect(page).to have_text(/branch1 \(level \d+\)/, count: 5)
   end
 
   it "doesn't hydrate status component if packs are not loaded" do
     # visit waits for the page to load, so we ensure that the page is loaded before checking the hydration status
     visit "#{path}?skip_js_packs=true"
     expect(page).to have_text "HydrationStatus: Streaming server render"
-    expect(page).to have_no_text "HydrationStatus: Hydrated"
-    expect(page).to have_no_text "HydrationStatus: Page loaded"
+    expect(page).not_to have_text "HydrationStatus: Hydrated"
+    expect(page).not_to have_text "HydrationStatus: Page loaded"
   end
 end
 
-describe "Pages/stream_async_components_for_testing", :js,
-         skip: "Flaky test replaced by Playwright E2E tests in e2e-tests/streaming.spec.ts" do
+describe "Pages/stream_async_components_for_testing", :js do
   it_behaves_like "streamed component tests", "/stream_async_components_for_testing",
                   "#AsyncComponentsTreeForTesting-react-component-0"
 end
 
-describe "React Router Sixth Page", :js, skip: "Work in progress in another branch: justin808/fix-body-dup-retry" do
+describe "React Router Sixth Page", :js do
   it_behaves_like "streamed component tests", "/server_router/streaming-server-component",
                   "#ServerComponentRouter-react-component-0"
 end

Original file line number	Diff line number	Diff line change
`@@ -129,6 +129,9 @@ const config = {`
`129`	`129`	`// allWorkersRestartInterval: 15,`
`130`	`130`	`// time in minutes between each worker restarting when restarting all workers`
`131`	`131`	`// delayBetweenIndividualWorkerRestarts: 2,`
	`132`	`+ // Also, you can set he parameter gracefulWorkerRestartTimeout to force the worker to restart`
	`133`	`+ // If it's the time for the worker to restart, the worker waits until it serves all active requests before restarting`
	`134`	`+ // If a worker stuck because of a memory leakage or an infinite loop, you can set a timeout that master waits for it before killing the worker`
`132`	`135`	`}`
`133`	`136`
`134`	`137`	`// Renderer detects a total number of CPUs on virtual hostings like Heroku`