node-eventloop-watchdog

Node.js apps do not crash when they hang. They just stop responding.
node-eventloop-watchdog detects event loop stalls and can trigger recovery before production goes silent.

Why This Exists

Most Node monitoring tells you the event loop is slow. That is useful, but it does not answer the production question:

If the event loop is blocked, then what happens?

node-eventloop-watchdog is a small production safety layer for that exact moment. It can log, emit events, call your handler, post a webhook, exit, or terminate a stuck process so a supervisor such as Kubernetes, systemd, PM2, Docker, or a platform runtime can restart it.

What Makes It Different

| Tool category | What it usually does | Limitation / notes |
| --- | --- | --- |
| Event loop metrics | Tracks lag, averages, percentiles | Tells you something is wrong, but does not act |
| Native watchdogs | Kill or supervise the process | Often require native dependencies or separate setup |
| Simple timers | Detect lag after the loop resumes | Cannot handle a loop that never comes back |
| `node-eventloop-watchdog` | Detects stalls, adds context, and can act | Zero runtime dependencies, opt-in recovery |
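The "Simple timers" row follows from how timer-based lag measurement works: a timer is scheduled every N ms, and lag is how late it actually fires. Below is a minimal sketch of that idea, not this package's implementation. `createLagSampler` and `startLagMonitor` are hypothetical names, and the clock is injectable so the arithmetic can be checked without real timers.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical helper: returns a tick() that reports how late each tick fired (ms).
// `now` is injectable so the arithmetic can be checked with a fake clock.
function createLagSampler(intervalMs, now = () => performance.now()) {
  let expected = now() + intervalMs;
  return function tick() {
    const actual = now();
    const lag = Math.max(0, actual - expected); // how far past its due time the tick ran
    expected = actual + intervalMs; // the next tick is due one interval after this one
    return lag;
  };
}

// Drives the sampler from a real interval timer. The callback can only run once the
// loop is free, so lag is reported after a block ends, never during it.
function startLagMonitor(intervalMs, onLag) {
  const tick = createLagSampler(intervalMs);
  const timer = setInterval(() => onLag(tick()), intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer);
}
```

Because the measurement happens inside the timer callback, a loop that never resumes produces no sample at all. Node's built-in `perf_hooks.monitorEventLoopDelay()` histogram is also sampled from the loop and shares the same after-the-fact limitation.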

Ecosystem

node-eventloop-watchdog is part of a small Node.js observability ecosystem you can adopt independently or together:

node-actuator-lite — Spring Boot-style /actuator/health, /info, /metrics, /env, /threaddump, /heapdump, and /prometheus endpoints.
node-eventloop-watchdog — Detects event-loop stalls, captures stack traces and hotspots, and triggers recovery.
node-request-trace — Per-request timelines, browser dashboard, and CLI without OpenTelemetry.

When all three are installed:

This watchdog automatically registers /actuator/eventloop, /actuator/eventloop/history, /actuator/eventloop/hotspots, and /actuator/eventloop/metrics under node-actuator-lite.
Block events include the active request id, route, and method captured by node-request-trace.

Runnable example: node-actuator-lite/examples/ecosystem.

Quickest setup: use node-observability-lite to wire the three packages together with production-safe presets in a single call.
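const express = require('express');
const observability = require('node-observability-lite');

const app = express();
// Gate the ops endpoints behind a bearer token taken from the environment.
observability.express(app, {
  preset: 'production',
  auth: req => req.get('authorization') === `Bearer ${process.env.OPS_TOKEN}`,
});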

Install

npm install node-eventloop-watchdog

CommonJS and bundled TypeScript declarations are included.

const watchdog = require('node-eventloop-watchdog');

Quick Start: Observe Mode

Use start() when you want safe, backwards-compatible monitoring. It logs event loop blocks and keeps history, metrics, hotspots, and request context without changing process lifecycle behavior.

const watchdog = require('node-eventloop-watchdog');

watchdog.start();

When a block crosses the threshold, you get a structured event:

[node-eventloop-watchdog] [WARN] Event Loop Blocked
  Duration: 142ms
  Severity: warning
  Threshold: 50ms
  Action: log
  Route: POST /checkout

  Suspected Blocking Operation
  JSON.stringify

  Location
  checkoutService.js:84

Production Mode: Protect

Use protect() when you want opinionated production behavior. It enables recovery defaults designed for apps already managed by a process supervisor.

const watchdog = require('node-eventloop-watchdog');

watchdog.protect();

Default protection behavior:

| Trigger | Default action |
| --- | --- |
| Event loop lag >= `100ms` | Log a warning, record metrics, emit a `block` event |
| Event loop lag >= `500ms` | Mark the event critical and terminate with `SIGTERM` |
| Main event loop does not resume within `1000ms` | Worker-backed hard watchdog terminates the process with `SIGTERM` |
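A minimal sketch of how a measured lag could map onto these severities, assuming the `>=` comparisons shown in the table. `classifyLag` is a hypothetical helper, not an export of this package.

```javascript
// Hypothetical helper: maps a measured lag (ms) onto a severity.
// Defaults match the protect() thresholds in the table: 100ms and 500ms.
function classifyLag(lagMs, { warningThreshold = 100, criticalThreshold = 500 } = {}) {
  if (lagMs >= criticalThreshold) return 'critical';
  if (lagMs >= warningThreshold) return 'warning';
  return null; // below both thresholds: not reported as a block
}

classifyLag(80);  // returns null
classifyLag(120); // returns 'warning'
classifyLag(650); // returns 'critical'
```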

The intended production pattern is simple: the watchdog terminates the unhealthy process, and your supervisor restarts it.

watchdog.protect({
  recovery: {
    action: 'kill',
    signal: 'SIGTERM',
    hardTimeout: 1000
  }
});
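With `kill`, the restart is delegated to your supervisor, so the app itself decides how to handle `SIGTERM`. Below is a minimal graceful-shutdown sketch using only Node's built-in `http` module; `installGracefulShutdown` and the 5-second grace period are illustrative assumptions. Two Node.js behaviors matter here: signal listeners run on the main event loop, so this handler only runs while the loop is responsive, and once a `SIGTERM` listener is installed, Node no longer exits on that signal by default.

```javascript
const http = require('node:http');

const server = http.createServer((req, res) => res.end('ok'));

// On SIGTERM: stop accepting new connections, let in-flight ones finish, then exit.
// This listener runs on the main event loop, so it cannot fire while the loop is blocked.
function installGracefulShutdown(srv, graceMs = 5000) {
  process.once('SIGTERM', () => {
    // Fallback: force exit if open connections keep close() from completing in time.
    setTimeout(() => process.exit(1), graceMs).unref();
    srv.close(() => process.exit(0));
  });
}

installGracefulShutdown(server);
```

Call `server.listen(...)` as usual after installing the handler.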

Brutal Demo

This demo intentionally freezes the main event loop forever. A normal timer-based monitor cannot recover from this because the timer callback never runs. protect() also starts a worker-backed hard watchdog, so the process can still be terminated.

node examples/brutal-demo.js

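const watchdog = require('node-eventloop-watchdog');

watchdog.protect({
  criticalThreshold: 100,
  recovery: {
    enabled: true,
    action: 'kill',
    hardTimeout: 500, // terminate if the main loop stays stuck for 500ms
    signal: 'SIGTERM'
  }
});

console.log('Watchdog armed. This process will freeze in 2 seconds.');
console.log('Expected result: the hard watchdog logs the stall and terminates the process.');

setTimeout(() => {
  // Deliberately block the main thread forever; no timer callback can run after this.
  while (true) {}
}, 2000);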

Expected output:

Watchdog armed. This process will freeze in 2 seconds.
Expected result: the hard watchdog logs the stall and terminates the process.
[node-eventloop-watchdog] [ERROR] Event loop hard-stalled for 500ms. Action: kill
Terminated: 15

Trigger To Action

You can choose the action that matches your runtime:

| Action | What happens | Good for |
| --- | --- | --- |
| `log` | Record and log the event only | Local dev, dashboards, low-risk rollout |
| `callback` | Call `recovery.handler(event)` | Custom alerting or diagnostics |
| `webhook` | POST the event as JSON to `recovery.webhookUrl` | Alertmanager, incident bots, automation |
| `exit` | Stop the monitor and call `process.exit(exitCode)` | Graceful process-manager restart |
| `kill` | Send `recovery.signal` to the process | Kubernetes, systemd, PM2, Docker restart |
| `abort` | Hard watchdog aborts the process | Core dumps and severe failure analysis |
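A table like this boils down to a dispatch on `recovery.action`, gated by `recovery.enabled` and `recovery.minSeverity`. The following is a minimal sketch under that assumption, not this package's code. `maybeRecover`, `runAction`, and the injected `sys` object are hypothetical, `abort` is left out because the table assigns it to the hard watchdog, and the `webhook` path relies on the global `fetch` available in Node.js 18 and later.

```javascript
// Process-level side effects, injectable so the dispatch can be exercised safely.
const defaultSys = {
  log: (msg) => console.warn(msg),
  post: (url, body) =>
    fetch(url, { method: 'POST', headers: { 'content-type': 'application/json' }, body })
      .catch((err) => console.error('webhook failed:', err.message)),
  exit: (code) => process.exit(code),
  kill: (signal) => process.kill(process.pid, signal),
};

const SEVERITY_RANK = { warning: 1, critical: 2 };

// Runs the configured action only when recovery is on and the event is severe enough.
function maybeRecover(event, recovery, sys = defaultSys) {
  if (!recovery || !recovery.enabled) return false;
  const rank = SEVERITY_RANK[event.severity] ?? 0;
  const required = SEVERITY_RANK[recovery.minSeverity ?? 'critical'] ?? SEVERITY_RANK.critical;
  if (rank < required) return false;
  runAction(event, recovery, sys);
  return true;
}

function runAction(event, recovery, sys) {
  switch (recovery.action ?? 'log') {
    case 'log':
      return sys.log(`Event loop blocked for ${event.duration}ms`);
    case 'callback':
      return recovery.handler(event);
    case 'webhook':
      return sys.post(recovery.webhookUrl, JSON.stringify(event));
    case 'exit':
      return sys.exit(recovery.exitCode ?? 1);
    case 'kill':
      return sys.kill(recovery.signal ?? 'SIGTERM');
    default:
      throw new Error(`Unsupported action: ${recovery.action}`);
  }
}
```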

watchdog.start({
  warningThreshold: 100,
  criticalThreshold: 500,
  recovery: {
    enabled: true,
    minSeverity: 'critical',
    action: 'webhook',
    webhookUrl: 'https://alerts.example.com/event-loop-block'
  }
});

watchdog.start({
  recovery: {
    enabled: true,
    action: 'callback',
    handler(event) {
      // pagerDuty stands in for your own alerting client; it is not part of this package.
      pagerDuty.alert({
        summary: `Event loop blocked for ${event.duration}ms`,
        route: event.request?.route,
        location: event.location
      });
    }
  }
});

Real Problems This Solves

Infinite loops that leave a Node process alive but useless.
CPU-heavy synchronous code blocking requests.
Large JSON serialization or parsing on hot paths.
Synchronous filesystem, crypto, compression, or child-process calls in request handlers.
Stuck production servers that pass process liveness checks but stop serving traffic.
Incidents where you need recent block history, request correlation, and likely hotspots after recovery.

API

`watchdog.start(config?)`

Starts observe mode. This is the safest default for adding visibility without changing process lifecycle behavior.

watchdog.start({
  warningThreshold: 50,
  criticalThreshold: 100,
  captureStackTrace: true,
  historySize: 50,
  enableMetrics: true,
  detectBlockingPatterns: true,
  checkInterval: 20,
  logLevel: 'warn',
  jsonLogs: false,
  onBlock: null,
  recovery: false
});

`watchdog.protect(config?)`

Starts protect mode with opinionated recovery defaults.

watchdog.protect({
  warningThreshold: 100,
  criticalThreshold: 500,
  recovery: {
    action: 'kill',
    hardTimeout: 1000,
    signal: 'SIGTERM'
  }
});

`watchdog.stop()`

Stops monitoring and disables the hard watchdog worker.

`watchdog.on('block', listener)`

Subscribe to block events.

watchdog.on('block', (event) => {
  console.log(event.duration, event.severity, event.action.type);
});

`watchdog.getStats()`

Returns runtime state, lag metrics, memory snapshot, and active mode.

watchdog.getStats();
// {
//   avgLag: 12,
//   maxLag: 121,
//   minLag: 1,
//   totalBlocks: 14,
//   blocksLastMinute: 6,
//   running: true,
//   config: { mode: 'protect', warningThreshold: 100, criticalThreshold: 500, recoveryAction: 'kill' },
//   memory: { heapUsed: 42, heapTotal: 64, rss: 91, external: 2, arrayBuffers: 1 }
// }
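The lag fields returned by `getStats()` are aggregates over observed samples. A minimal sketch of one way to compute them, assuming every lag sample is recorded, a sample at or above the warning threshold counts as a block, and all times are in ms. `LagStats` is hypothetical and not this package's internals.

```javascript
// Hypothetical aggregator behind stats like avgLag, maxLag, minLag, and block counts.
class LagStats {
  constructor(warningThreshold = 50) {
    this.warningThreshold = warningThreshold;
    this.samples = 0;
    this.sum = 0;
    this.max = 0;
    this.min = Infinity;
    this.totalBlocks = 0;
    this.recentBlockTimes = []; // timestamps (ms) of blocks, oldest first
  }

  record(lagMs, at = Date.now()) {
    this.samples += 1;
    this.sum += lagMs;
    this.max = Math.max(this.max, lagMs);
    this.min = Math.min(this.min, lagMs);
    if (lagMs >= this.warningThreshold) {
      this.totalBlocks += 1;
      this.recentBlockTimes.push(at);
    }
    this.prune(at);
  }

  // Forget block timestamps older than one minute.
  prune(now) {
    const cutoff = now - 60_000;
    while (this.recentBlockTimes.length > 0 && this.recentBlockTimes[0] <= cutoff) {
      this.recentBlockTimes.shift();
    }
  }

  snapshot(now = Date.now()) {
    this.prune(now);
    return {
      avgLag: this.samples ? Math.round(this.sum / this.samples) : 0,
      maxLag: this.max,
      minLag: this.samples ? this.min : 0,
      totalBlocks: this.totalBlocks,
      blocksLastMinute: this.recentBlockTimes.length,
    };
  }
}
```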

`watchdog.getRecentBlocks(count?)`

Returns the most recent blocking events.
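The history behind `getRecentBlocks()` is capped by `historySize`. A minimal sketch of such a capped buffer, assuming the oldest event is dropped once the cap is reached and that results come back newest first (the ordering is an assumption). `BlockHistory` is hypothetical.

```javascript
// Hypothetical bounded history: keeps at most `size` events and evicts the oldest.
class BlockHistory {
  constructor(size = 50) {
    this.size = size;
    this.events = [];
  }

  push(event) {
    this.events.push(event);
    // shift() is O(n), which is fine for a few dozen entries.
    if (this.events.length > this.size) this.events.shift();
  }

  // The most recent `count` events, newest first.
  recent(count = this.size) {
    if (count <= 0) return [];
    return this.events.slice(-count).reverse();
  }
}
```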

`watchdog.getBlockingHotspots(limit?)`

Returns best-effort user-code locations captured when blocks were detected.

watchdog.getBlockingHotspots();
// [
//   { file: 'reportService.js', line: 142, blocks: 18, maxLag: 221, avgLag: 145 },
//   { file: 'orderController.js', line: 51, blocks: 7, maxLag: 94, avgLag: 62 }
// ]
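Each hotspot row summarizes the blocks attributed to one file and line. A minimal sketch of that aggregation, assuming each input event has a `duration` in ms and a `location` of the hypothetical shape `{ file, line }`, with rows ranked by block count and then by max lag. `rankHotspots` is not part of this package's API.

```javascript
// Hypothetical aggregation of block events into hotspot rows.
function rankHotspots(events, limit = 10) {
  const byLocation = new Map();
  for (const { location, duration } of events) {
    if (!location) continue; // attribution is best-effort; skip events without one
    const key = `${location.file}:${location.line}`;
    const row = byLocation.get(key) ??
      { file: location.file, line: location.line, blocks: 0, maxLag: 0, totalLag: 0 };
    row.blocks += 1;
    row.maxLag = Math.max(row.maxLag, duration);
    row.totalLag += duration;
    byLocation.set(key, row);
  }
  return [...byLocation.values()]
    .map(({ totalLag, ...row }) => ({ ...row, avgLag: Math.round(totalLag / row.blocks) }))
    .sort((a, b) => b.blocks - a.blocks || b.maxLag - a.maxLag)
    .slice(0, limit);
}
```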

`watchdog.middleware()`

Returns Connect / Express-style middleware for request correlation.

const express = require('express');
const watchdog = require('node-eventloop-watchdog');

const app = express();

watchdog.start();
// Register before your routes so each request is tracked while it is handled.
app.use(watchdog.middleware());

app.post('/checkout', (req, res) => {
  res.json({ ok: true });
});

app.listen(3000);
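Request correlation means knowing which requests were in flight when a block was detected. Because a timer-based detector runs in its own callback after the block ends, one simple approach is a shared registry of in-flight requests rather than per-request async context. This is a sketch under stated assumptions: `correlationMiddleware`, `suspectedRequest`, and the "most recently started request" heuristic are hypothetical, not this package's `middleware()` implementation.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical registry of requests that have started but not yet finished.
const inFlight = new Map();

// Connect/Express-style middleware: register on entry, unregister when the response ends.
function correlationMiddleware(req, res, next) {
  const id = randomUUID();
  inFlight.set(id, { id, method: req.method, url: req.originalUrl ?? req.url, startedAt: Date.now() });
  const done = () => inFlight.delete(id);
  res.once('finish', done);
  res.once('close', done);
  next();
}

// Called when a block is detected. A rough heuristic: blame the most recently started request.
function suspectedRequest() {
  let latest = null;
  for (const entry of inFlight.values()) {
    if (!latest || entry.startedAt >= latest.startedAt) latest = entry;
  }
  return latest;
}
```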

Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | `'observe' \| 'protect'` | `'observe'` | Runtime posture; `protect()` runs in `'protect'` mode |
| `warningThreshold` | number | `50` | Lag in ms before a warning event |
| `criticalThreshold` | number | `100` | Lag in ms before a critical event |
| `captureStackTrace` | boolean | `true` | Capture best-effort stack context |
| `historySize` | number | `50` | Maximum number of blocking events retained |
| `enableMetrics` | boolean | `true` | Collect lag and memory metrics |
| `detectBlockingPatterns` | boolean | `true` | Identify likely synchronous blocking patterns |
| `checkInterval` | number | `20` | Poll interval in ms |
| `logLevel` | string | `'warn'` | `debug`, `info`, `warn`, `error`, or `silent` |
| `jsonLogs` | boolean | `false` | Emit JSON logs |
| `onBlock` | function | `null` | Callback invoked for every block event |
| `recovery.enabled` | boolean | `false` | Enable recovery actions |
| `recovery.action` | string | `'log'` | `log`, `callback`, `webhook`, `exit`, `kill`, or `abort` |
| `recovery.minSeverity` | string | `'critical'` | Minimum severity (`warning` or `critical`) before the action runs |
| `recovery.hardTimeout` | number | `0` | Worker-backed timeout in ms for stalls that never return; `0` leaves it disabled |
| `recovery.signal` | string | `'SIGTERM'` | Signal sent by the `kill` action |
| `recovery.exitCode` | number | `1` | Exit code for the `exit` action |
| `recovery.webhookUrl` | string | `null` | URL for the `webhook` action |
| `recovery.handler` | function | `null` | Function for the `callback` action |
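The table describes flat options plus a nested `recovery` object, and the `start()` example above also passes `recovery: false`. A minimal sketch of merging user options onto these defaults; `resolveConfig`, the handling of `recovery: false`, and the validation checks are assumptions for illustration rather than documented behavior.

```javascript
// Defaults taken from the configuration table.
const DEFAULTS = {
  mode: 'observe',
  warningThreshold: 50,
  criticalThreshold: 100,
  captureStackTrace: true,
  historySize: 50,
  enableMetrics: true,
  detectBlockingPatterns: true,
  checkInterval: 20,
  logLevel: 'warn',
  jsonLogs: false,
  onBlock: null,
  recovery: {
    enabled: false,
    action: 'log',
    minSeverity: 'critical',
    hardTimeout: 0,
    signal: 'SIGTERM',
    exitCode: 1,
    webhookUrl: null,
    handler: null,
  },
};

// Hypothetical resolver: shallow-merges top-level options, deep-merges `recovery`.
function resolveConfig(user = {}) {
  const recoveryInput = user.recovery && typeof user.recovery === 'object' ? user.recovery : {};
  const config = { ...DEFAULTS, ...user, recovery: { ...DEFAULTS.recovery, ...recoveryInput } };

  if (config.criticalThreshold < config.warningThreshold) {
    throw new RangeError('criticalThreshold must be >= warningThreshold');
  }
  if (config.recovery.action === 'webhook' && !config.recovery.webhookUrl) {
    throw new TypeError('recovery.webhookUrl is required for the webhook action');
  }
  if (config.recovery.action === 'callback' && typeof config.recovery.handler !== 'function') {
    throw new TypeError('recovery.handler must be a function for the callback action');
  }
  return config;
}
```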

Blocking Pattern Hints

The watchdog looks for common synchronous patterns in captured stack context:

| Pattern | Category |
| --- | --- |
| `JSON.stringify` / `JSON.parse` | Serialization |
| `fs.readFileSync`, `fs.writeFileSync`, etc. | Sync filesystem |
| `crypto.pbkdf2Sync`, `crypto.scryptSync`, `crypto.createHash` | Sync crypto |
| `zlib.*Sync` | Sync compression |
| `child_process.execSync`, `spawnSync` | Sync child process |
| `RegExp.exec` | Regex backtracking |
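Pattern hints come down to matching frame text in a captured stack against known synchronous APIs. A minimal sketch, assuming plain regular-expression matching over `Error().stack` text; the patterns mirror the table, but `detectBlockingPattern` and its return shape are hypothetical. V8 frames usually show the bare function name, such as `readFileSync`, rather than the `fs.` prefix, so the sketch matches on names.

```javascript
// Hypothetical pattern table mirroring the hints above, matched by function name.
const PATTERNS = [
  { re: /\bJSON\.(?:stringify|parse)\b/, category: 'Serialization' },
  { re: /\b(?:readFileSync|writeFileSync|appendFileSync|readdirSync|statSync)\b/, category: 'Sync filesystem' },
  { re: /\b(?:pbkdf2Sync|scryptSync|createHash)\b/, category: 'Sync crypto' },
  { re: /\b(?:gzipSync|gunzipSync|deflateSync|inflateSync|brotliCompressSync|brotliDecompressSync)\b/, category: 'Sync compression' },
  { re: /\b(?:execSync|execFileSync|spawnSync)\b/, category: 'Sync child process' },
  { re: /\bRegExp\.exec\b/, category: 'Regex backtracking' },
];

// Returns the first (innermost) frame that matches a known synchronous pattern.
function detectBlockingPattern(stack) {
  const frames = String(stack).split('\n').slice(1); // drop the "Error: ..." header line
  for (const frame of frames) {
    for (const { re, category } of PATTERNS) {
      const match = frame.match(re);
      if (match) return { category, operation: match[0], frame: frame.trim() };
    }
  }
  return null;
}
```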

Important Attribution Note

Timer-based lag detection runs after the event loop resumes. Stack traces, location, userFrame, and hotspots are therefore best-effort context captured around detection time, not guaranteed blame for the exact blocking line.

For a loop that never resumes, enable recovery.hardTimeout through protect() or explicit recovery config. The hard watchdog runs in a worker thread and can terminate the process even when the main event loop is permanently stuck.
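A worker thread has its own event loop, which is why it can keep watching while the main loop is stuck. A minimal sketch of that idea, assuming a heartbeat timestamp shared through a `SharedArrayBuffer`: the main thread stamps the time on an interval, and the worker flags the stall once the stamp is older than the timeout. `startHardWatchdog` and its options are hypothetical, not this package's implementation. By default the sketch only reports the stall; `action: 'kill'` sends `SIGTERM` to the whole process.

```javascript
const { Worker } = require('node:worker_threads');

// Worker source, evaluated with `eval: true`. The worker runs its own event loop,
// so its interval keeps firing while the main thread is blocked.
const WORKER_SOURCE = `
const { workerData } = require('node:worker_threads');
const fs = require('node:fs');
const { heartbeat, timeoutMs, action } = workerData;
setInterval(() => {
  const staleFor = Date.now() - Number(Atomics.load(heartbeat, 0));
  if (staleFor < timeoutMs) return;
  // Report each stall once: flip slot 1 from 0 to 1, or bail out if it is already set.
  if (Atomics.compareExchange(heartbeat, 1, 0n, 1n) !== 0n) return;
  // Worker console output is relayed by the (stuck) main thread, so write to fd 2 directly.
  fs.writeSync(2, 'hard watchdog: main loop stalled for ' + staleFor + 'ms\\n');
  if (action === 'kill') process.kill(process.pid, 'SIGTERM'); // signals the whole process
}, Math.max(10, Math.floor(timeoutMs / 4)));
`;

// Hypothetical helper: starts the main-thread heartbeat and the watchdog worker.
function startHardWatchdog({ timeoutMs = 1000, action = 'report' } = {}) {
  // Slot 0: last heartbeat (ms since epoch). Slot 1: stall flag (0n or 1n).
  const heartbeat = new BigInt64Array(new SharedArrayBuffer(16));
  const beat = () => Atomics.store(heartbeat, 0, BigInt(Date.now()));
  beat();
  const timer = setInterval(beat, Math.max(10, Math.floor(timeoutMs / 4)));
  timer.unref();

  const worker = new Worker(WORKER_SOURCE, { eval: true, workerData: { heartbeat, timeoutMs, action } });
  worker.unref(); // the watchdog alone should not keep the process alive

  return {
    worker,
    stalled: () => Atomics.load(heartbeat, 1) === 1n,
    stop: () => {
      clearInterval(timer);
      return worker.terminate();
    },
  };
}
```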

Integrations

JSON Logs

watchdog.start({ jsonLogs: true });

node-request-trace

If node-request-trace is installed, active request data is automatically attached to block events.

node-actuator-lite

If node-actuator-lite is installed, these endpoints are registered automatically:

| Endpoint | Description |
| --- | --- |
| `GET /actuator/eventloop` | Status, metrics, top hotspots |
| `GET /actuator/eventloop/history` | Recent blocking events |
| `GET /actuator/eventloop/hotspots` | Hotspot ranking |
| `GET /actuator/eventloop/metrics` | Lag and memory metrics |

Operational Guidance

Use start() first when rolling out to an existing app.
Use protect() when the app runs under a supervisor that restarts failed processes.
Keep hardTimeout comfortably above normal CPU spikes to avoid killing legitimate long work.
Prefer SIGTERM for graceful runtime restarts; use abort only when you need crash diagnostics.
If overhead matters, run npm run bench against a workload similar to yours.

Development

npm ci
npm run lint
npm run typecheck
npm test
npm run test:coverage:check

The CI gate requires at least 90% coverage across statements, branches, functions, and lines.

License

MIT
