Description
Version
v22.14.0
Platform
Darwin macbook-JPF9H46PN7 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64
Subsystem
http
What steps will reproduce the bug?
Repository with reproduction - https://github.com/SuperOleg39/node-js-perf-problem-reproduction
About repo:
server.js
- a http server with:
/
,/test/
and/unknown/
endpoinds with cpu-heavy work emulation- 5 parallel request processing limit for this endpoints for event loop overload protection
/readyz
health check endpoint/metrics
metrics endpoint/
,/test/
and/unknown/
handlers wrapped insetImmediate
to split heavy tasks by different event loop tasks
server.test.js
- performance test case for this server:
autocannon
make a 1000 concurrent requests for 1 minute to server endpoints/
,/test/
and/unknown/
- in parallel run 5 seconds interval with request to
/readyz
For reproduction, need to run server.js
with different node.js versions, and run server.test.js
for every try (more in repository README.md)
How often does it reproduce? Is there a required condition?
It works correctly at 16, 18 and 19 Node.js versions, but not works at 20 and 22 versions.
What is the expected behavior? Why is that the expected behavior?
Event loop lag is small, incoming health check requests is processed fast, <500ms
What do you see instead?
Event loop lag is small, but incoming health check requests is processed slowly, >10s
Additional information
Hello!
We have a integration test for custom requests limiter, working on the principle of awesome under-pressure library - in summary with this requests limiter Node.js application can be responsive even under high load.
This test emulates a high load and endpoints with CPU-heavy work, and expects that fast requests to health checks or metrics will be processed almost immediately.
After update Node.js to v20 in our CI, we saw a failed test with very slow healthcheck responses.
I spend a few days to debug, and main results is:
- event loop is not overloaded, lag is less than 500ms (okay for this test)
- in CPU profile I can see microtasks between
setImmediate
calls - but incoming
/readyz
requests start to be processed after 10-20 seconds - problem somewere in core modules, because this lag starts from the request handler -
http.createServer(handler)
- and also deeper, event before internalserver.emit('request')
call. - if we split healthchecks and metrics endpoints to different http server in the same Node.js process, it respond fast in all Node.js versions - one more confirmation that there is some queue at the level of a particular http server