Description
- Version: v8.2.1
- Platform: Linux ronomon.com 3.13.0-106-generic [stream] - callback based API additions #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
- Subsystem: child_process
spawn()
is not asynchronous while launching a child process.
I think this is a known issue but it took a few months to track down.
Our event loop was blocking frequently:
2017-08-18T12:18:03.935Z INFO Event loop blocked for 258ms.
2017-08-18T12:18:17.200Z INFO Event loop blocked for 223ms.
2017-08-18T12:18:45.417Z INFO Event loop blocked for 288ms.
2017-08-18T12:19:11.981Z INFO Event loop blocked for 296ms.
2017-08-18T12:19:39.142Z INFO Event loop blocked for 282ms.
2017-08-18T12:19:43.750Z INFO Event loop blocked for 329ms.
2017-08-18T12:19:48.655Z INFO Event loop blocked for 247ms.
2017-08-18T12:19:55.191Z INFO Event loop blocked for 266ms.
2017-08-18T12:20:52.000Z INFO Event loop blocked for 269ms.
2017-08-18T12:21:18.157Z INFO Event loop blocked for 294ms.
2017-08-18T12:21:20.388Z INFO Event loop blocked for 272ms.
2017-08-18T12:21:31.557Z INFO Event loop blocked for 276ms.
2017-08-18T12:21:32.529Z INFO Event loop blocked for 320ms.
2017-08-18T12:21:36.400Z INFO Event loop blocked for 307ms.
The system has many components and the RSS is around 14 GB. We had had issues with GC pause times so we moved everything off-heap and reduced the number of pointers. We may still have issues with GC pause times, but I thought it might also be something else.
In addition, over time, the majority of components were rewritten to run in the threadpool.
I thought it was finally down to deduplication chunking and hashing, but when this was recently moved to the threadpool, I took another look at a spawn routine which was calling out to clamdscan
(moving to a socket protocol was the long term plan but spawning clamdscan
was the first draft):
var stdout = '';
var stderr = '';
var child = Node.child.spawn('clamdscan', ['-']);
child.stdin.on('error', function(error) {}); // Fix ECONNRESET bug with Node.
child.stdin.write(buffer);
child.stdin.end();
child.stdout.on('data', function(data) { stdout += data; });
child.stderr.on('data', function(data) { stderr += data; });
child.on('error',
function(error) {
end(undefined, false, '', '');
}
);
child.on('exit',
function(code) {
end(undefined, code === 1, stdout, stderr);
}
);
I came across nodejs/node-v0.x-archive#9250 today and https://github.com/davepacheco/node-spawn-async seems to explain it in terms of the large heap or RSS size:
While workloads that make excessive use of fork(2) are hardly high-performance to begin with, the behavior of blocking the main thread for hundreds of milliseconds each time is unnecessarily pathological for otherwise reasonable workloads.
Wrapping the spawn code above in a simple timer shows:
2017-08-18T12:18:03.935Z INFO MDA: Blocking: scan: 271ms
2017-08-18T12:18:17.196Z INFO MDA: Blocking: scan: 263ms
2017-08-18T12:18:45.417Z INFO MDA: Blocking: scan: 296ms
2017-08-18T12:19:11.980Z INFO MDA: Blocking: scan: 293ms
2017-08-18T12:19:39.142Z INFO MDA: Blocking: scan: 306ms
2017-08-18T12:19:43.750Z INFO MDA: Blocking: scan: 293ms
2017-08-18T12:19:48.655Z INFO MDA: Blocking: scan: 276ms
2017-08-18T12:19:55.191Z INFO MDA: Blocking: scan: 299ms
2017-08-18T12:20:52.000Z INFO MDA: Blocking: scan: 277ms
2017-08-18T12:21:18.156Z INFO MDA: Blocking: scan: 294ms
2017-08-18T12:21:20.388Z INFO MDA: Blocking: scan: 292ms
2017-08-18T12:21:31.557Z INFO MDA: Blocking: scan: 290ms
2017-08-18T12:21:32.529Z INFO MDA: Blocking: scan: 318ms
2017-08-18T12:21:36.400Z INFO MDA: Blocking: scan: 315ms
All correlated with the event loop blocks reported above.
I understand that spawn is not meant to be called every few seconds, but I never expected average latency of 300ms?
Is there any way to improve spawn along the lines of what Dave Pacheco has done?
Or at least to document that spawn()
will block the event loop for around 300ms depending on heap size?